CN106648654A

CN106648654A - Data sensing-based Spark configuration parameter automatic optimization method

Info

Publication number: CN106648654A
Application number: CN201611182310.5A
Authority: CN
Inventors: 罗妮; 喻之斌; 贝振东; 姜春涛; 须成忠; 熊文
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2017-05-10

Abstract

The invention belongs to the technical fields of electronic information, big data, and cloud computing, and in particular relates to a data-aware method for automatically optimizing Spark configuration parameters. The method comprises: determining in advance a Spark application and the parameters that influence Spark performance; randomly configuring those parameters to obtain a training set; building a performance model from the training set with a random forest algorithm; and searching for the optimal configuration parameters with a genetic algorithm. Without requiring the user to understand Spark's runtime mechanism, the meaning, effect, and value range of each parameter, or the characteristics of the application and its input set, the method finds the optimal configuration parameters for a specific application running in a specific cluster environment. It is simpler and faster than conventional parameter configuration methods, and because the random forest algorithm combines the strengths of machine learning and statistical reasoning, it reaches relatively high accuracy with a relatively small training set.

Description

A data-aware method for automatically optimizing Spark configuration parameters

Technical field

The invention belongs to the technical fields of electronic information, big data, and cloud computing, and in particular relates to a data-aware method for automatically optimizing Spark configuration parameters.

Background art

Spark is a general-purpose parallel framework in the style of Hadoop MapReduce, open-sourced by the UC Berkeley AMP Lab (the AMP laboratory of the University of California, Berkeley). It has developed rapidly, becoming a top-level Apache project in just five years. Because Spark keeps intermediate results in memory, it runs iterative and interactive programs 10 times faster than Hadoop, the traditional disk-based computing framework. Spark holds an important position in big data analytics: according to a survey by Typesafe, more than 500 enterprises were using Spark in 2015.

Configuration parameter optimization has long been a research focus for big data systems: there are many configuration parameters (more than 100), performance is strongly affected by them, and applications differ in their characteristics. The default configuration therefore falls far short of the best achievable performance. Spark is an emerging in-memory big data computing framework; because it computes in memory, every resource in the cluster (CPU, network bandwidth, memory) can become a bottleneck for a Spark program. Different Spark applications also have different characteristics: KMeans has good instruction locality but poor data locality, PageRank involves far more shuffling and iteration than KMeans, WordCount contains no iteration, and so on. The problem addressed by the invention is to automatically find the optimal Spark configuration parameters for a specific cluster environment, input data set, and application.

RFHOC (A Random-Forest Approach to Auto-Tuning Hadoop's Configuration) is a random-forest-based method for automatically tuning Hadoop parameters. It optimizes the configuration parameters of an application running on a given cluster and consists of three main steps:

1. Performance testing

2. Building the performance model

3. Iteratively searching for the optimal configuration

When a user runs a Hadoop application for the first time, RFHOC's workload profiler collects the Hadoop configuration parameters used at run time and the execution time of each MapReduce phase. The execution times of the different phases and the corresponding configuration parameters are then fed to the random forest algorithm to build performance prediction models. RFHOC builds separate regression models for the map and reduce phases to predict the performance of each phase. For each phase, a training set S is first produced; each row of S is a vector vj containing an execution time and the corresponding Hadoop configuration parameter values. Once the performance models are built, RFHOC uses a genetic algorithm to search for the optimal Hadoop parameters. The genetic algorithm takes the performance predicted by the random forest models, together with the corresponding configuration, as input for a global search. The sum of the map and reduce phase execution times is the total run time of the program and also serves as the fitness value of the genetic algorithm.

The prior art comprises manual parameter configuration and automatic parameter configuration. Manual configuration is too time-consuming and requires the user to understand Spark's runtime mechanism and the meaning, effect, and value range of each parameter in some depth. The user must manually increase or decrease Spark parameter values, reconfigure Spark, run the application, and look for the parameter values that give the shortest execution time. Because the optimal configuration differs across cluster environments, applications, and input data sets, manual configuration is time-consuming and tedious.

Existing automatic configuration methods suffer from low performance-model accuracy and high modeling cost. Some methods model performance with artificial neural networks (Artificial Neural Network) or support vector machines (Support Vector Machine), but reaching high accuracy (error within 10%) requires a very large training set.

Summary of the invention

In view of the above, it is necessary to provide a data-aware method for automatically optimizing Spark configuration parameters.

A data-aware method for automatically optimizing Spark configuration parameters comprises the following steps:

Collecting data. This specifically comprises: selecting Spark applications; determining the parameters that affect Spark performance for those applications; determining the value range of each parameter; randomly generating parameter values within their ranges and generating a configuration file to configure Spark; and, once Spark is configured, running the application and collecting data. The data include, but are not limited to, the Spark run time, the input data set, and the configuration parameter values;

Building a performance model. Each collected Spark run time, input data set, and set of configuration parameter values forms a row vector; multiple such vectors form the training set, which is modeled with a random forest algorithm;

Searching for the optimal configuration parameters. Using the performance model that was built, the optimal configuration parameters are searched for with a genetic algorithm.

Further, the step of searching for the optimal configuration parameters is followed by a verification step, in which Spark is configured with the optimal configuration parameters found and the application is run to verify whether the execution time is indeed the shortest.

Further, in the data collection step, parameters are randomly generated as follows: suppose the value range of parameter s is [a, b]; a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character). The other configuration parameters are generated in the same way.
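The record format described above can be illustrated with a minimal Python sketch. The parameter names are real Spark properties, but their selection and the value ranges are assumptions made for illustration; the patent does not list concrete parameters or bounds, and the helper names `generate_config` and `to_records` are hypothetical.

```python
import random

def generate_config(ranges, rng):
    """Draw one value uniformly at random from each parameter's [a, b] range."""
    config = {}
    for name, (low, high) in ranges.items():
        if isinstance(low, int) and isinstance(high, int):
            config[name] = rng.randint(low, high)   # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config

def to_records(config):
    """Format each parameter as a 'name<TAB>value' record, one per line."""
    return "\n".join(f"{name}\t{value}" for name, value in config.items())

# Assumed ranges, for illustration only.
RANGES = {
    "spark.executor.cores": (1, 8),
    "spark.default.parallelism": (8, 64),
    "spark.memory.fraction": (0.3, 0.9),
}

rng = random.Random(0)          # fixed seed so the sketch is repeatable
config = generate_config(RANGES, rng)
records = to_records(config)
print(records)
```

Each call to `generate_config` yields one random configuration; repeating it many times gives the varied configurations from which the training set is later built.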

As an improvement, modeling the training set with the random forest algorithm specifically comprises the following steps:

The random forest algorithm obtains multiple bootstrap data sets from the given training set by repeated random sampling with replacement;

A decision tree is constructed for each bootstrap data set. Construction proceeds by iteratively assigning data points to left and right subsets; each split is a search over the parameter space of the split function for the parameters that are optimal in the sense of maximum information gain;

At each leaf node, the class distribution is estimated empirically from a histogram of the class labels of the training samples that reach that leaf;

Training iterates until the user-specified maximum tree depth is reached or until further splitting can no longer yield a larger information gain;

In the random forest algorithm, execution time is the dependent variable, and the input set and configuration parameters are the independent variables. The values of ntree and mtry must also be determined: ntree is the number of decision trees built in the random forest, and mtry is the number of predictors sampled at each split node of a decision tree.
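To make the role of mtry concrete, the following sketch searches for the best split at a single tree node. It assumes variance reduction as the gain criterion, a common substitute for information gain when the target is a continuous quantity such as execution time; the patent text itself speaks only of information gain. The function names and the toy data are hypothetical.

```python
import random

def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(rows, targets, mtry, rng):
    """Try mtry randomly chosen features; return (gain, feature, threshold)."""
    candidates = rng.sample(range(len(rows[0])), mtry)
    parent = variance(targets)
    best = (0.0, None, None)
    for f in candidates:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue                      # a split must produce two subsets
            weighted = (len(left) * variance(left)
                        + len(right) * variance(right)) / len(targets)
            gain = parent - weighted          # reduction in variance
            if gain > best[0]:
                best = (gain, f, threshold)
    return best

# Toy rows: (input size in GB, executor cores); targets: execution time in seconds.
rows = [(1, 2), (1, 8), (4, 2), (4, 8)]
targets = [10.0, 11.0, 40.0, 41.0]
gain, feature, threshold = best_split(rows, targets, mtry=2, rng=random.Random(0))
print(gain, feature, threshold)
```

In this toy data the execution time depends almost entirely on input size, so the split lands on feature 0. With a smaller mtry, only a random subset of features would be considered at each node, which is what decorrelates the trees in the forest.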

As a further improvement, searching for the optimal configuration parameters with the genetic algorithm is specifically as follows:

A vector {c1, ..., cm} is set as the initial configuration parameter values and fed into the performance model, which outputs execution time t1. The initial values are then changed and fed into the performance model, which outputs execution time t2. t2 is compared with t1, and the configuration corresponding to the shorter time is kept as the best configuration so far. These steps are repeated until the configuration with the shortest execution time is found.
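The compare-and-keep loop described above can be sketched as follows. The function `predict_time` is a hypothetical stand-in for the trained performance model, and its formula, the parameter ranges, and the choice to change one parameter per step are all assumptions made for illustration.

```python
import random

def predict_time(config):
    """Hypothetical stand-in for the trained performance model."""
    cores, parallelism = config
    return 100.0 / cores + abs(parallelism - 32)

def search(ranges, iterations, rng):
    """Repeatedly perturb the best configuration and keep the faster one."""
    best = [rng.randint(low, high) for low, high in ranges]
    best_time = predict_time(best)               # t1
    for _ in range(iterations):
        candidate = list(best)
        i = rng.randrange(len(ranges))           # change one parameter value
        candidate[i] = rng.randint(*ranges[i])
        t2 = predict_time(candidate)
        if t2 < best_time:                       # keep whichever time is shorter
            best, best_time = candidate, t2
    return best, best_time

# Assumed ranges for (executor cores, parallelism).
RANGES = [(1, 8), (8, 64)]
best, best_time = search(RANGES, iterations=500, rng=random.Random(0))
print(best, best_time)
```

Because every candidate is scored by the model rather than by running Spark, each comparison is cheap, which is what makes a search over many configurations practical.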

Specifically, the random forest algorithm is as follows:

Input: training set S, inducer F, integer ntree (number of bootstrap samples)

for i = 1 to ntree {
    S' = bootstrap sample drawn from S (i.i.d. sampling with replacement)
    Ci = F(S')
}
C^{*}(x) = \underset{y \in Y}{\arg} \sum_{i=1}^{ntree} C_i(x) / ntree

Output: the aggregated predictor C*.
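A runnable Python sketch of this bootstrap-and-aggregate loop is given below. It aggregates by averaging the ntree individual predictions, which suits a numeric target such as execution time. The inducer shown is a one-dimensional nearest-neighbour learner chosen only to keep the example short; it is an assumption and not the decision-tree inducer used by the invention.

```python
import random

def nearest_neighbor_inducer(sample):
    """Hypothetical inducer F: predicts the target of the closest training point."""
    def model(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return model

def bagging(S, F, ntree, rng):
    """Train ntree models on bootstrap samples of S and average their outputs."""
    models = []
    for _ in range(ntree):
        S_prime = [rng.choice(S) for _ in S]    # sample with replacement
        models.append(F(S_prime))               # C_i = F(S')
    def C_star(x):
        return sum(C(x) for C in models) / ntree
    return C_star

# Toy training set: (parallelism, execution time in seconds).
S = [(8, 40.0), (16, 25.0), (32, 15.0), (64, 22.0)]
C_star = bagging(S, nearest_neighbor_inducer, ntree=25, rng=random.Random(0))
print(C_star(30))
```

Averaging over many bootstrap models smooths out the sensitivity of any single model to the particular sample it was trained on.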

The invention provides a data-aware method for automatically optimizing Spark configuration parameters: the Spark application and the parameters that affect Spark performance are determined in advance, the parameters are configured randomly to obtain a training set, a performance model is built from the training set with a random forest algorithm, and the optimal configuration parameters are found with a genetic algorithm. Without requiring the user to understand Spark's runtime mechanism, the meaning, effect, and value range of the parameters, or the characteristics of the application and its input set, the invention finds the optimal configuration parameters for a specific application running in a specific cluster environment, more simply and quickly than previous parameter configuration methods. The random forest algorithm used by the invention combines the strengths of machine learning and statistical inference and reaches relatively high accuracy with a smaller training set.

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall flow of the data-aware method for automatically optimizing Spark configuration parameters according to the invention;

Fig. 2 is a schematic diagram of the genetic algorithm used in the data-aware method for automatically optimizing Spark configuration parameters according to the invention.

Detailed description of the embodiments

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

As shown in Figs. 1 and 2, a data-aware method for automatically optimizing Spark configuration parameters comprises the following three main steps:

1) Collecting data. Data collection comprises four sub-steps:

(1) identify, among all Spark parameters, those that affect performance;

(2) determine the value range of each parameter;

(3) select the input sets for the application;

(4) randomly vary the parameters within the determined ranges, configure Spark, and run the application on different input data sets; the collected data serve as the training set;

In the data collection (Collecting) stage, the four sub-steps above can be carried out as follows. Select the Spark applications for the experiment; a common choice is the HiBench benchmark suite, which includes graph computation, machine learning, and non-iterative applications, from which representative programs such as KMeans, Bayesian, PageRank, WordCount, and TeraSort are chosen. Determine the parameters that affect Spark performance for these applications and the value range of each parameter. Randomly generate parameter values within their ranges and generate a configuration file to configure Spark. Several input sets are selected for each application. Conf Generator is a configuration parameter generator; it produces the configuration file, which contains the randomly generated parameters. Once Spark is configured, run the application and collect data. The data include, but are not limited to, the Spark run time, the input data set, and the configuration parameter values.
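The collection stage can be pictured as a loop over input sets and random configurations that records one training row per run. In the sketch below, `fake_run` is a hypothetical stand-in for submitting a Spark job and timing it; its formula, the parameter ranges, and the row layout `[execution_time, input_size, *parameter_values]` are assumptions made for illustration.

```python
import random

def collect(input_sizes, ranges, runs_per_input, run_fn, rng):
    """Return (parameter names, rows); each row is [time, input size, *values]."""
    names = sorted(ranges)
    rows = []
    for size in input_sizes:
        for _ in range(runs_per_input):
            # One random configuration per run, drawn within the given ranges.
            config = {n: rng.randint(*ranges[n]) for n in names}
            elapsed = run_fn(size, config)
            rows.append([elapsed, size] + [config[n] for n in names])
    return names, rows

def fake_run(input_size, config):
    """Hypothetical stand-in for running a Spark application and timing it."""
    return input_size * 10.0 / config["spark.executor.cores"]

# Assumed parameter ranges and input sizes (in GB).
ranges = {"spark.executor.cores": (1, 8), "spark.default.parallelism": (8, 64)}
names, rows = collect([1, 4, 16], ranges, runs_per_input=5,
                      run_fn=fake_run, rng=random.Random(0))
print(names)
print(len(rows), rows[0])
```

Recording the input size in every row is what makes the resulting model data-aware: the same configuration can be scored differently for different input sets.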

Specifically, in the data collection step, parameters are randomly generated as follows: suppose the value range of parameter s is [a, b]; a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character). The other configuration parameters are generated in the same way.

2) Building a performance model. Each collected Spark run time, input data set, and set of configuration parameter values forms a row vector; multiple such vectors form the training set, which is modeled with a random forest algorithm.

Specifically, modeling the training set with the random forest algorithm proceeds as follows. Multiple bootstrap data sets are obtained from the given training set by repeated random sampling with replacement. A decision tree is constructed for each bootstrap data set; construction proceeds by iteratively assigning data points to left and right subsets, and each split is a search over the parameter space of the split function for the parameters that are optimal in the sense of maximum information gain. At each leaf node, the class distribution is estimated empirically from a histogram of the class labels of the training samples that reach that leaf. Training iterates until the user-specified maximum tree depth is reached or until further splitting can no longer yield a larger information gain. In the random forest algorithm, execution time is the dependent variable, and the input set and configuration parameters are the independent variables. The values of ntree and mtry must also be determined: ntree is the number of decision trees built in the random forest, and mtry is the number of predictors sampled at each split node of a decision tree.
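The information gain criterion mentioned above is commonly written as follows. The notation is an assumption introduced here for illustration and does not come from the patent: S is the set of training samples at a node, θ the parameters of the split function, S_L(θ) and S_R(θ) the resulting left and right subsets, and p_c the fraction of samples in a set that carry class label c.

```latex
I(S, \theta) = H(S) - \sum_{k \in \{L, R\}} \frac{|S_k(\theta)|}{|S|}\, H\bigl(S_k(\theta)\bigr),
\qquad
H(S) = -\sum_{c} p_c \log p_c,
\qquad
\theta^{*} = \arg\max_{\theta} I(S, \theta)
```

Splitting stops when no θ gives a positive gain or when the maximum tree depth is reached. When the target is a continuous value such as execution time, the entropy H is typically replaced by the variance of the target within each subset.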

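The random forest algorithm is as follows:

Input: training set S, inducer F, integer ntree (number of bootstrap samples)

for i = 1 to ntree {
    S' = bootstrap sample drawn from S (i.i.d. sampling with replacement)
    Ci = F(S')
}
C^{*}(x) = \underset{y \in Y}{\arg} \sum_{i=1}^{ntree} C_i(x) / ntree

Output: the aggregated predictor C*.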

The invention builds the model with random forest, an ensemble algorithm in machine learning. Compared with traditional statistical learning methods, machine learning has the advantage of being able to organize and fit parameters and to handle larger data sets. Compared with other machine learning algorithms, random forest mitigates overfitting and copes well with a large number of features (the high-dimensional case).

3) Searching for the optimal configuration parameters. Using the performance model that was built, the optimal configuration parameters are searched for with a genetic algorithm. Specifically, a vector {c1, ..., cm} is set as the initial configuration parameter values and fed into the performance model, which outputs execution time t1. The initial values are then changed and fed into the performance model, which outputs execution time t2. t2 is compared with t1, and the configuration corresponding to the shorter time is kept as the best configuration so far. These steps are repeated until the configuration with the shortest execution time is found.

Compared with other optimization algorithms such as exhaustive search, greedy methods, simulated annealing, and ant colony algorithms, the genetic algorithm has good global search capability: it can quickly explore the whole solution space without falling into the trap of descending rapidly to a local optimum. It searches from a population, has inherent parallelism, and can compare multiple individuals at once. The search process is simple and is guided by an evaluation function. It iterates using probabilistic mechanisms and is therefore stochastic.
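The population-based, probabilistic search described here can be sketched as a small genetic algorithm. The sketch assumes truncation selection, uniform crossover, and random-reset mutation, none of which the patent specifies; `predict_time` is again a hypothetical stand-in for the trained performance model and serves as the fitness function.

```python
import random

def predict_time(config):
    """Hypothetical stand-in for the trained performance model (lower is better)."""
    cores, parallelism = config
    return 120.0 / cores + 0.5 * abs(parallelism - 48)

def genetic_search(ranges, pop_size, generations, mutation_rate, rng):
    """Evolve a population of configurations toward shorter predicted time."""
    population = [[rng.randint(lo, hi) for lo, hi in ranges]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=predict_time)
        parents = population[: pop_size // 2]       # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(genes) for genes in zip(a, b)]  # uniform crossover
            for i, (lo, hi) in enumerate(ranges):
                if rng.random() < mutation_rate:                # random-reset mutation
                    child[i] = rng.randint(lo, hi)
            children.append(child)
        population = parents + children
    return min(population, key=predict_time)

# Assumed ranges for (executor cores, parallelism).
RANGES = [(1, 8), (8, 64)]
best = genetic_search(RANGES, pop_size=20, generations=30,
                      mutation_rate=0.2, rng=random.Random(1))
print(best, predict_time(best))
```

Keeping the fitter half of each generation unchanged means the best configuration found so far is never lost, while crossover and mutation keep exploring other regions of the parameter space.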

Finally, in a preferred embodiment, the step of searching for the optimal configuration parameters is followed by a verification step, in which Spark is configured with the optimal configuration parameters found and the application is run to verify whether the execution time is indeed the shortest.

The invention provides a data-aware method for automatically optimizing Spark configuration parameters: the Spark application and the parameters that affect Spark performance are determined in advance, the parameters are configured randomly to obtain a training set, a performance model is built from the training set with a random forest algorithm, and the optimal configuration parameters are found with a genetic algorithm. Without requiring the user to understand Spark's runtime mechanism, the meaning, effect, and value range of the parameters, or the characteristics of the application and its input set, the invention finds the optimal configuration parameters for a specific application running in a specific cluster environment, more simply and quickly than previous parameter configuration methods. The random forest algorithm used by the invention combines the strengths of machine learning and statistical inference and reaches relatively high accuracy with a smaller training set. The invention can find the optimal configuration parameters for any input data set; this reflects practical use, where the input set changes arbitrarily from one run of an application to the next.

The above are only preferred embodiments of the invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A data-aware method for automatically optimizing Spark configuration parameters, characterized by comprising the following steps:

Collecting data, which specifically comprises: selecting Spark applications; determining the parameters that affect Spark performance for those applications; determining the value range of each parameter; randomly generating parameter values within their ranges and generating a configuration file to configure Spark; and, once Spark is configured, running the application and collecting data; the data include, but are not limited to, the Spark run time, the input data set, and the configuration parameter values;

Building a performance model, wherein each collected Spark run time, input data set, and set of configuration parameter values forms a row vector, multiple such vectors form the training set, and the training set is modeled with a random forest algorithm;

2. The data-aware method for automatically optimizing Spark configuration parameters according to claim 1, characterized in that the step of searching for the optimal configuration parameters is followed by a verification step, in which Spark is configured with the optimal configuration parameters found and the application is run to verify whether the execution time is the shortest.

3. The data-aware method for automatically optimizing Spark configuration parameters according to claim 2, characterized in that, in the data collection step, parameters are randomly generated as follows: suppose the value range of parameter s is [a, b]; a value c with a ≤ c ≤ b is drawn uniformly at random from that range, and a record "s\tc" is produced (\t is a tab character); the other configuration parameters are generated in the same way.

4. The data-aware method for automatically optimizing Spark configuration parameters according to claim 3, characterized in that modeling the training set with the random forest algorithm specifically comprises the following steps:

The random forest algorithm obtains multiple bootstrap data sets from the given training set by repeated random sampling with replacement;

A decision tree is constructed for each bootstrap data set; construction proceeds by iteratively assigning data points to left and right subsets, and each split is a search over the parameter space of the split function for the parameters that are optimal in the sense of maximum information gain;

At each leaf node, the class distribution is estimated empirically from a histogram of the class labels of the training samples that reach that leaf; training iterates until the user-specified maximum tree depth is reached or until further splitting can no longer yield a larger information gain;

In the random forest algorithm, execution time is the dependent variable, and the input set and configuration parameters are the independent variables; the values of ntree and mtry must also be determined, where ntree is the number of decision trees built in the random forest and mtry is the number of predictors sampled at each split node of a decision tree.

5. The data-aware method for automatically optimizing Spark configuration parameters according to claim 4, characterized in that searching for the optimal configuration parameters with the genetic algorithm is specifically as follows:

A vector {c1, ..., cm} is set as the initial configuration parameter values and fed into the performance model, which outputs execution time t1; the initial values are then changed and fed into the performance model, which outputs execution time t2; t2 is compared with t1, and the configuration corresponding to the shorter time is kept as the best configuration; these steps are repeated until the configuration with the shortest execution time is found.

6. The data-aware method for automatically optimizing Spark configuration parameters according to claim 4, characterized in that the random forest algorithm is as follows:

for i = 1 to ntree {
    S' = bootstrap sample drawn from S (i.i.d. sampling with replacement)
    Ci = F(S')
}
C^{*}(x) = \underset{y \in Y}{\arg} \sum_{i=1}^{ntree} C_i(x) / ntree

Output: the aggregated predictor C*.