CN108491226A - Automatic tuning method for Spark configuration parameters based on cluster scaling - Google Patents

Automatic tuning method for Spark configuration parameters based on cluster scaling

Info

Publication number
CN108491226A
CN108491226A (application CN201810110273.XA); granted publication CN108491226B
Authority
CN
China
Prior art keywords
configuration
distributed memory
computational frame
parameter
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810110273.XA
Other languages
Chinese (zh)
Other versions
CN108491226B (en)
Inventor
鲍亮 (Bao Liang)
陈炜昭 (Chen Weizhao)
卜晓璇 (Bu Xiaoxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hegang Digital Technology Co ltd
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810110273.XA priority Critical patent/CN108491226B/en
Publication of CN108491226A publication Critical patent/CN108491226A/en
Application granted granted Critical
Publication of CN108491226B publication Critical patent/CN108491226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/71 - Version control; Configuration management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers

Abstract

The invention discloses an automatic tuning method for Spark configuration parameters based on cluster scaling. The steps are: (1) build the cluster; (2) select the configuration parameter set; (3) determine the value types and ranges of the configuration parameters; (4) scale the cluster; (5) train a random forest model; (6) screen for the best configuration; (7) verify the effect of the configuration. The invention can be applied in the technical field of mass data processing. By scaling the value ranges of the memory configuration parameters of the distributed in-memory computing framework Spark and the amount of data to be processed, the time needed to evaluate each configuration is shortened; a random forest model establishes the relationship between a configuration and its impact on the performance of the distributed in-memory computing framework Spark cluster, and the method searches for the configuration that maximizes the performance of a Spark cluster composed of multiple computers with identical hardware configurations.

Description

Automatic tuning method for Spark configuration parameters based on cluster scaling
Technical field
The invention belongs to the field of computer technology, and further relates to an automatic tuning method for Spark configuration parameters based on cluster scaling in the technical field of mass data processing. By scaling the distributed in-memory computing framework Spark cluster and training a random forest model, the invention obtains a configuration under which the Spark cluster performs better than it does under the default configuration.
Background technology
The distributed in-memory computing framework Spark is a big data parallel computing framework based on in-memory computation. Because Spark computes in memory, it improves the real-time performance of data processing in big data environments while guaranteeing high fault tolerance and high scalability, allowing users to deploy Spark on a large number of inexpensive hardware nodes to form a cluster. Spark has grown into a big data computing platform comprising numerous sub-projects and is used by many large companies, including Amazon, eBay and Yahoo!. Many organizations run Spark on clusters with thousands of nodes. Configuration parameter optimization has always been one of the research hotspots for Spark: the configuration parameters are numerous (more than 100), their influence on performance is large, and the default configuration is far from optimal. Therefore, automatic optimization of the configuration parameters of the distributed in-memory computing framework Spark is an urgent problem to be solved.
The patent application of the Shenzhen Institutes of Advanced Technology, "A data-aware automatic optimization method for Spark configuration parameters" (application number: 201611182310.5, filing date: 2016.12.20, publication number: CN106648654A), discloses a data-aware automatic optimization method for Spark configuration parameters. The method selects a Spark application, further determines the parameters that influence Spark performance in that application, and determines their value ranges; it randomly generates parameter values within those ranges, generates configuration files to configure Spark, then runs the application on its data and collects measurements; the collected run time, input data set and configuration parameter values are combined into vectors, multiple vectors form a training set, and the training set is modeled with the random forest algorithm; using the resulting performance model, a genetic algorithm searches for the optimal configuration parameters. The shortcoming of this method is that every configuration used in the training set of the random forest model must be evaluated on the actual environment to measure its impact on the performance of the Spark cluster, which wastes a large amount of time.
The patent application of the University of the Chinese Academy of Sciences, "An automatic optimization method for Spark platform performance" (application number: 201610068611.9, filing date: 2016.02.01, publication number: CN105868019A), discloses an automatic optimization method for Spark platform performance. The method creates a Spark application performance model from the execution mechanism of the Spark platform. For a given Spark application, part of the application's data is run on the Spark platform and performance data of the run is collected; the collected performance data is fed into the Spark application performance model to determine the values of the model's parameters; the performance of the Spark platform (the total execution time of the application) under different configuration parameter combinations is then computed, and the parameter combination yielding the best platform performance is obtained. The shortcoming of this method is that creating the Spark application performance model requires understanding the execution mechanism of Spark, so the model creation process is complicated and difficult.
Summary of the invention
The purpose of the present invention is to address the shortcomings of existing automatic optimization methods for the configuration parameters of the distributed in-memory computing framework Spark, namely their high time cost and complicated model creation process, by proposing an automatic tuning method for Spark configuration parameters based on cluster scaling.
The idea for realizing the purpose of the invention is as follows: cluster scaling shrinks the value ranges of the Spark memory configuration parameters and the amount of input data, which shortens the time needed to evaluate the impact of each configuration on the performance of the Spark cluster, so that a sufficient training set can be obtained in less time and a more accurate random forest model can be trained. Using the random forest model together with a best-configuration screening method, the method searches for the configuration that maximizes the performance of the Spark cluster composed of multiple computers with identical hardware configurations.
The specific steps of the present invention are as follows:
(1) Build the cluster:
Build a cluster composed of multiple computers with identical hardware configurations on which the distributed in-memory computing framework Spark is installed;
(2) Select the configuration parameter set:
From all the modifiable configuration parameters of the Spark cluster, select the configuration parameters that the optimization guidelines recommend modifying, forming the set of configuration parameters to be optimized;
(3) Determine the value types and ranges of the configuration parameters:
According to the parameter description specification, set the value type and range of each parameter in the set of configuration parameters to be optimized of the Spark cluster, extract a default value from the value range of each parameter, and form the default configuration from all the default values;
(4) Scale the cluster:
Using the Spark cluster scaling strategy, scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed;
(5) Train the random forest model:
(5a) record the start time of the search process;
(5b) treat the set of configuration parameters to be optimized as a multidimensional search space; use the uniform sampling strategy to sample the search space and obtain a set of configurations uniformly distributed in the search space, which serves as the initial search configuration parameter set;
(5c) use the configuration evaluation strategy to evaluate all configurations in the initial search configuration parameter set, obtaining a training set sorted in descending order of the configurations' impact on the performance of the Spark cluster;
(5d) take the leading configurations in the training set to form the iterative-search configuration parameter set, where m is the user-specified total number of configurations searched in each iterative search process;
(5e) input the training set into the random forest model and train the model;
(6) Screen for the best configuration:
(6a) use the uniform sampling strategy to generate a configuration parameter set and draw configurations at random from that set; evaluate each drawn configuration with the configuration evaluation strategy; if a configuration's impact on the performance of the Spark cluster exceeds that of the first configuration in the training set, create an ordered configuration parameter set and put the configuration into it, keeping the ordered set sorted in descending order of performance impact; add every evaluation result to the training set;
(6b) for each actual configuration in the iterative-search configuration parameter set, shrink the search space according to the range approximation strategy and generate a configuration parameter set with the uniform sampling strategy; input every configuration of that set into the random forest model to predict its impact on the performance of the Spark cluster, and take the configuration with the largest predicted performance impact as the predicted configuration;
(6c) use the configuration evaluation strategy to obtain the predicted configuration's actual impact on the performance of the Spark cluster, pair the predicted configuration with its performance impact, add the pair to the training set, and replace the actual configuration according to the two cases of the configuration replacement policy; if the actual configuration is not replaced, the next search round does not apply the range approximation strategy to that actual configuration;
(6d) subtract the start time of the search process from the time at which the configuration replacement completes, obtaining the elapsed time of the search process;
(6e) judge whether the elapsed time of the search process is less than the search time specified by the user; if so, go to step (6a); otherwise, go to step (6f);
(6f) extract from the training set the configuration with the largest impact on the performance of the Spark cluster as the best configuration;
(7) Verify the effect of the configuration:
(7a) use the Spark cluster restoring strategy to restore the scaled memory configuration values and the amount of data to be processed, obtaining the configuration to be verified and the actual data to be processed;
(7b) use the configuration evaluation strategy to evaluate the impact of the configuration to be verified and of the default configuration on the performance of the Spark cluster; if the configuration to be verified has a larger performance impact than the default configuration, it is taken as the automatically tuned configuration parameters of the distributed in-memory computing framework Spark.
Compared with the prior art, the present invention has the following advantages:
First, because the invention uses the Spark cluster scaling strategy to scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed, the time needed to evaluate each configuration's impact on the performance of the Spark cluster is shortened. This overcomes the prior-art problem that every configuration in the training set of the random forest model has to be evaluated on the actual environment, which wastes a large amount of time, so the invention reduces the time cost of obtaining the model training set.
Second, by inputting the training set into the random forest model and training the model, the random forest model directly models the execution mechanism of the Spark framework. This overcomes the prior-art problem that building a Spark application performance model requires understanding the execution mechanism of Spark and that the model creation process is complicated and difficult, so the invention lowers the threshold for users to optimize Spark clusters.
Description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a graph of the simulation experiment of the present invention.
Detailed description of the embodiments
The present invention is described further below in conjunction with the accompanying drawings.
With reference to Fig. 1, the specific steps of the present invention are described further.
Step 1. Build the cluster.
Build a cluster composed of multiple computers with identical hardware configurations on which the distributed in-memory computing framework Spark is installed.
Step 2. Select the configuration parameter set.
From all the modifiable configuration parameters of the Spark cluster, select the configuration parameters that the optimization guidelines recommend modifying, forming the set of configuration parameters to be optimized.
The optimization guidelines are given on the tuning page of the official Spark documentation, which describes in detail the configuration parameters that should be optimized.
Step 3. Determine the value types and ranges of the configuration parameters.
According to the parameter description specification, set the value type and range of each parameter in the set of configuration parameters to be optimized of the Spark cluster, extract a default value from the value range of each parameter, and form the default configuration from all the default values.
The parameter description specification is given on the configuration page of the official Spark documentation, which details the effect, default value and value range of each configuration parameter. An illustrative example of such a parameter set follows.
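As a concrete illustration (not part of the patent text), the set of configuration parameters to be optimized can be represented as a mapping from parameter names to value types, value ranges and defaults. The parameter names below are real Spark parameters; the value ranges are assumptions chosen only for illustration, and the actual set and ranges are taken from the official Spark documentation as described above.

```python
# Hypothetical example of a set of configuration parameters to be optimized.
# Parameter names exist in Spark; the value ranges here are illustrative assumptions.
parameter_space = {
    "spark.executor.memory":     {"type": "memory_mb", "range": (1024, 28672), "default": 1024},
    "spark.memory.fraction":     {"type": "float",     "range": (0.5, 0.9),    "default": 0.6},
    "spark.executor.cores":      {"type": "int",       "range": (1, 4),        "default": 1},
    "spark.default.parallelism": {"type": "int",       "range": (8, 200),      "default": 8},
    "spark.shuffle.file.buffer": {"type": "size_kb",   "range": (16, 128),     "default": 32},
}

# The default configuration is formed from the default value of every parameter.
default_configuration = {name: spec["default"] for name, spec in parameter_space.items()}
```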
Step 4. Scale the cluster.
Using the Spark cluster scaling strategy, scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed.
The steps of the Spark cluster scaling strategy are as follows (a code sketch is given after these steps):
In the first step, the Spark cluster scaling ratio is calculated according to the following formula:
where R denotes the Spark cluster scaling ratio, ⌊·⌋ denotes the floor (round-down) operation, log2 denotes the logarithm with base 2, and M denotes the memory size of each computer in MB.
In the second step, the value range of the memory configuration parameters after scaling is calculated according to the following formula:
where m denotes a memory configuration parameter value after scaling and ∈ denotes set membership.
In the third step, the amount of data to be processed after scaling is calculated according to the following formula:
where d denotes the amount of data to be processed after scaling and D denotes the amount of data to be processed before scaling.
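The scaling formulas themselves appear only as images in the source text. The following is a minimal Python sketch under stated assumptions: the scaling ratio R is taken as a precomputed input, and the scaled memory values and data size are obtained by inverting the restoring formulas C = (m - 300) × R + 300 and D = d × R given later in Step 7. All function names and the example numbers are illustrative, not part of the patent.

```python
# Sketch of Step 4 (scaling) and Step 7 (restoring). R (the cluster scaling ratio)
# is assumed to be precomputed from the per-machine memory size M; its defining
# formula is shown only as an image in the source and is not reproduced here.

def scale_memory_value(c_mb: float, r: float) -> float:
    """Scaled memory value m, inferred by inverting the Step 7 restore formula
    C = (m - 300) * R + 300 (values in MB)."""
    return (c_mb - 300.0) / r + 300.0

def scale_data_size(d_before: float, r: float) -> float:
    """Scaled amount of data d, inferred by inverting the Step 7 formula D = d * R."""
    return d_before / r

def restore_memory_value(m_mb: float, r: float) -> float:
    """Step 7 restore formula: C = (m - 300) * R + 300."""
    return (m_mb - 300.0) * r + 300.0

def restore_data_size(d_scaled: float, r: float) -> float:
    """Step 7 restore formula: D = d * R."""
    return d_scaled * r

# Example: scale an assumed original value range of a memory parameter with R = 8.
R = 8.0
low_mb, high_mb = 1024.0, 28672.0
scaled_range = (scale_memory_value(low_mb, R), scale_memory_value(high_mb, R))
```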
Step 5. Train the random forest model.
Record the start time of the search process.
Treat the set of configuration parameters to be optimized as a multidimensional search space; use the uniform sampling strategy to sample the search space and obtain a set of configurations uniformly distributed in the search space, which serves as the initial search configuration parameter set.
The steps of the uniform sampling strategy are as follows (a code sketch follows these steps):
In the first step, each dimension of the search space is divided into k equal parts, giving k intervals of the same size, where k is the user-specified total number of configurations in the initial search configuration parameter set.
In the second step, one floating-point number is chosen at random within each interval.
In the third step, the floating-point numbers chosen from all intervals of a dimension form a k-dimensional sequence, and the order of the numbers in the sequence is shuffled at random, giving an out-of-order k-dimensional sequence.
In the fourth step, the floating-point numbers at the same position in the out-of-order sequences of all dimensions form one sequence; each such sequence is one configuration, giving k configurations in total.
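A minimal Python sketch of this uniform sampling strategy (essentially a Latin-hypercube style sampler) is given below; the function name and argument layout are illustrative, not part of the patent.

```python
import random

def uniform_sample(bounds, k, rng=None):
    """Uniform sampling strategy sketched from the steps above.
    bounds: list of (low, high) value ranges, one per configuration parameter (dimension).
    k: number of configurations to generate.
    Returns k configurations, each a list with one value per dimension."""
    rng = rng or random.Random(0)
    columns = []
    for low, high in bounds:
        width = (high - low) / k
        # one random floating-point number from each of the k equal-size intervals
        column = [rng.uniform(low + i * width, low + (i + 1) * width) for i in range(k)]
        rng.shuffle(column)  # shuffle the order within this dimension
        columns.append(column)
    # values at the same position across all dimensions form one configuration
    return [[column[i] for column in columns] for i in range(k)]

# Example: sample 5 configurations from a 2-parameter search space.
configs = uniform_sample([(1024.0, 28672.0), (0.5, 0.9)], k=5)
```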
Using the configuration evaluation strategy, all configurations in the initial search configuration parameter set are evaluated, yielding a training set sorted in descending order of the configurations' impact on the performance of the Spark cluster.
The configuration evaluation strategy is as follows: run the Spark cluster with the configuration to be evaluated, analyze the data to be processed with the analysis method specified by the user, and record the time the analysis takes; the reciprocal of that time is taken as the configuration's impact on the performance of the Spark cluster, and the configuration is paired with this performance impact to form one training sample. The analysis method specified by the user is any data processing method chosen from the fields of statistical analysis, machine learning or web search.
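A minimal sketch of this evaluation strategy is shown below, assuming the analysis job is launched through spark-submit with per-run --conf overrides; the job jar, its arguments and the function name are placeholders, not specified by the patent.

```python
import subprocess
import time

def evaluate_configuration(conf, app_jar, app_args):
    """Run the user-specified analysis job under one candidate configuration and
    return the reciprocal of its wall-clock run time, used here as the
    configuration's impact on Spark cluster performance.
    conf: dict mapping Spark parameter names to values, e.g. {"spark.executor.memory": "2g"}."""
    cmd = ["spark-submit"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]   # apply the candidate configuration
    cmd += [app_jar] + list(app_args)         # e.g. a WordCount/PageRank/LogisticRegression job
    start = time.time()
    subprocess.run(cmd, check=True)           # blocks until the analysis finishes
    elapsed = time.time() - start
    return 1.0 / elapsed                      # larger value means better (shorter) run time
```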
The leading configurations in the training set are taken to form the iterative-search configuration parameter set, where m is the user-specified total number of configurations searched in each iterative search process.
The training set is input into the random forest model and the model is trained.
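A minimal sketch of this training step is given below, using scikit-learn's RandomForestRegressor as one possible implementation (the patent does not name a specific library); the toy training data are placeholders.

```python
from sklearn.ensemble import RandomForestRegressor

# Toy training set: each sample is a configuration vector (one value per tuned
# parameter) paired with its measured performance impact (1 / run time in seconds).
training_set = [
    ([1024.0, 0.6, 8.0],  1 / 95.0),
    ([2048.0, 0.5, 32.0], 1 / 80.0),
    ([4096.0, 0.7, 16.0], 1 / 70.0),
]
X = [config for config, _ in training_set]
y = [impact for _, impact in training_set]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# The fitted model predicts the performance impact of unseen configurations,
# which step (6b) uses to rank candidates without running Spark.
predicted_impact = model.predict([[3072.0, 0.65, 24.0]])
```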
Step 6. Screen for the best configuration.
Using the uniform sampling strategy, generate a configuration parameter set and draw configurations at random from that set; evaluate each drawn configuration with the configuration evaluation strategy. If a configuration's impact on the performance of the Spark cluster exceeds that of the first configuration in the training set, create an ordered configuration parameter set and put the configuration into it, keeping the ordered set sorted in descending order of performance impact. Every evaluation result is added to the training set.
The uniform sampling strategy and the configuration evaluation strategy used here are the same as those described in Step 5.
For each actual configuration in the iterative-search configuration parameter set, the search space is shrunk according to the range approximation strategy and a configuration parameter set is generated with the uniform sampling strategy; every configuration in that set is input into the random forest model to predict its impact on the performance of the Spark cluster, and the configuration with the largest predicted performance impact is taken as the predicted configuration.
The steps of the range approximation strategy are as follows (a code sketch follows these steps):
In the first step, for each dimension, among the values of all configurations in the training set of the search space, the value that is greater than the value of the configuration being processed and closest to it is extracted as the upper boundary, and the value that is less than the value of the configuration being processed and closest to it is extracted as the lower boundary.
In the second step, the upper and lower boundaries of each dimension are used as that dimension's value range, and the value ranges of all dimensions form the shrunken search space.
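A minimal Python sketch of this range approximation strategy is given below; how to handle a dimension with no training-set value above or below the processed configuration is an assumption, since the patent does not specify that case.

```python
def shrink_search_space(pending_config, training_configs):
    """Range approximation strategy: per dimension, bound the search space by the
    training-set values closest to the processed configuration from above and below.
    pending_config: list of floats (the configuration being processed).
    training_configs: list of configuration vectors already in the training set.
    Returns a list of (lower, upper) value ranges, one per dimension."""
    bounds = []
    for dim, value in enumerate(pending_config):
        above = [c[dim] for c in training_configs if c[dim] > value]
        below = [c[dim] for c in training_configs if c[dim] < value]
        upper = min(above) if above else value  # assumption: fall back to the value itself
        lower = max(below) if below else value
        bounds.append((lower, upper))
    return bounds

# Example: the shrunken bounds can be fed to the uniform sampler sketched earlier.
reduced = shrink_search_space([2048.0, 0.6], [[1024.0, 0.5], [4096.0, 0.7], [3072.0, 0.65]])
```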
Using the configuration evaluation strategy, the predicted configuration's actual impact on the performance of the Spark cluster is obtained; the predicted configuration and its performance impact form a training sample, which is added to the training set, and the actual configuration is replaced according to the two cases of the configuration replacement policy. If the actual configuration is not replaced, the next search round does not apply the range approximation strategy to that actual configuration.
The two cases of the configuration replacement policy are as follows (a code sketch follows these cases):
a. If the performance impact of the predicted configuration is greater than that of the actual configuration, the actual configuration is replaced with the predicted configuration.
b. If the ordered configuration parameter set is not empty, the first configuration is extracted from the ordered configuration parameter set and replaces the actual configuration.
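A minimal sketch of the replacement policy is shown below; treating case (a) before case (b), and returning the actual configuration unchanged when neither case applies, are assumptions about details the patent leaves open.

```python
def replace_actual(actual, actual_impact, predicted, predicted_impact, ordered_set):
    """Configuration replacement policy.
    ordered_set: list of configurations kept in descending order of performance
    impact (built in step 6a). Returns (new_actual_configuration, replaced_flag)."""
    if predicted_impact > actual_impact:      # case (a)
        return predicted, True
    if ordered_set:                           # case (b): ordered set is not empty
        return ordered_set.pop(0), True
    # not replaced: the next round skips the range approximation strategy for it
    return actual, False
```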
The start time of the search process is subtracted from the time at which the configuration replacement completes, giving the elapsed time of the search process.
If the elapsed time of the search process is less than the search time specified by the user, Step 6 is executed again; otherwise, the configuration in the training set with the largest impact on the performance of the Spark cluster is extracted as the best configuration.
Step 7. Verify the effect of the configuration.
Using the Spark cluster restoring strategy, restore the scaled memory configuration values and the amount of data to be processed, obtaining the configuration to be verified and the actual data to be processed.
The steps of the Spark cluster restoring strategy are as follows:
In the first step, the restored memory configuration is calculated according to the following formula:
C = (m - 300) × R + 300
where C denotes the restored memory configuration value and m denotes the scaled memory configuration value.
In the second step, the restored amount of data to be processed is calculated according to the following formula:
D = d × R
where D denotes the restored amount of data to be processed (i.e. the amount before scaling) and d denotes the scaled amount.
Using the configuration evaluation strategy, the impact of the configuration to be verified and of the default configuration on the performance of the Spark cluster is evaluated; if the configuration to be verified has a larger performance impact than the default configuration, it is taken as the automatically tuned configuration parameters of the distributed in-memory computing framework Spark.
The effect of the present invention is further verified and explained below with a simulation experiment.
1. Simulation conditions:
The simulation environment of the present invention is a distributed in-memory computing framework Spark cluster built from 6 computers with identical hardware configurations on Alibaba Cloud, each with Spark installed. The specification of each computer in the simulation experiment is shown in Table 1.
Table 1. Computer specification list
Operating system: CentOS 6.8
Processor cores: 4
Memory: 32 GB
Hard disk: 250 GB
2. Simulation content:
Simulation experiments were carried out with three different user inputs using the cluster-scaling-based automatic tuning method for Spark configuration parameters, to verify that the performance of the Spark cluster under the configuration found by the search is better than under the default configuration. The serial number of each simulation experiment, the data to be processed specified by the user, the analysis method, the search time, the total number k of configurations in the initial search configuration parameter set, and the total number m of configurations searched in each iterative search process are shown in Table 2.
Table 2. Simulation parameter list
No.  Data to be processed  Analysis method                         Search time  k    m
1    506.9 MB              PageRank (web search)                   485 minutes  317  20
2    7.5 GB                LogisticRegression (machine learning)   360 minutes  163  20
3    76.5 GB               WordCount (statistical analysis)        320 minutes  211  20
3. Analysis of simulation results:
The simulation results of the present invention are described further with reference to Fig. 2. The abscissa in Fig. 2 is the serial number of each user input, and the ordinate is the time, in seconds, that the Spark cluster takes to analyze the data to be processed. The hatched bars in Fig. 2 represent the default configuration and the solid bars represent the optimized configuration. Fig. 2 records, for the three user inputs, the time taken by the Spark cluster to finish analyzing the data to be processed with the user-specified analysis method under the optimized configuration and under the default configuration. In Fig. 2, the solid bar under each serial number is lower than the hatched bar: under the optimized configuration found for each of the three user inputs, the time the Spark cluster needs to analyze the data is less than under the default configuration. This shows that the Spark cluster performs better under the optimized configuration than under the default configuration and verifies the effectiveness of the cluster-scaling-based automatic tuning method for Spark configuration parameters.
In conclusion a kind of Spark based on cluster scaling disclosed by the invention configures parameter automated tuning method, solve Prior art distributed computing framework Spark configuration parameter automatic optimization method time cost height and model creation process are complicated The problem of.Specific steps include:(1) cluster is built;(2) option and installment parameter sets;(3) determine configuration parameter value type and Range;(4) cluster is scaled;(5) training Random Forest model;(6) best configuration is screened;(7) allocative effect is verified.The present invention's Distributed memory Computational frame Spark colonization process is scaled, training Random Forest model and screening best configuration are this experiment Innovative point reduces the time cost for obtaining training set by scaling distributed memory Computational frame Spark clusters;Pass through instruction Practice Random Forest model and screening best configuration set, solves the problems, such as model creation process complexity, obtained better than acquiescence The lower distributed memory Computational frame Spark clustering performances of configuration are distributed rationally.Present invention could apply to mass data processing In technical field, distributed memory Computational frame Spark memory configurations parameter value ranges are scaled by pressing cluster scaling With input data amount, the distributed memory Computational frame Spark clusters for more hardware configuration same computers composition of sening as an envoy to are searched for The best configuration parameter of performance.

Claims (7)

1. An automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark based on cluster scaling, characterized in that the value ranges of the Spark memory configuration parameters and the amount of input data are scaled by cluster scaling, and the best configuration that maximizes the performance of a Spark cluster composed of multiple computers with identical hardware configurations is searched for, the specific steps comprising:
(1) Build the cluster:
build a cluster composed of multiple computers with identical hardware configurations on which the distributed in-memory computing framework Spark is installed;
(2) Select the configuration parameter set:
from all the modifiable configuration parameters of the Spark cluster, select the configuration parameters that the optimization guidelines recommend modifying, forming the set of configuration parameters to be optimized;
(3) Determine the value types and ranges of the configuration parameters:
according to the parameter description specification, set the value type and range of each parameter in the set of configuration parameters to be optimized of the Spark cluster, extract a default value from the value range of each parameter, and form the default configuration from all the default values;
(4) Scale the cluster:
using the Spark cluster scaling strategy, scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed;
(5) Train the random forest model:
(5a) record the start time of the search process;
(5b) treat the set of configuration parameters to be optimized as a multidimensional search space; use the uniform sampling strategy to sample the search space and obtain a set of configurations uniformly distributed in the search space, which serves as the initial search configuration parameter set;
(5c) use the configuration evaluation strategy to evaluate all configurations in the initial search configuration parameter set, obtaining a training set sorted in descending order of the configurations' impact on the performance of the Spark cluster;
(5d) take the leading configurations in the training set to form the iterative-search configuration parameter set, where m is the user-specified total number of configurations searched in each iterative search process;
(5e) input the training set into the random forest model and train the model;
(6) Screen for the best configuration:
(6a) use the uniform sampling strategy to generate a configuration parameter set and draw configurations at random from that set; evaluate each drawn configuration with the configuration evaluation strategy; if a configuration's impact on the performance of the Spark cluster exceeds that of the first configuration in the training set, create an ordered configuration parameter set and put the configuration into it, keeping the ordered set sorted in descending order of performance impact; add every evaluation result to the training set;
(6b) for each actual configuration in the iterative-search configuration parameter set, shrink the search space according to the range approximation strategy and generate a configuration parameter set with the uniform sampling strategy; input every configuration of that set into the random forest model to predict its impact on the performance of the Spark cluster, and take the configuration with the largest predicted performance impact as the predicted configuration;
(6c) use the configuration evaluation strategy to obtain the predicted configuration's actual impact on the performance of the Spark cluster, pair the predicted configuration with its performance impact, add the pair to the training set, and replace the actual configuration according to the two cases of the configuration replacement policy; if the actual configuration is not replaced, the next search round does not apply the range approximation strategy to that actual configuration;
(6d) subtract the start time of the search process from the time at which the configuration replacement completes, obtaining the elapsed time of the search process;
(6e) judge whether the elapsed time of the search process is less than the search time specified by the user; if so, execute step (6a); otherwise, execute step (6f);
(6f) extract from the training set the configuration with the largest impact on the performance of the Spark cluster as the best configuration;
(7) Verify the effect of the configuration:
(7a) use the Spark cluster restoring strategy to restore the scaled memory configuration values and the amount of data to be processed, obtaining the configuration to be verified and the actual data to be processed;
(7b) use the configuration evaluation strategy to evaluate the impact of the configuration to be verified and of the default configuration on the performance of the Spark cluster; if the configuration to be verified has a larger performance impact than the default configuration, it is taken as the automatically tuned configuration parameters of the distributed in-memory computing framework Spark.
2. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the Spark cluster scaling strategy described in step (4) are as follows:
in the first step, the Spark cluster scaling ratio is calculated according to the following formula:
where R denotes the Spark cluster scaling ratio, ⌊·⌋ denotes the floor (round-down) operation, log2 denotes the logarithm with base 2, and M denotes the memory size of each computer in MB;
in the second step, the value range of the memory configuration parameters after scaling is calculated according to the following formula:
where m denotes a memory configuration parameter value after scaling and ∈ denotes set membership;
in the third step, the amount of data to be processed after scaling is calculated according to the following formula:
where d denotes the amount of data to be processed after scaling and D denotes the amount of data to be processed before scaling.
3. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the uniform sampling strategy described in steps (5b), (6a) and (6b) are as follows:
in the first step, each dimension of the search space is divided into k equal parts, giving k intervals of the same size, where k is the user-specified total number of configurations in the initial search configuration parameter set;
in the second step, one floating-point number is chosen at random within each interval;
in the third step, the floating-point numbers chosen from all intervals of a dimension form a k-dimensional sequence, and the order of the numbers in the sequence is shuffled at random, giving an out-of-order k-dimensional sequence;
in the fourth step, the floating-point numbers at the same position in the out-of-order sequences of all dimensions form one sequence; each such sequence is one configuration, giving k configurations in total.
4. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the configuration evaluation strategy described in steps (5c), (6a) and (6c) is as follows: run the Spark cluster with the configuration to be evaluated, analyze the data to be processed with the analysis method specified by the user, and record the time the analysis takes; the reciprocal of that time is taken as the configuration's impact on the performance of the Spark cluster, and the configuration is paired with this performance impact, wherein the analysis method specified by the user is any data processing method chosen by the user from the fields of statistical analysis, machine learning or web search.
5. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the range approximation strategy described in steps (6b) and (6c) are as follows:
in the first step, for each dimension, among the values of all configurations in the training set of the search space, the value that is greater than the value of the configuration being processed and closest to it is extracted as the upper boundary, and the value that is less than the value of the configuration being processed and closest to it is extracted as the lower boundary;
in the second step, the upper and lower boundaries of each dimension are used as that dimension's value range, and the value ranges of all dimensions form the shrunken search space.
6. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that replacing the actual configuration according to the two cases of the configuration replacement policy described in step (6c) means:
a. if the performance impact of the predicted configuration is greater than that of the actual configuration, the actual configuration is replaced with the predicted configuration;
b. if the ordered configuration parameter set is not empty, the first configuration is extracted from the ordered configuration parameter set and replaces the actual configuration.
7. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the Spark cluster restoring strategy described in step (7a) are as follows:
in the first step, the restored memory configuration is calculated according to the following formula:
C = (m - 300) × R + 300
where C denotes the restored memory configuration value;
in the second step, the restored amount of data to be processed is calculated according to the following formula:
D = d × R
where D denotes the amount of data to be processed before scaling.
CN201810110273.XA 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling Active CN108491226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810110273.XA CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810110273.XA CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling

Publications (2)

Publication Number Publication Date
CN108491226A true CN108491226A (en) 2018-09-04
CN108491226B CN108491226B (en) 2021-03-23

Family

ID=63344582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810110273.XA Active CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling

Country Status (1)

Country Link
CN (1) CN108491226B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388565A (en) * 2018-09-27 2019-02-26 Xidian University (西安电子科技大学) Software system performance optimization method based on generative adversarial network
CN110134665A (en) * 2019-04-17 2019-08-16 北京百度网讯科技有限公司 Database self-learning optimization method and device based on traffic mirroring
CN111259933A (en) * 2020-01-09 2020-06-09 中国科学院计算技术研究所 High-dimensional feature data classification method and system based on distributed parallel decision tree
CN111629048A (en) * 2020-05-22 2020-09-04 浪潮电子信息产业股份有限公司 spark cluster optimal configuration parameter determination method, device and equipment
CN112418311A (en) * 2020-11-21 2021-02-26 安徽理工大学 Distributed random forest method for risk assessment of communication network
CN113032367A (en) * 2021-03-24 2021-06-25 安徽大学 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system
WO2022111125A1 (en) * 2020-11-27 2022-06-02 深圳先进技术研究院 Random-forest-based automatic optimization method for graphic data processing framework

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327118A (en) * 2013-07-09 2013-09-25 南京大学 Intelligent virtual machine cluster scaling method and system for web application in cloud computing
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN106844673A (en) * 2017-01-24 2017-06-13 山东亿海兰特通信科技有限公司 A kind of method and system based on the public security data acquisition intimate degree of multidimensional personnel
US20170168814A1 (en) * 2015-12-15 2017-06-15 Impetus Technologies, Inc. System and Method for Registration of a Custom Component in a Distributed Computing Pipeline
CN107360026A (en) * 2017-07-07 2017-11-17 西安电子科技大学 Distributed message performance of middle piece is predicted and modeling method
US20170364795A1 (en) * 2016-06-15 2017-12-21 Akw Analytics Inc. Petroleum analytics learning machine system with machine learning analytics applications for upstream and midstream oil and gas industry

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327118A (en) * 2013-07-09 2013-09-25 南京大学 Intelligent virtual machine cluster scaling method and system for web application in cloud computing
US20170168814A1 (en) * 2015-12-15 2017-06-15 Impetus Technologies, Inc. System and Method for Registration of a Custom Component in a Distributed Computing Pipeline
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
US20170364795A1 (en) * 2016-06-15 2017-12-21 Akw Analytics Inc. Petroleum analytics learning machine system with machine learning analytics applications for upstream and midstream oil and gas industry
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN106844673A (en) * 2017-01-24 2017-06-13 山东亿海兰特通信科技有限公司 A kind of method and system based on the public security data acquisition intimate degree of multidimensional personnel
CN107360026A (en) * 2017-07-07 2017-11-17 西安电子科技大学 Distributed message performance of middle piece is predicted and modeling method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BAO LIANG et al.: "An Orthogonal Genetic Algorithm for QoS-Aware Service Composition", Computer Journal *
HADJ AHMED BOUARARA et al.: "A Fireworks Algorithm for Modern Web Information Retrieval with Visual Results Mining", International Journal of Swarm Intelligence Research *
ZHAN JIANFENG et al.: "BigDataBench: an Open-Source Big Data System Benchmark", Chinese Journal of Computers (计算机学报) *
BAO LIANG: "Research on Web Service Composition Technology Based on Functional Programming", China Doctoral Dissertations Full-text Database, Information Science and Technology (中国博士学位论文全文数据库·信息科技辑) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388565A (en) * 2018-09-27 2019-02-26 Xidian University (西安电子科技大学) Software system performance optimization method based on generative adversarial network
CN109388565B (en) * 2018-09-27 2021-08-06 Xidian University (西安电子科技大学) Software system performance optimization method based on generative adversarial network
CN110134665A (en) * 2019-04-17 2019-08-16 北京百度网讯科技有限公司 Database self-learning optimization method and device based on traffic mirroring
CN110134665B (en) * 2019-04-17 2021-05-25 北京百度网讯科技有限公司 Database self-learning optimization method and device based on flow mirror image
CN111259933A (en) * 2020-01-09 2020-06-09 中国科学院计算技术研究所 High-dimensional feature data classification method and system based on distributed parallel decision tree
CN111259933B (en) * 2020-01-09 2023-06-13 中国科学院计算技术研究所 High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN111629048A (en) * 2020-05-22 2020-09-04 浪潮电子信息产业股份有限公司 spark cluster optimal configuration parameter determination method, device and equipment
CN111629048B (en) * 2020-05-22 2023-04-07 浪潮电子信息产业股份有限公司 spark cluster optimal configuration parameter determination method, device and equipment
CN112418311A (en) * 2020-11-21 2021-02-26 安徽理工大学 Distributed random forest method for risk assessment of communication network
WO2022111125A1 (en) * 2020-11-27 2022-06-02 深圳先进技术研究院 Random-forest-based automatic optimization method for graphic data processing framework
CN113032367A (en) * 2021-03-24 2021-06-25 安徽大学 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system

Also Published As

Publication number Publication date
CN108491226B (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN108491226A (en) Spark based on cluster scaling configures parameter automated tuning method
CN107341270B (en) Social platform-oriented user emotion influence analysis method
CN102567391B (en) Method and device for building classification forecasting mixed model
CN105589806B (en) A kind of software defect tendency Forecasting Methodology based on SMOTE+Boosting algorithms
CN108154430A (en) A kind of credit scoring construction method based on machine learning and big data technology
CN103778548B (en) Merchandise news and key word matching method, merchandise news put-on method and device
CN103744928B (en) A kind of network video classification method based on history access record
CN102591917B (en) Data processing method and system and related device
CN102279887B (en) A kind of Document Classification Method, Apparatus and system
CN110543616B (en) SMT solder paste printing volume prediction method based on industrial big data
CN102023986B (en) The method and apparatus of text classifier is built with reference to external knowledge
CN105740227B (en) A kind of genetic simulated annealing method of neologisms in solution Chinese word segmentation
CN103310003A (en) Method and system for predicting click rate of new advertisement based on click log
CN106803799B (en) Performance test method and device
CN110502361A (en) Fine granularity defect positioning method towards bug report
CN103530347A (en) Internet resource quality assessment method and system based on big data mining
CN107067182A (en) Towards the product design scheme appraisal procedure of multidimensional image
CN113536097B (en) Recommendation method and device based on automatic feature grouping
CN108537273A (en) A method of executing automatic machinery study for unbalanced sample
CN105893669A (en) Global simulation performance predication method based on data digging
CN104820724A (en) Method for obtaining prediction model of knowledge points of text-type education resources and model application method
CN111476296A (en) Sample generation method, classification model training method, identification method and corresponding devices
CN103795592B (en) Online water navy detection method and device
CN104657574A (en) Building method and device for medical diagnosis models
CN106843941A (en) Information processing method, device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230602

Address after: Building 1, Science and Technology Innovation Service Center, No. 856 Zhongshan East Road, High tech Zone, Shijiazhuang City, Hebei Province, 050035

Patentee after: Hegang Digital Technology Co.,Ltd.

Address before: No. 2 Taibai South Road, Yanta District, Xi'an, Shaanxi Province, 710071

Patentee before: XIDIAN University