CN108491226A - Automatic tuning method for Spark configuration parameters based on cluster scaling - Google Patents

Automatic tuning method for Spark configuration parameters based on cluster scaling

Info

Publication number
CN108491226A
CN108491226A (application CN201810110273.XA); granted publication CN108491226B
Authority
CN
China
Prior art keywords
configuration
distributed memory
computational frame
parameter
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810110273.XA
Other languages
Chinese (zh)
Other versions
CN108491226B (en)
Inventor
鲍亮 (Bao Liang)
陈炜昭 (Chen Weizhao)
卜晓璇 (Bu Xiaoxuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hegang Digital Technology Co ltd
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810110273.XA priority Critical patent/CN108491226B/en
Publication of CN108491226A publication Critical patent/CN108491226A/en
Application granted granted Critical
Publication of CN108491226B publication Critical patent/CN108491226B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/70 - Software maintenance or management
    • G06F 8/71 - Version control; Configuration management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/243 - Classification techniques relating to the number of classes
    • G06F 18/24323 - Tree-organised classifiers

Abstract

The invention discloses an automatic tuning method for Spark configuration parameters based on cluster scaling. The steps are: (1) build the cluster; (2) select the configuration parameter set; (3) determine the value types and ranges of the configuration parameters; (4) scale the cluster; (5) train a random forest model; (6) screen for the best configuration; (7) verify the effect of the configuration. The invention can be applied in the technical field of mass data processing. By scaling the value ranges of the memory configuration parameters of the distributed in-memory computing framework Spark and the amount of data to be processed, the time needed to evaluate each configuration is shortened; a random forest model establishes the relationship between a configuration and its impact on the performance of the distributed in-memory computing framework Spark cluster, and the method searches for the configuration that maximizes the performance of a Spark cluster composed of multiple computers with identical hardware configurations.

Description

Automatic tuning method for Spark configuration parameters based on cluster scaling
Technical field
The invention belongs to the field of computer technology, and further relates to an automatic tuning method for Spark configuration parameters based on cluster scaling in the technical field of mass data processing. By scaling the distributed in-memory computing framework Spark cluster and training a random forest model, the invention obtains a configuration under which the Spark cluster performs better than it does under the default configuration.
Background technology
The distributed in-memory computing framework Spark is a big data parallel computing framework based on in-memory computation. Because Spark computes in memory, it improves the real-time performance of data processing in big data environments while guaranteeing high fault tolerance and high scalability, allowing users to deploy Spark on a large number of inexpensive hardware nodes to form a cluster. Spark has grown into a big data computing platform comprising numerous sub-projects and is used by many large companies, including Amazon, eBay and Yahoo!. Many organizations run Spark on clusters with thousands of nodes. Configuration parameter optimization has always been one of the research hotspots for Spark: the configuration parameters are numerous (more than 100), their influence on performance is large, and the default configuration is far from optimal. Therefore, automatic optimization of the configuration parameters of the distributed in-memory computing framework Spark is an urgent problem to be solved.
The patent application of the Shenzhen Institutes of Advanced Technology, "A data-aware automatic optimization method for Spark configuration parameters" (application number: 201611182310.5, filing date: 2016.12.20, publication number: CN106648654A), discloses a data-aware automatic optimization method for Spark configuration parameters. The method selects a Spark application, further determines the parameters that influence Spark performance in that application, and determines their value ranges; it randomly generates parameter values within those ranges, generates configuration files to configure Spark, then runs the application on its data and collects measurements; the collected run time, input data set and configuration parameter values are combined into vectors, multiple vectors form a training set, and the training set is modeled with the random forest algorithm; using the resulting performance model, a genetic algorithm searches for the optimal configuration parameters. The shortcoming of this method is that every configuration used in the training set of the random forest model must be evaluated on the actual environment to measure its impact on the performance of the Spark cluster, which wastes a large amount of time.
The patent application of the University of the Chinese Academy of Sciences, "An automatic optimization method for Spark platform performance" (application number: 201610068611.9, filing date: 2016.02.01, publication number: CN105868019A), discloses an automatic optimization method for Spark platform performance. The method creates a Spark application performance model from the execution mechanism of the Spark platform. For a given Spark application, part of the application's data is run on the Spark platform and performance data of the run is collected; the collected performance data is fed into the Spark application performance model to determine the values of the model's parameters; the performance of the Spark platform (the total execution time of the application) under different configuration parameter combinations is then computed, and the parameter combination yielding the best platform performance is obtained. The shortcoming of this method is that creating the Spark application performance model requires understanding the execution mechanism of Spark, so the model creation process is complicated and difficult.
Summary of the invention
The purpose of the present invention is to address the shortcomings of existing automatic optimization methods for the configuration parameters of the distributed in-memory computing framework Spark, namely their high time cost and complicated model creation process, by proposing an automatic tuning method for Spark configuration parameters based on cluster scaling.
The idea for realizing the purpose of the invention is as follows: cluster scaling shrinks the value ranges of the Spark memory configuration parameters and the amount of input data, which shortens the time needed to evaluate the impact of each configuration on the performance of the Spark cluster, so that a sufficient training set can be obtained in less time and a more accurate random forest model can be trained. Using the random forest model together with a best-configuration screening method, the method searches for the configuration that maximizes the performance of the Spark cluster composed of multiple computers with identical hardware configurations.
The specific steps of the present invention are as follows:
(1) Build the cluster:
Build a cluster composed of multiple computers with identical hardware configurations on which the distributed in-memory computing framework Spark is installed;
(2) Select the configuration parameter set:
From all the modifiable configuration parameters of the Spark cluster, select the configuration parameters that the optimization guidelines recommend modifying, forming the set of configuration parameters to be optimized;
(3) Determine the value types and ranges of the configuration parameters:
According to the parameter description specification, set the value type and range of each parameter in the set of configuration parameters to be optimized of the Spark cluster, extract a default value from the value range of each parameter, and form the default configuration from all the default values;
(4) Scale the cluster:
Using the Spark cluster scaling strategy, scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed;
(5) Train the random forest model:
(5a) record the start time of the search process;
(5b) treat the set of configuration parameters to be optimized as a multidimensional search space; use the uniform sampling strategy to sample the search space and obtain a set of configurations uniformly distributed in the search space, which serves as the initial search configuration parameter set;
(5c) use the configuration evaluation strategy to evaluate all configurations in the initial search configuration parameter set, obtaining a training set sorted in descending order of the configurations' impact on the performance of the Spark cluster;
(5d) take the leading configurations in the training set to form the iterative-search configuration parameter set, where m is the user-specified total number of configurations searched in each iterative search process;
(5e) input the training set into the random forest model and train the model;
(6) Screen for the best configuration:
(6a) use the uniform sampling strategy to generate a configuration parameter set and draw configurations at random from that set; evaluate each drawn configuration with the configuration evaluation strategy; if a configuration's impact on the performance of the Spark cluster exceeds that of the first configuration in the training set, create an ordered configuration parameter set and put the configuration into it, keeping the ordered set sorted in descending order of performance impact; add every evaluation result to the training set;
(6b) for each actual configuration in the iterative-search configuration parameter set, shrink the search space according to the range approximation strategy and generate a configuration parameter set with the uniform sampling strategy; input every configuration of that set into the random forest model to predict its impact on the performance of the Spark cluster, and take the configuration with the largest predicted performance impact as the predicted configuration;
(6c) use the configuration evaluation strategy to obtain the predicted configuration's actual impact on the performance of the Spark cluster, pair the predicted configuration with its performance impact, add the pair to the training set, and replace the actual configuration according to the two cases of the configuration replacement policy; if the actual configuration is not replaced, the next search round does not apply the range approximation strategy to that actual configuration;
(6d) subtract the start time of the search process from the time at which the configuration replacement completes, obtaining the elapsed time of the search process;
(6e) judge whether the elapsed time of the search process is less than the search time specified by the user; if so, go to step (6a); otherwise, go to step (6f);
(6f) extract from the training set the configuration with the largest impact on the performance of the Spark cluster as the best configuration;
(7) Verify the effect of the configuration:
(7a) use the Spark cluster restoring strategy to restore the scaled memory configuration values and the amount of data to be processed, obtaining the configuration to be verified and the actual data to be processed;
(7b) use the configuration evaluation strategy to evaluate the impact of the configuration to be verified and of the default configuration on the performance of the Spark cluster; if the configuration to be verified has a larger performance impact than the default configuration, it is taken as the automatically tuned configuration parameters of the distributed in-memory computing framework Spark.
Compared with the prior art, the present invention has the following advantages:
First, because the invention uses the Spark cluster scaling strategy to scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed, the time needed to evaluate each configuration's impact on the performance of the Spark cluster is shortened. This overcomes the prior-art problem that every configuration in the training set of the random forest model has to be evaluated on the actual environment, which wastes a large amount of time, so the invention reduces the time cost of obtaining the model training set.
Second, by inputting the training set into the random forest model and training the model, the random forest model directly models the execution mechanism of the Spark framework. This overcomes the prior-art problem that building a Spark application performance model requires understanding the execution mechanism of Spark and that the model creation process is complicated and difficult, so the invention lowers the threshold for users to optimize Spark clusters.
Description of the drawings
Fig. 1 is a flow chart of the present invention;
Fig. 2 is a graph of the simulation experiment of the present invention.
Detailed description of the embodiments
The present invention is described further below in conjunction with the accompanying drawings.
With reference to Fig. 1, the specific steps of the present invention are described further.
Step 1. Build the cluster.
Build a cluster composed of multiple computers with identical hardware configurations on which the distributed in-memory computing framework Spark is installed.
Step 2. Select the configuration parameter set.
From all the modifiable configuration parameters of the Spark cluster, select the configuration parameters that the optimization guidelines recommend modifying, forming the set of configuration parameters to be optimized.
The optimization guidelines are given on the tuning page of the official Spark documentation, which describes in detail the configuration parameters that should be optimized.
Step 3. Determine the value types and ranges of the configuration parameters.
According to the parameter description specification, set the value type and range of each parameter in the set of configuration parameters to be optimized of the Spark cluster, extract a default value from the value range of each parameter, and form the default configuration from all the default values.
The parameter description specification is given on the configuration page of the official Spark documentation, which details the effect, default value and value range of each configuration parameter. An illustrative example of such a parameter set follows.
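As a concrete illustration (not part of the patent text), the set of configuration parameters to be optimized can be represented as a mapping from parameter names to value types, value ranges and defaults. The parameter names below are real Spark parameters; the value ranges are assumptions chosen only for illustration, and the actual set and ranges are taken from the official Spark documentation as described above.

```python
# Hypothetical example of a set of configuration parameters to be optimized.
# Parameter names exist in Spark; the value ranges here are illustrative assumptions.
parameter_space = {
    "spark.executor.memory":     {"type": "memory_mb", "range": (1024, 28672), "default": 1024},
    "spark.memory.fraction":     {"type": "float",     "range": (0.5, 0.9),    "default": 0.6},
    "spark.executor.cores":      {"type": "int",       "range": (1, 4),        "default": 1},
    "spark.default.parallelism": {"type": "int",       "range": (8, 200),      "default": 8},
    "spark.shuffle.file.buffer": {"type": "size_kb",   "range": (16, 128),     "default": 32},
}

# The default configuration is formed from the default value of every parameter.
default_configuration = {name: spec["default"] for name, spec in parameter_space.items()}
```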
Step 4. Scale the cluster.
Using the Spark cluster scaling strategy, scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed.
The steps of the Spark cluster scaling strategy are as follows (a code sketch is given after these steps):
In the first step, the Spark cluster scaling ratio is calculated according to the following formula:
where R denotes the Spark cluster scaling ratio, ⌊·⌋ denotes the floor (round-down) operation, log2 denotes the logarithm with base 2, and M denotes the memory size of each computer in MB.
In the second step, the value range of the memory configuration parameters after scaling is calculated according to the following formula:
where m denotes a memory configuration parameter value after scaling and ∈ denotes set membership.
In the third step, the amount of data to be processed after scaling is calculated according to the following formula:
where d denotes the amount of data to be processed after scaling and D denotes the amount of data to be processed before scaling.
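The scaling formulas themselves appear only as images in the source text. The following is a minimal Python sketch under stated assumptions: the scaling ratio R is taken as a precomputed input, and the scaled memory values and data size are obtained by inverting the restoring formulas C = (m - 300) × R + 300 and D = d × R given later in Step 7. All function names and the example numbers are illustrative, not part of the patent.

```python
# Sketch of Step 4 (scaling) and Step 7 (restoring). R (the cluster scaling ratio)
# is assumed to be precomputed from the per-machine memory size M; its defining
# formula is shown only as an image in the source and is not reproduced here.

def scale_memory_value(c_mb: float, r: float) -> float:
    """Scaled memory value m, inferred by inverting the Step 7 restore formula
    C = (m - 300) * R + 300 (values in MB)."""
    return (c_mb - 300.0) / r + 300.0

def scale_data_size(d_before: float, r: float) -> float:
    """Scaled amount of data d, inferred by inverting the Step 7 formula D = d * R."""
    return d_before / r

def restore_memory_value(m_mb: float, r: float) -> float:
    """Step 7 restore formula: C = (m - 300) * R + 300."""
    return (m_mb - 300.0) * r + 300.0

def restore_data_size(d_scaled: float, r: float) -> float:
    """Step 7 restore formula: D = d * R."""
    return d_scaled * r

# Example: scale an assumed original value range of a memory parameter with R = 8.
R = 8.0
low_mb, high_mb = 1024.0, 28672.0
scaled_range = (scale_memory_value(low_mb, R), scale_memory_value(high_mb, R))
```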
Step 5. Train the random forest model.
Record the start time of the search process.
Treat the set of configuration parameters to be optimized as a multidimensional search space; use the uniform sampling strategy to sample the search space and obtain a set of configurations uniformly distributed in the search space, which serves as the initial search configuration parameter set.
The steps of the uniform sampling strategy are as follows (a code sketch follows these steps):
In the first step, each dimension of the search space is divided into k equal parts, giving k intervals of the same size, where k is the user-specified total number of configurations in the initial search configuration parameter set.
In the second step, one floating-point number is chosen at random within each interval.
In the third step, the floating-point numbers chosen from all intervals of a dimension form a k-dimensional sequence, and the order of the numbers in the sequence is shuffled at random, giving an out-of-order k-dimensional sequence.
In the fourth step, the floating-point numbers at the same position in the out-of-order sequences of all dimensions form one sequence; each such sequence is one configuration, giving k configurations in total.
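A minimal Python sketch of this uniform sampling strategy (essentially a Latin-hypercube style sampler) is given below; the function name and argument layout are illustrative, not part of the patent.

```python
import random

def uniform_sample(bounds, k, rng=None):
    """Uniform sampling strategy sketched from the steps above.
    bounds: list of (low, high) value ranges, one per configuration parameter (dimension).
    k: number of configurations to generate.
    Returns k configurations, each a list with one value per dimension."""
    rng = rng or random.Random(0)
    columns = []
    for low, high in bounds:
        width = (high - low) / k
        # one random floating-point number from each of the k equal-size intervals
        column = [rng.uniform(low + i * width, low + (i + 1) * width) for i in range(k)]
        rng.shuffle(column)  # shuffle the order within this dimension
        columns.append(column)
    # values at the same position across all dimensions form one configuration
    return [[column[i] for column in columns] for i in range(k)]

# Example: sample 5 configurations from a 2-parameter search space.
configs = uniform_sample([(1024.0, 28672.0), (0.5, 0.9)], k=5)
```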
Using the configuration evaluation strategy, all configurations in the initial search configuration parameter set are evaluated, yielding a training set sorted in descending order of the configurations' impact on the performance of the Spark cluster.
The configuration evaluation strategy is as follows: run the Spark cluster with the configuration to be evaluated, analyze the data to be processed with the analysis method specified by the user, and record the time the analysis takes; the reciprocal of that time is taken as the configuration's impact on the performance of the Spark cluster, and the configuration is paired with this performance impact to form one training sample. The analysis method specified by the user is any data processing method chosen from the fields of statistical analysis, machine learning or web search.
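A minimal sketch of this evaluation strategy is shown below, assuming the analysis job is launched through spark-submit with per-run --conf overrides; the job jar, its arguments and the function name are placeholders, not specified by the patent.

```python
import subprocess
import time

def evaluate_configuration(conf, app_jar, app_args):
    """Run the user-specified analysis job under one candidate configuration and
    return the reciprocal of its wall-clock run time, used here as the
    configuration's impact on Spark cluster performance.
    conf: dict mapping Spark parameter names to values, e.g. {"spark.executor.memory": "2g"}."""
    cmd = ["spark-submit"]
    for key, value in conf.items():
        cmd += ["--conf", f"{key}={value}"]   # apply the candidate configuration
    cmd += [app_jar] + list(app_args)         # e.g. a WordCount/PageRank/LogisticRegression job
    start = time.time()
    subprocess.run(cmd, check=True)           # blocks until the analysis finishes
    elapsed = time.time() - start
    return 1.0 / elapsed                      # larger value means better (shorter) run time
```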
The leading configurations in the training set are taken to form the iterative-search configuration parameter set, where m is the user-specified total number of configurations searched in each iterative search process.
The training set is input into the random forest model and the model is trained.
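A minimal sketch of this training step is given below, using scikit-learn's RandomForestRegressor as one possible implementation (the patent does not name a specific library); the toy training data are placeholders.

```python
from sklearn.ensemble import RandomForestRegressor

# Toy training set: each sample is a configuration vector (one value per tuned
# parameter) paired with its measured performance impact (1 / run time in seconds).
training_set = [
    ([1024.0, 0.6, 8.0],  1 / 95.0),
    ([2048.0, 0.5, 32.0], 1 / 80.0),
    ([4096.0, 0.7, 16.0], 1 / 70.0),
]
X = [config for config, _ in training_set]
y = [impact for _, impact in training_set]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# The fitted model predicts the performance impact of unseen configurations,
# which step (6b) uses to rank candidates without running Spark.
predicted_impact = model.predict([[3072.0, 0.65, 24.0]])
```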
Step 6. Screen for the best configuration.
Using the uniform sampling strategy, generate a configuration parameter set and draw configurations at random from that set; evaluate each drawn configuration with the configuration evaluation strategy. If a configuration's impact on the performance of the Spark cluster exceeds that of the first configuration in the training set, create an ordered configuration parameter set and put the configuration into it, keeping the ordered set sorted in descending order of performance impact. Every evaluation result is added to the training set.
The uniform sampling strategy and the configuration evaluation strategy used here are the same as those described in Step 5.
For each actual configuration in the iterative-search configuration parameter set, the search space is shrunk according to the range approximation strategy and a configuration parameter set is generated with the uniform sampling strategy; every configuration in that set is input into the random forest model to predict its impact on the performance of the Spark cluster, and the configuration with the largest predicted performance impact is taken as the predicted configuration.
The steps of the range approximation strategy are as follows (a code sketch follows these steps):
In the first step, for each dimension, among the values of all configurations in the training set of the search space, the value that is greater than the value of the configuration being processed and closest to it is extracted as the upper boundary, and the value that is less than the value of the configuration being processed and closest to it is extracted as the lower boundary.
In the second step, the upper and lower boundaries of each dimension are used as that dimension's value range, and the value ranges of all dimensions form the shrunken search space.
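A minimal Python sketch of this range approximation strategy is given below; how to handle a dimension with no training-set value above or below the processed configuration is an assumption, since the patent does not specify that case.

```python
def shrink_search_space(pending_config, training_configs):
    """Range approximation strategy: per dimension, bound the search space by the
    training-set values closest to the processed configuration from above and below.
    pending_config: list of floats (the configuration being processed).
    training_configs: list of configuration vectors already in the training set.
    Returns a list of (lower, upper) value ranges, one per dimension."""
    bounds = []
    for dim, value in enumerate(pending_config):
        above = [c[dim] for c in training_configs if c[dim] > value]
        below = [c[dim] for c in training_configs if c[dim] < value]
        upper = min(above) if above else value  # assumption: fall back to the value itself
        lower = max(below) if below else value
        bounds.append((lower, upper))
    return bounds

# Example: the shrunken bounds can be fed to the uniform sampler sketched earlier.
reduced = shrink_search_space([2048.0, 0.6], [[1024.0, 0.5], [4096.0, 0.7], [3072.0, 0.65]])
```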
Using the configuration evaluation strategy, the predicted configuration's actual impact on the performance of the Spark cluster is obtained; the predicted configuration and its performance impact form a training sample, which is added to the training set, and the actual configuration is replaced according to the two cases of the configuration replacement policy. If the actual configuration is not replaced, the next search round does not apply the range approximation strategy to that actual configuration.
The two cases of the configuration replacement policy are as follows (a code sketch follows these cases):
a. If the performance impact of the predicted configuration is greater than that of the actual configuration, the actual configuration is replaced with the predicted configuration.
b. If the ordered configuration parameter set is not empty, the first configuration is extracted from the ordered configuration parameter set and replaces the actual configuration.
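A minimal sketch of the replacement policy is shown below; treating case (a) before case (b), and returning the actual configuration unchanged when neither case applies, are assumptions about details the patent leaves open.

```python
def replace_actual(actual, actual_impact, predicted, predicted_impact, ordered_set):
    """Configuration replacement policy.
    ordered_set: list of configurations kept in descending order of performance
    impact (built in step 6a). Returns (new_actual_configuration, replaced_flag)."""
    if predicted_impact > actual_impact:      # case (a)
        return predicted, True
    if ordered_set:                           # case (b): ordered set is not empty
        return ordered_set.pop(0), True
    # not replaced: the next round skips the range approximation strategy for it
    return actual, False
```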
The start time of the search process is subtracted from the time at which the configuration replacement completes, giving the elapsed time of the search process.
If the elapsed time of the search process is less than the search time specified by the user, Step 6 is executed again; otherwise, the configuration in the training set with the largest impact on the performance of the Spark cluster is extracted as the best configuration.
Step 7. Verify the effect of the configuration.
Using the Spark cluster restoring strategy, restore the scaled memory configuration values and the amount of data to be processed, obtaining the configuration to be verified and the actual data to be processed.
The steps of the Spark cluster restoring strategy are as follows:
In the first step, the restored memory configuration is calculated according to the following formula:
C = (m - 300) × R + 300
where C denotes the restored memory configuration value and m denotes the scaled memory configuration value.
In the second step, the restored amount of data to be processed is calculated according to the following formula:
D = d × R
where D denotes the restored amount of data to be processed (i.e. the amount before scaling) and d denotes the scaled amount.
Using the configuration evaluation strategy, the impact of the configuration to be verified and of the default configuration on the performance of the Spark cluster is evaluated; if the configuration to be verified has a larger performance impact than the default configuration, it is taken as the automatically tuned configuration parameters of the distributed in-memory computing framework Spark.
The effect of the present invention is further verified and explained below with a simulation experiment.
1. Simulation conditions:
The simulation environment of the present invention is a distributed in-memory computing framework Spark cluster built from 6 computers with identical hardware configurations on Alibaba Cloud, each with Spark installed. The specification of each computer in the simulation experiment is shown in Table 1.
Table 1. Computer specification list
Operating system: CentOS 6.8
Processor cores: 4
Memory: 32 GB
Hard disk: 250 GB
2. Simulation content:
Simulation experiments were carried out with three different user inputs using the cluster-scaling-based automatic tuning method for Spark configuration parameters, to verify that the performance of the Spark cluster under the configuration found by the search is better than under the default configuration. The serial number of each simulation experiment, the data to be processed specified by the user, the analysis method, the search time, the total number k of configurations in the initial search configuration parameter set, and the total number m of configurations searched in each iterative search process are shown in Table 2.
Table 2. Simulation parameter list
No.  Data to be processed  Analysis method                         Search time  k    m
1    506.9 MB              PageRank (web search)                   485 minutes  317  20
2    7.5 GB                LogisticRegression (machine learning)   360 minutes  163  20
3    76.5 GB               WordCount (statistical analysis)        320 minutes  211  20
3. Analysis of simulation results:
The simulation results of the present invention are described further with reference to Fig. 2. The abscissa in Fig. 2 is the serial number of each user input, and the ordinate is the time, in seconds, that the Spark cluster takes to analyze the data to be processed. The hatched bars in Fig. 2 represent the default configuration and the solid bars represent the optimized configuration. Fig. 2 records, for the three user inputs, the time taken by the Spark cluster to finish analyzing the data to be processed with the user-specified analysis method under the optimized configuration and under the default configuration. In Fig. 2, the solid bar under each serial number is lower than the hatched bar: under the optimized configuration found for each of the three user inputs, the time the Spark cluster needs to analyze the data is less than under the default configuration. This shows that the Spark cluster performs better under the optimized configuration than under the default configuration and verifies the effectiveness of the cluster-scaling-based automatic tuning method for Spark configuration parameters.
In conclusion a kind of Spark based on cluster scaling disclosed by the invention configures parameter automated tuning method, solve Prior art distributed computing framework Spark configuration parameter automatic optimization method time cost height and model creation process are complicated The problem of.Specific steps include:(1) cluster is built;(2) option and installment parameter sets;(3) determine configuration parameter value type and Range;(4) cluster is scaled;(5) training Random Forest model;(6) best configuration is screened;(7) allocative effect is verified.The present invention's Distributed memory Computational frame Spark colonization process is scaled, training Random Forest model and screening best configuration are this experiment Innovative point reduces the time cost for obtaining training set by scaling distributed memory Computational frame Spark clusters;Pass through instruction Practice Random Forest model and screening best configuration set, solves the problems, such as model creation process complexity, obtained better than acquiescence The lower distributed memory Computational frame Spark clustering performances of configuration are distributed rationally.Present invention could apply to mass data processing In technical field, distributed memory Computational frame Spark memory configurations parameter value ranges are scaled by pressing cluster scaling With input data amount, the distributed memory Computational frame Spark clusters for more hardware configuration same computers composition of sening as an envoy to are searched for The best configuration parameter of performance.

Claims (7)

1. An automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark based on cluster scaling, characterized in that the value ranges of the Spark memory configuration parameters and the amount of input data are scaled by cluster scaling, and the best configuration that maximizes the performance of a Spark cluster composed of multiple computers with identical hardware configurations is searched for, the specific steps comprising:
(1) Build the cluster:
build a cluster composed of multiple computers with identical hardware configurations on which the distributed in-memory computing framework Spark is installed;
(2) Select the configuration parameter set:
from all the modifiable configuration parameters of the Spark cluster, select the configuration parameters that the optimization guidelines recommend modifying, forming the set of configuration parameters to be optimized;
(3) Determine the value types and ranges of the configuration parameters:
according to the parameter description specification, set the value type and range of each parameter in the set of configuration parameters to be optimized of the Spark cluster, extract a default value from the value range of each parameter, and form the default configuration from all the default values;
(4) Scale the cluster:
using the Spark cluster scaling strategy, scale the value ranges of the memory configuration parameters in the set of configuration parameters to be optimized and the amount of data to be processed;
(5) Train the random forest model:
(5a) record the start time of the search process;
(5b) treat the set of configuration parameters to be optimized as a multidimensional search space; use the uniform sampling strategy to sample the search space and obtain a set of configurations uniformly distributed in the search space, which serves as the initial search configuration parameter set;
(5c) use the configuration evaluation strategy to evaluate all configurations in the initial search configuration parameter set, obtaining a training set sorted in descending order of the configurations' impact on the performance of the Spark cluster;
(5d) take the leading configurations in the training set to form the iterative-search configuration parameter set, where m is the user-specified total number of configurations searched in each iterative search process;
(5e) input the training set into the random forest model and train the model;
(6) Screen for the best configuration:
(6a) use the uniform sampling strategy to generate a configuration parameter set and draw configurations at random from that set; evaluate each drawn configuration with the configuration evaluation strategy; if a configuration's impact on the performance of the Spark cluster exceeds that of the first configuration in the training set, create an ordered configuration parameter set and put the configuration into it, keeping the ordered set sorted in descending order of performance impact; add every evaluation result to the training set;
(6b) for each actual configuration in the iterative-search configuration parameter set, shrink the search space according to the range approximation strategy and generate a configuration parameter set with the uniform sampling strategy; input every configuration of that set into the random forest model to predict its impact on the performance of the Spark cluster, and take the configuration with the largest predicted performance impact as the predicted configuration;
(6c) use the configuration evaluation strategy to obtain the predicted configuration's actual impact on the performance of the Spark cluster, pair the predicted configuration with its performance impact, add the pair to the training set, and replace the actual configuration according to the two cases of the configuration replacement policy; if the actual configuration is not replaced, the next search round does not apply the range approximation strategy to that actual configuration;
(6d) subtract the start time of the search process from the time at which the configuration replacement completes, obtaining the elapsed time of the search process;
(6e) judge whether the elapsed time of the search process is less than the search time specified by the user; if so, execute step (6a); otherwise, execute step (6f);
(6f) extract from the training set the configuration with the largest impact on the performance of the Spark cluster as the best configuration;
(7) Verify the effect of the configuration:
(7a) use the Spark cluster restoring strategy to restore the scaled memory configuration values and the amount of data to be processed, obtaining the configuration to be verified and the actual data to be processed;
(7b) use the configuration evaluation strategy to evaluate the impact of the configuration to be verified and of the default configuration on the performance of the Spark cluster; if the configuration to be verified has a larger performance impact than the default configuration, it is taken as the automatically tuned configuration parameters of the distributed in-memory computing framework Spark.
2. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the Spark cluster scaling strategy described in step (4) are as follows:
in the first step, the Spark cluster scaling ratio is calculated according to the following formula:
where R denotes the Spark cluster scaling ratio, ⌊·⌋ denotes the floor (round-down) operation, log2 denotes the logarithm with base 2, and M denotes the memory size of each computer in MB;
in the second step, the value range of the memory configuration parameters after scaling is calculated according to the following formula:
where m denotes a memory configuration parameter value after scaling and ∈ denotes set membership;
in the third step, the amount of data to be processed after scaling is calculated according to the following formula:
where d denotes the amount of data to be processed after scaling and D denotes the amount of data to be processed before scaling.
3. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the uniform sampling strategy described in steps (5b), (6a) and (6b) are as follows:
in the first step, each dimension of the search space is divided into k equal parts, giving k intervals of the same size, where k is the user-specified total number of configurations in the initial search configuration parameter set;
in the second step, one floating-point number is chosen at random within each interval;
in the third step, the floating-point numbers chosen from all intervals of a dimension form a k-dimensional sequence, and the order of the numbers in the sequence is shuffled at random, giving an out-of-order k-dimensional sequence;
in the fourth step, the floating-point numbers at the same position in the out-of-order sequences of all dimensions form one sequence; each such sequence is one configuration, giving k configurations in total.
4. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the configuration evaluation strategy described in steps (5c), (6a) and (6c) is as follows: run the Spark cluster with the configuration to be evaluated, analyze the data to be processed with the analysis method specified by the user, and record the time the analysis takes; the reciprocal of that time is taken as the configuration's impact on the performance of the Spark cluster, and the configuration is paired with this performance impact, wherein the analysis method specified by the user is any data processing method chosen by the user from the fields of statistical analysis, machine learning or web search.
5. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the range approximation strategy described in steps (6b) and (6c) are as follows:
in the first step, for each dimension, among the values of all configurations in the training set of the search space, the value that is greater than the value of the configuration being processed and closest to it is extracted as the upper boundary, and the value that is less than the value of the configuration being processed and closest to it is extracted as the lower boundary;
in the second step, the upper and lower boundaries of each dimension are used as that dimension's value range, and the value ranges of all dimensions form the shrunken search space.
6. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that replacing the actual configuration according to the two cases of the configuration replacement policy described in step (6c) means:
a. if the performance impact of the predicted configuration is greater than that of the actual configuration, the actual configuration is replaced with the predicted configuration;
b. if the ordered configuration parameter set is not empty, the first configuration is extracted from the ordered configuration parameter set and replaces the actual configuration.
7. The cluster-scaling-based automatic tuning method for configuration parameters of the distributed in-memory computing framework Spark according to claim 1, characterized in that the steps of the Spark cluster restoring strategy described in step (7a) are as follows:
in the first step, the restored memory configuration is calculated according to the following formula:
C = (m - 300) × R + 300
where C denotes the restored memory configuration value;
in the second step, the restored amount of data to be processed is calculated according to the following formula:
D = d × R
where D denotes the amount of data to be processed before scaling.
CN201810110273.XA 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling Active CN108491226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810110273.XA CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810110273.XA CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling

Publications (2)

Publication Number Publication Date
CN108491226A true CN108491226A (en) 2018-09-04
CN108491226B CN108491226B (en) 2021-03-23

Family

ID=63344582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810110273.XA Active CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling

Country Status (1)

Country Link
CN (1) CN108491226B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388565A (en) * 2018-09-27 2019-02-26 Xidian University (西安电子科技大学) Software system performance optimization method based on generative adversarial network
CN110134665A (en) * 2019-04-17 2019-08-16 北京百度网讯科技有限公司 Database self-learning optimization method and device based on traffic mirroring
CN111259933A (en) * 2020-01-09 2020-06-09 中国科学院计算技术研究所 High-dimensional feature data classification method and system based on distributed parallel decision tree
CN111629048A (en) * 2020-05-22 2020-09-04 浪潮电子信息产业股份有限公司 spark cluster optimal configuration parameter determination method, device and equipment
CN112418311A (en) * 2020-11-21 2021-02-26 安徽理工大学 Distributed random forest method for risk assessment of communication network
CN113032367A (en) * 2021-03-24 2021-06-25 安徽大学 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system
WO2022111125A1 (en) * 2020-11-27 2022-06-02 深圳先进技术研究院 Random-forest-based automatic optimization method for graphic data processing framework

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327118A (en) * 2013-07-09 2013-09-25 南京大学 Intelligent virtual machine cluster scaling method and system for web application in cloud computing
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN106844673A (en) * 2017-01-24 2017-06-13 山东亿海兰特通信科技有限公司 A kind of method and system based on the public security data acquisition intimate degree of multidimensional personnel
US20170168814A1 (en) * 2015-12-15 2017-06-15 Impetus Technologies, Inc. System and Method for Registration of a Custom Component in a Distributed Computing Pipeline
CN107360026A (en) * 2017-07-07 2017-11-17 西安电子科技大学 Distributed message performance of middle piece is predicted and modeling method
US20170364795A1 (en) * 2016-06-15 2017-12-21 Akw Analytics Inc. Petroleum analytics learning machine system with machine learning analytics applications for upstream and midstream oil and gas industry

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103327118A (en) * 2013-07-09 2013-09-25 南京大学 Intelligent virtual machine cluster scaling method and system for web application in cloud computing
US20170168814A1 (en) * 2015-12-15 2017-06-15 Impetus Technologies, Inc. System and Method for Registration of a Custom Component in a Distributed Computing Pipeline
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
US20170364795A1 (en) * 2016-06-15 2017-12-21 Akw Analytics Inc. Petroleum analytics learning machine system with machine learning analytics applications for upstream and midstream oil and gas industry
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN106844673A (en) * 2017-01-24 2017-06-13 山东亿海兰特通信科技有限公司 A kind of method and system based on the public security data acquisition intimate degree of multidimensional personnel
CN107360026A (en) * 2017-07-07 2017-11-17 西安电子科技大学 Distributed message performance of middle piece is predicted and modeling method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BAO LIANG et al.: "An Orthogonal Genetic Algorithm for QoS-Aware Service Composition", Computer Journal *
HADJ AHMED BOUARARA et al.: "A Fireworks Algorithm for Modern Web Information Retrieval with Visual Results Mining", International Journal of Swarm Intelligence Research *
ZHAN JIANFENG et al.: "BigDataBench: an Open-Source Big Data System Benchmark", Chinese Journal of Computers (计算机学报) *
BAO LIANG: "Research on Web Service Composition Technology Based on Functional Programming", China Doctoral Dissertations Full-text Database, Information Science and Technology (中国博士学位论文全文数据库·信息科技辑) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388565A (en) * 2018-09-27 2019-02-26 Xidian University (西安电子科技大学) Software system performance optimization method based on generative adversarial network
CN109388565B (en) * 2018-09-27 2021-08-06 Xidian University (西安电子科技大学) Software system performance optimization method based on generative adversarial network
CN110134665A (en) * 2019-04-17 2019-08-16 北京百度网讯科技有限公司 Database self-learning optimization method and device based on traffic mirroring
CN110134665B (en) * 2019-04-17 2021-05-25 北京百度网讯科技有限公司 Database self-learning optimization method and device based on flow mirror image
CN111259933A (en) * 2020-01-09 2020-06-09 中国科学院计算技术研究所 High-dimensional feature data classification method and system based on distributed parallel decision tree
CN111259933B (en) * 2020-01-09 2023-06-13 中国科学院计算技术研究所 High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN111629048A (en) * 2020-05-22 2020-09-04 浪潮电子信息产业股份有限公司 spark cluster optimal configuration parameter determination method, device and equipment
CN111629048B (en) * 2020-05-22 2023-04-07 浪潮电子信息产业股份有限公司 spark cluster optimal configuration parameter determination method, device and equipment
CN112418311A (en) * 2020-11-21 2021-02-26 安徽理工大学 Distributed random forest method for risk assessment of communication network
WO2022111125A1 (en) * 2020-11-27 2022-06-02 深圳先进技术研究院 Random-forest-based automatic optimization method for graphic data processing framework
CN113032367A (en) * 2021-03-24 2021-06-25 安徽大学 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system

Also Published As

Publication number Publication date
CN108491226B (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN108491226A (en) Spark based on cluster scaling configures parameter automated tuning method
CN107341270B (en) Social platform-oriented user emotion influence analysis method
CN102567391B (en) Method and device for building classification forecasting mixed model
CN105589806B (en) A kind of software defect tendency Forecasting Methodology based on SMOTE+Boosting algorithms
CN108154430A (en) A kind of credit scoring construction method based on machine learning and big data technology
CN103778548B (en) Merchandise news and key word matching method, merchandise news put-on method and device
CN103744928B (en) A kind of network video classification method based on history access record
CN102591917B (en) Data processing method and system and related device
CN102279887B (en) A kind of Document Classification Method, Apparatus and system
CN110543616B (en) SMT solder paste printing volume prediction method based on industrial big data
CN102023986B (en) The method and apparatus of text classifier is built with reference to external knowledge
CN105740227B (en) A kind of genetic simulated annealing method of neologisms in solution Chinese word segmentation
CN103310003A (en) Method and system for predicting click rate of new advertisement based on click log
CN106803799B (en) Performance test method and device
CN110502361A (en) Fine granularity defect positioning method towards bug report
CN103530347A (en) Internet resource quality assessment method and system based on big data mining
CN107067182A (en) Towards the product design scheme appraisal procedure of multidimensional image
CN113536097B (en) Recommendation method and device based on automatic feature grouping
CN108537273A (en) A method of executing automatic machinery study for unbalanced sample
CN105893669A (en) Global simulation performance predication method based on data digging
CN104820724A (en) Method for obtaining prediction model of knowledge points of text-type education resources and model application method
CN111476296A (en) Sample generation method, classification model training method, identification method and corresponding devices
CN103795592B (en) Online water navy detection method and device
CN104657574A (en) Building method and device for medical diagnosis models
CN106843941A (en) Information processing method, device and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230602

Address after: Building 1, Science and Technology Innovation Service Center, No. 856 Zhongshan East Road, High tech Zone, Shijiazhuang City, Hebei Province, 050035

Patentee after: Hegang Digital Technology Co.,Ltd.

Address before: No. 2 Taibai South Road, Yanta District, Xi'an, Shaanxi Province, 710071

Patentee before: XIDIAN University