CN108491226B - Spark configuration parameter automatic tuning method based on cluster scaling - Google Patents
- Publication number: CN108491226B (application CN201810110273.XA)
- Authority
- CN
- China
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/71—Version control; Configuration management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
Abstract
The invention discloses a Spark configuration parameter automatic tuning method based on cluster scaling, which comprises the following steps: (1) building a cluster; (2) selecting a configuration parameter set; (3) determining the value type and range of the configuration parameters; (4) scaling the cluster; (5) training a random forest model; (6) screening the optimal configuration; (7) verifying the configuration effect. The invention can be applied to the technical field of mass data processing. By scaling the value range of the memory configuration parameters of the distributed memory computing framework Spark and the volume of data to be processed, it shortens the time needed to evaluate each configuration; a random forest model establishes the relationship between a configuration and its impact on the performance of the distributed memory computing framework Spark cluster; and the method searches for the configuration under which a distributed memory computing framework Spark cluster composed of several computers with identical hardware performs best.
Description
Technical Field
The invention belongs to the technical field of computers, and further relates to a Spark configuration parameter automatic tuning method based on cluster scaling in the technical field of mass data processing. By scaling the distributed memory computing framework Spark cluster and training a random forest model, the invention obtains a configuration under which the distributed memory computing framework Spark cluster performs better than under the default configuration.
Background
The distributed memory computing framework Spark is a big data parallel computing framework based on in-memory computing. Because it computes in memory, it improves the real-time performance of data processing in big data environments while guaranteeing high fault tolerance and high scalability, and it allows users to deploy the distributed memory computing framework Spark on large numbers of cheap machines to form a cluster. The distributed memory computing framework Spark is currently used by many large companies, including Amazon, eBay, and Yahoo!, and many organizations run it on clusters with thousands of nodes. Configuration parameter optimization has been one of the research hotspots of the distributed memory computing framework Spark: the configuration parameters are numerous (more than 100), performance is strongly affected by them, and the default configuration is far from achieving the best performance. Automatic optimization of the configuration parameters of the distributed memory computing framework Spark is therefore an urgent problem.
The patent document "An automatic optimization method for data-aware Spark configuration parameters" (application number 201611182310.5, filing date 2016.12.20, publication number CN106648654A), filed by the Shenzhen Institutes of Advanced Technology, discloses an automatic optimization method for data-aware Spark configuration parameters. The method selects a Spark application program, determines the parameters that influence Spark performance in that application, and determines their value ranges; randomly generates parameter values within those ranges, generates a configuration file to configure Spark, runs the configured application, and collects data; builds row vectors from the collected Spark running time, input data sets, and configuration parameter values, builds a training set from many such vectors, and models the training set with a random forest algorithm; and searches for the optimal configuration parameters with a genetic algorithm over the constructed performance model. The disadvantage of this method is that the influence of every configuration on the performance of the distributed memory computing framework Spark cluster must be evaluated in the actual environment to serve as the training set of the random forest model, which wastes a large amount of time.
The patent document "A Spark platform performance automatic optimization method" (application number 201610068611.9, filing date 2016.02.01, publication number CN105868019A), filed by the University of Chinese Academy of Sciences, discloses an automatic Spark platform performance optimization method. It creates a Spark application performance model from the execution mechanism of the Spark platform; for a given Spark application, it runs part of the application's data load on the Spark platform and collects performance data during the run; it inputs the collected performance data into the Spark application performance model to determine the values of all model parameters for that application; and it computes the Spark platform's performance (total application execution time) under different configuration parameter combinations to obtain the combination with the best performance. The disadvantage of this method is that building the distributed memory computing framework Spark application performance model requires understanding the execution mechanism of the distributed memory computing framework Spark, making the modeling process complex and difficult.
Disclosure of Invention
The invention aims to provide a Spark configuration parameter automatic tuning method based on cluster scaling that addresses the high time cost and complex model-creation process of prior-art automatic optimization methods for distributed memory computing framework Spark configuration parameters.
The idea of the invention is to scale the value range of the memory configuration parameters of the distributed memory computing framework Spark and the input data volume according to the cluster scale, shortening the time needed to evaluate each configuration's influence on the performance of the distributed memory computing framework Spark cluster, so that a sufficient training set can be obtained in less time and a more accurate random forest model can be trained. Using the random forest model and the optimal-configuration screening method, the invention searches for the configuration under which a distributed memory computing framework Spark cluster composed of several computers with identical hardware performs best.
The method comprises the following specific steps:
(1) building a cluster:
building a cluster consisting of a plurality of computers with the same hardware configuration and provided with distributed memory computing frames Spark;
(2) selecting a configuration parameter set:
selecting, from all modifiable configuration parameters of the distributed memory computing framework Spark cluster, the configuration parameters recommended for modification in the optimization standard to form the configuration parameter set to be optimized;
(3) determining the value type and range of the configuration parameters:
setting the value type and range of each parameter in a configuration parameter set to be optimized in a Spark cluster of a distributed memory computing framework according to a parameter description standard, extracting a default value from the value range of each parameter, and forming all default values into default configuration;
(4) scaling the clusters:
scaling the value range of the memory configuration parameters in the configuration parameter set to be optimized and the data to be processed, using the scaling strategy of the distributed memory computing framework Spark cluster;
(5) training a random forest model:
(5a) recording the starting time of the searching process;
(5b) forming a multi-dimensional space by using the configuration parameter sets to be optimized as a search space, and sampling the search space by using a uniform sampling strategy to obtain configuration parameter sets uniformly distributed in the search space as an initial search configuration parameter set;
(5c) evaluating all configurations in the initial search configuration parameter set by using a configuration evaluation strategy to obtain a training set which is ordered from large to small according to the performance influence of a Spark cluster of a distributed memory computing framework;
(5d) taking the top-ranked configurations from the training set to form the iterative search configuration parameter set, where m denotes the user-specified total number of configurations searched in each iterative search;
(5e) inputting the training set into a random forest model to train the model;
(6) screening an optimal configuration:
(6a) generating a configuration parameter set with the uniform sampling strategy and randomly drawing a number of configurations from it; evaluating each drawn configuration with the configuration evaluation strategy; if a configuration's performance impact on the distributed memory computing framework Spark cluster is greater than that of the first configuration in the training set, creating an ordered configuration parameter set, sorted in descending order of performance impact on the distributed memory computing framework Spark cluster, and putting the configuration into it; and adding each configuration evaluation result to the training set;
(6b) for each actual configuration in the iterative search configuration parameter set, reducing a search space according to a range approximation strategy, and generating a configuration parameter set by using a uniform sampling strategy; inputting each configuration in the configuration parameter set into a random forest model, predicting the performance influence of the configuration on a distributed memory computing frame Spark cluster, and obtaining the predicted configuration with the maximum performance influence in the prediction result;
(6c) obtaining the performance influence of the predicted configuration on the distributed memory computing frame Spark cluster by using a configuration evaluation strategy, forming a sequence by the predicted configuration and the performance influence of the configuration on the distributed memory computing frame Spark cluster, adding the sequence into a training set, and replacing the actual configuration according to two situations in a configuration replacement strategy; if the actual configuration is not replaced, the next search does not adopt a range approximation strategy for the actual configuration;
(6d) subtracting the initial time of the searching process from the time when the configuration replacement is completed to obtain the time of the searching process;
(6e) judging whether the elapsed search time is less than the user-specified search time; if so, executing step (6a); otherwise, executing step (6f);
(6f) extracting the configuration with the maximum influence on the performance of the distributed memory computing framework Spark cluster in the training set as the optimal configuration;
(7) and (3) verifying configuration effect:
(7a) restoring the scaled memory configuration values and the scaled data to be processed using the distributed memory computing framework Spark cluster restoration strategy, obtaining the configuration to be verified and the actual data to be processed;
(7b) evaluating, with the configuration evaluation strategy, the performance impact of the configuration to be verified and of the default configuration on the distributed memory computing framework Spark cluster; if the configuration to be verified has the greater performance impact, taking it as the automatically tuned configuration of the distributed memory computing framework Spark cluster.
Compared with the prior art, the invention has the following advantages:
Firstly, the invention uses the scaling strategy of the distributed memory computing framework Spark cluster to scale the value range of the memory configuration parameters in the configuration parameter set to be optimized and the data to be processed, shortening the time needed to evaluate each configuration's impact on the performance of the distributed memory computing framework Spark cluster. This overcomes the prior-art problem that every configuration must be evaluated in the actual environment to build the random forest training set, which wastes a large amount of time, and thereby reduces the time cost of obtaining the model training set.
Secondly, the training set is input into the random forest model to train it; because the random forest model learns the configuration-performance relationship directly, no model of the execution mechanism of the distributed memory computing framework Spark is needed. This overcomes the prior-art problem that building a distributed memory computing framework Spark application performance model requires understanding the execution mechanism of the distributed memory computing framework Spark and makes the modeling process complex and difficult, thereby lowering the threshold for users to tune a distributed memory computing framework Spark cluster.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a simulation experiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1.
And step 1, building a cluster.
And (4) building a cluster consisting of a plurality of computers with the same hardware configuration and provided with distributed memory computing frames Spark.
And 2, selecting a configuration parameter set.
From all modifiable configuration parameters of the distributed memory computing framework Spark cluster, the configuration parameters recommended for modification in the optimization standard are selected to form the configuration parameter set to be optimized.
On the tuning page of the official documentation of the distributed memory computing framework Spark, the optimization standard specifies the configuration parameters that should be optimized.
And step 3, determining the value type and range of the configuration parameters.
Setting the value type and range of each parameter in a configuration parameter set to be optimized in a Spark cluster of a distributed memory computing framework according to a parameter description standard, extracting default values from the value range of each parameter, and forming default configuration by all the default values.
On the configuration page of the official documentation of the distributed memory computing framework Spark, the parameter description standard specifies in detail the role, default value, and value range of each configuration parameter.
And 4, scaling the cluster.
And utilizing a distributed memory computing framework Spark cluster scaling strategy to scale the value range of the memory configuration parameters in the configuration parameter set to be optimized and the data to be processed.
The steps of the distributed memory computing framework Spark cluster scaling strategy are as follows:
Step 1: compute the cluster scale R, where R denotes the scale of the distributed memory computing framework Spark cluster, ⌊·⌋ denotes the round-down (floor) operation, log₂ denotes the base-2 logarithm, and M denotes the memory size of each computer in megabytes.
Step 2: compute the scaled memory configuration parameter according to the following formula:
m = (C - 300) / R + 300
where m denotes the scaled memory configuration parameter and C denotes the corresponding unscaled value in the value range of the memory configuration parameter.
Step 3: compute the scaled data to be processed according to the following formula:
d = D / R
where d denotes the data to be processed after scaling and D denotes the data to be processed before scaling.
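Assuming the cluster scale R from step 1 is known, the scaling of the memory configuration and the data volume can be sketched in Python. The formulas here are the inverses of the restoration formulas C=(m-300)×R+300 and D=d×R used later in the restoration strategy; the function names are illustrative:

```python
def scale_memory(c_mb: float, r: int) -> float:
    """Scale an unscaled memory configuration value (in MB) down by the
    cluster scale R; inverse of the restoration C = (m - 300) * R + 300."""
    return (c_mb - 300) / r + 300

def scale_data(d_size: float, r: int) -> float:
    """Scale the data volume to be processed down by R; inverse of the
    restoration D = d * R."""
    return d_size / r

# Example with a hypothetical scale R = 4: an 8300 MB memory setting
# scales to 2300 MB, and restoring it recovers the original value.
m = scale_memory(8300, 4)
assert m == 2300
assert (m - 300) * 4 + 300 == 8300
```

The 300 MB offset mirrors the constant in the restoration formula, so scaling followed by restoration is an exact round trip.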
And 5, training a random forest model.
The starting moment of the search process is recorded.
And forming a multi-dimensional space by using the configuration parameter sets to be optimized as a search space, and sampling the search space by using a uniform sampling strategy to obtain configuration parameter sets uniformly distributed in the search space as an initial search configuration parameter set.
The steps of the uniform sampling strategy are as follows:
Step 1: divide the value range of each dimension of the search space into k equal-width intervals, where k denotes the total number of configurations to generate.
Step 2: randomly select a floating-point number in each interval.
Step 3: combine the floating-point numbers selected from all the intervals of a dimension into a k-dimensional sequence, and randomly shuffle the order of the floating-point numbers within it to obtain a shuffled k-dimensional sequence.
Step 4: form a sequence from the floating-point numbers at the same position of the shuffled k-dimensional sequences of all dimensions; each such sequence is one configuration, yielding a configuration parameter set of k configurations.
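The sampling procedure above amounts to a Latin-hypercube-style sampler: one random value per interval per dimension, shuffled before pairing across dimensions. A minimal Python sketch (the parameter names and ranges are illustrative, not taken from the patent):

```python
import random

def uniform_sample(ranges, k):
    """Draw k configurations spread uniformly over the search space.

    ranges: {parameter_name: (low, high)} value range per dimension.
    k: number of configurations to generate.
    """
    columns = {}
    for name, (low, high) in ranges.items():
        width = (high - low) / k
        # Steps 1-2: split the range into k equal intervals and pick one
        # random floating-point value inside each interval.
        values = [random.uniform(low + i * width, low + (i + 1) * width)
                  for i in range(k)]
        # Step 3: shuffle so the pairing across dimensions is random.
        random.shuffle(values)
        columns[name] = values
    # Step 4: the i-th value of every dimension forms the i-th configuration.
    return [{name: columns[name][i] for name in ranges} for i in range(k)]

configs = uniform_sample({"spark.executor.memory": (512.0, 4096.0),
                          "spark.executor.cores": (1.0, 8.0)}, k=5)
```

Because each dimension contributes exactly one value per interval, every region of the search space is represented even for small k.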
And evaluating all configurations in the initial search configuration parameter set by using a configuration evaluation strategy to obtain a training set which is ordered from large to small according to the performance influence of a Spark cluster of a distributed memory computing framework.
The configuration evaluation strategy is as follows: run the distributed memory computing framework Spark cluster under the configuration to be evaluated, analyze the data to be processed with the analysis method specified by the user, and record the time the analysis takes. The reciprocal of this time is taken as the performance impact on the distributed memory computing framework Spark cluster, and the configuration and its performance impact on the distributed memory computing framework Spark cluster are combined into a pair. The user-specified analysis method is any data processing method the user selects from the fields of statistical analysis, machine learning, and web retrieval.
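A sketch of the configuration evaluation strategy in Python; `run_analysis` stands in for launching the user-specified analysis job on the Spark cluster under the given configuration (how the job is launched is an assumption, not specified here):

```python
import time

def evaluate_configuration(config: dict, run_analysis):
    """Run the analysis under `config`, time it, and return
    (config, performance impact), where the performance impact is the
    reciprocal of the elapsed analysis time."""
    start = time.perf_counter()
    run_analysis(config)   # e.g. submit the Spark job configured with `config`
    elapsed = time.perf_counter() - start
    return config, 1.0 / elapsed

def add_to_training_set(training_set: list, entry: tuple) -> None:
    """Keep the training set sorted by performance impact, largest first."""
    training_set.append(entry)
    training_set.sort(key=lambda pair: pair[1], reverse=True)
```

Taking the reciprocal makes "larger is better" hold for the performance impact, which is what the descending sort of the training set relies on.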
The top-ranked configurations are taken from the training set to form the iterative search configuration parameter set, where m denotes the user-specified total number of configurations searched in each iterative search.
And inputting the training set into a random forest model to train the model.
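The patent does not prescribe a random forest implementation; assuming scikit-learn (an external-library assumption of this sketch), training a model on (configuration, performance impact) pairs and querying it looks like the following, with illustrative synthetic training data:

```python
from sklearn.ensemble import RandomForestRegressor

# Illustrative training set: (configuration dict, performance impact) pairs.
param_names = ["spark.executor.memory", "spark.executor.cores"]
training_set = [
    ({"spark.executor.memory": 1024.0, "spark.executor.cores": 2.0}, 0.010),
    ({"spark.executor.memory": 2048.0, "spark.executor.cores": 4.0}, 0.014),
    ({"spark.executor.memory": 3072.0, "spark.executor.cores": 6.0}, 0.017),
]

# Flatten each configuration into a fixed-order feature vector.
X = [[cfg[name] for name in param_names] for cfg, _ in training_set]
y = [impact for _, impact in training_set]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Predict the performance impact of an unseen configuration.
predicted_impact = model.predict([[2560.0, 5.0]])[0]
```

In the screening step, the same `predict` call ranks sampled configurations without running the cluster, which is where the saving over evaluating every configuration directly comes from.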
And 6, screening the optimal configuration.
A configuration parameter set is generated with the uniform sampling strategy, and a number of configurations are drawn from it at random. Each drawn configuration is evaluated with the configuration evaluation strategy; if its performance impact on the distributed memory computing framework Spark cluster is greater than that of the first configuration in the training set, an ordered configuration parameter set, sorted in descending order of performance impact on the distributed memory computing framework Spark cluster, is created and the configuration is put into it. Each configuration evaluation result is added to the training set.
For each actual configuration in the iterative search configuration parameter set, reducing a search space according to a range approximation strategy, and generating a configuration parameter set by using a uniform sampling strategy; and inputting each configuration in the configuration parameter set into the random forest model, predicting the performance influence of the configuration on the distributed memory computing frame Spark cluster, and obtaining the predicted configuration with the maximum performance influence in the prediction result.
The steps of the range approximation strategy are as follows:
Step 1: for each dimension of the configuration being processed, consider the values of all configurations in the training set in that dimension within the search space; among the values greater than the configuration's value, take the closest one as the upper bound, and among the values smaller than it, take the closest one as the lower bound.
And 2, taking the upper and lower boundaries of each dimension as the value range of the dimension, and forming the reduced search space by the value ranges of all the dimensions.
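The two steps above can be sketched in Python, assuming each configuration is a mapping from parameter name to numeric value (the fallback to the original range when no neighboring training value exists is an assumption of this sketch):

```python
def approximate_range(config, training_configs, full_ranges):
    """Shrink the search space around `config`.

    For each dimension, the upper bound is the closest training-set value
    above config's value and the lower bound is the closest value below it;
    when no such value exists, the original bound is kept.
    """
    reduced = {}
    for name, value in config.items():
        others = [c[name] for c in training_configs]
        above = [v for v in others if v > value]
        below = [v for v in others if v < value]
        low0, high0 = full_ranges[name]
        reduced[name] = (max(below) if below else low0,
                         min(above) if above else high0)
    return reduced
```

The reduced per-dimension ranges together form the shrunken search space that the uniform sampling strategy is then applied to.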
Obtaining the performance influence of the predicted configuration on the distributed memory computing frame Spark cluster by using a configuration evaluation strategy, forming a sequence by the predicted configuration and the performance influence of the configuration on the distributed memory computing frame Spark cluster, adding the sequence into a training set, and replacing the actual configuration according to two situations in a configuration replacement strategy; if the actual configuration is not replaced, the next search does not employ a range approximation strategy for the actual configuration.
The two cases in which the configuration replacement policy replaces the actual configuration are:
A. for situations where the predicted configuration performance impact is greater than the actual configuration, the actual configuration is replaced with the predicted configuration.
B. For the case where the ordered set of configuration parameters is not empty, a first configuration is extracted from the ordered set of configuration parameters in place of the actual configuration.
And subtracting the starting time of the searching process from the time when the configuration replacement is finished to obtain the time of the searching process.
Judge whether the elapsed search time is less than the user-specified search time; if so, re-execute step 6; otherwise, extract the configuration in the training set with the greatest performance impact on the distributed memory computing framework Spark cluster as the optimal configuration.
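The timing check above makes step 6 a time-bounded search loop, which can be sketched as follows; `iterate_once` stands in for one pass of sampling, prediction, evaluation, and replacement, returning a (configuration, impact) pair:

```python
import time

def timed_search(iterate_once, search_time_seconds):
    """Repeat one screening iteration until the user-specified search time
    elapses, then return the best (configuration, impact) pair seen."""
    start = time.monotonic()
    best = None
    while time.monotonic() - start < search_time_seconds:
        candidate = iterate_once()
        if best is None or candidate[1] > best[1]:
            best = candidate
    return best
```

The loop always runs at least one iteration for a positive time budget, so a best configuration is always available when the budget expires.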
And 7, verifying the configuration effect.
The distributed memory computing framework Spark cluster restoration strategy is used to restore the scaled memory configuration values and the scaled data to be processed, obtaining the configuration to be verified and the actual data to be processed.
The steps of the distributed memory computing framework Spark cluster restoration strategy are as follows:
Step 1: compute the restored memory configuration according to the following formula:
C=(m-300)×R+300
where C denotes the restored memory configuration, m the scaled memory configuration parameter, and R the cluster scale.
Step 2: compute the restored data to be processed according to the following formula:
D=d×R
where D denotes the restored data to be processed (the actual data volume) and d denotes the scaled data to be processed.
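The two restoration formulas translate directly into code (function names are illustrative):

```python
def restore_memory(m_scaled: float, r: int) -> float:
    """C = (m - 300) * R + 300: restore a scaled memory value (MB)."""
    return (m_scaled - 300) * r + 300

def restore_data(d_scaled: float, r: int) -> float:
    """D = d * R: restore the scaled data volume."""
    return d_scaled * r
```

For example, with a cluster scale R = 4, a scaled memory value of 2300 MB restores to 8300 MB and a scaled data volume of 25 units restores to 100.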
The configuration evaluation strategy is used to evaluate the performance impact of the configuration to be verified and of the default configuration on the distributed memory computing framework Spark cluster; if the configuration to be verified has the greater performance impact, it is taken as the automatically tuned configuration of the distributed memory computing framework Spark cluster.
The effect of the present invention is further verified below with a simulation experiment.
1. Simulation conditions are as follows:
The simulation environment consists of a distributed memory computing framework Spark cluster built from six computers with identical hardware configuration on Alibaba Cloud (Aliyun), each with the distributed memory computing framework Spark installed. The specification of each computer is shown in Table 1.
TABLE 1 Computer specification list
Operating system | CentOS 6.8
Processor cores | 4
Memory | 32 GB
Hard disk | 250 GB
2. Simulation content:
Simulation experiments were carried out with the cluster-scaling-based automatic tuning method for distributed memory computing framework Spark configuration parameters under three different user inputs, verifying that the performance of the distributed memory computing framework Spark cluster under the searched configuration is better than under the default configuration. The serial number of each simulation experiment, the user-specified data to be processed, the analysis method, the search time, the total number k of configurations in the initial search, and the total number m of configurations searched in each iterative search are shown in Table 2.
Table 2 Simulation parameters summary
Serial number | Data to be processed | Analysis method | Search time | k | m
1 | 506.9 MB | PageRank (web retrieval) | 485 minutes | 317 | 20
2 | 7.5 GB | Logistic regression (machine learning) | 360 minutes | 163 | 20
3 | 76.5 GB | WordCount (statistical analysis) | 320 minutes | 211 | 20
3. And (3) simulation result analysis:
The simulation results of the present invention are further described with reference to fig. 2. The abscissa in fig. 2 is the serial number of each user input; the ordinate is the time, in seconds, the distributed memory computing framework Spark cluster takes to analyze the data to be processed. The diagonal columns represent the default configuration and the solid columns the optimized configuration. Fig. 2 records, for the three user inputs, the time to complete the analysis with the user-specified analysis method under the optimized and the default configuration of the distributed memory computing framework Spark cluster. For every serial number the solid column is lower than the diagonal column: under the optimized configuration obtained for each of the three user inputs, the cluster analyzes the data to be processed in less time than under the default configuration. This shows that the optimized configuration outperforms the default configuration and verifies the effectiveness of the cluster-scaling-based Spark configuration parameter automatic tuning method.
In summary, the invention discloses a cluster-scaling-based automatic tuning method for Spark configuration parameters, which addresses the high time cost and the complex model-creation process of prior-art automatic tuning methods for distributed computing framework Spark configuration parameters. The method comprises the following specific steps: (1) building a cluster; (2) selecting a configuration parameter set; (3) determining the value type and range of the configuration parameters; (4) scaling the cluster; (5) training a random forest model; (6) screening the optimal configuration; (7) verifying the configuration effect. Training the random forest model and screening the optimal configuration while the distributed memory computing framework Spark cluster is scaled are the innovation of this work: scaling the cluster reduces the time cost of obtaining the training set, while training the random forest model and screening the optimal configuration set avoid a complex model-creation process and yield an optimal configuration whose performance exceeds that of the distributed memory computing framework Spark cluster under the default configuration. The invention is applicable to massive data processing: by scaling the value range of the distributed memory computing framework Spark memory configuration parameters and the input data volume in accordance with the cluster scale, it searches for the configuration that gives the best performance of a distributed memory computing framework Spark cluster composed of several computers with the same hardware configuration.
Claims (6)
1. A distributed memory computing framework Spark configuration parameter automatic tuning method based on cluster scaling, characterized in that the value range of the distributed memory computing framework Spark configuration parameters and the input data volume are scaled in accordance with the cluster scale, and the configuration giving the best performance of a distributed memory computing framework Spark cluster composed of several computers with the same hardware configuration is searched for, wherein the method comprises the following specific steps:
(1) building a cluster:
building a cluster consisting of a plurality of computers with the same hardware configuration and provided with distributed memory computing frames Spark;
(2) selecting a configuration parameter set:
selecting the configuration parameters recommended to be modified in the optimization standard from all the configuration parameters to be modified of the Spark cluster of the distributed memory computing framework to form a configuration parameter set to be optimized;
(3) determining the value type and range of the configuration parameters:
setting the value type and range of each parameter in a configuration parameter set to be optimized in a Spark cluster of a distributed memory computing framework according to a parameter description standard, extracting a default value from the value range of each parameter, and forming all default values into default configuration;
(4) scaling the clusters:
scaling the value range of the memory configuration parameters in the configuration parameter set to be optimized, and the data to be processed, by using the scaling strategy of the distributed memory computing framework Spark cluster;
the steps of the distributed memory computing framework Spark cluster scaling strategy are as follows:
firstly, calculating the scale of the distributed memory computing framework Spark cluster according to the following formula:

R = ⌊log₂ M⌋

wherein R represents the scale of the distributed memory computing framework Spark cluster, ⌊·⌋ represents the rounding-down operation, log₂ represents the logarithm with base 2, and M represents the memory size of each computer, in megabytes;
secondly, calculating the value range of the scaled memory configuration parameters according to the following formula:

m ∈ [(C₁ − 300)/R + 300, (C₂ − 300)/R + 300]

wherein m represents a scaled memory configuration parameter, ∈ represents the set-membership symbol, and C₁ and C₂ represent the lower and upper bounds of the value range of the memory configuration parameter before scaling, in megabytes;
thirdly, calculating the scaled data to be processed according to the following formula:

d = D / R

wherein d represents the data to be processed after scaling, and D represents the data to be processed before scaling;
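The scaling in step (4) and the restore in step (7a) are inverse operations. A minimal sketch in Python, assuming the cluster-scale formula R = ⌊log₂ M⌋ (a reconstruction from the symbols named in the claim); the restore formulas C = (m − 300) × R + 300 and D = d × R are stated in claim 6, and the 300 MB offset matches Spark's reserved system memory:

```python
import math

def cluster_scale(memory_mb):
    # R = floor(log2(M)); reconstructed from the stated symbols, an assumption.
    return math.floor(math.log2(memory_mb))

def scale_memory(c_mb, r):
    # Inverse of the claim 6 restore formula C = (m - 300) * R + 300.
    return (c_mb - 300) / r + 300

def restore_memory(m_mb, r):
    # Restore formula from claim 6.
    return (m_mb - 300) * r + 300

def scale_data(d_size, r):
    # Inverse of the claim 6 restore formula D = d * R.
    return d_size / r

def restore_data(d_scaled, r):
    # Restore formula from claim 6.
    return d_scaled * r
```

Scaling and restoring round-trip to the original values, which is what step (7a) relies on.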
(5) training a random forest model:
(5a) recording the starting time of the searching process;
(5b) forming a multi-dimensional space by using the configuration parameter sets to be optimized as a search space, and sampling the search space by using a uniform sampling strategy to obtain configuration parameter sets uniformly distributed in the search space as an initial search configuration parameter set;
(5c) evaluating all configurations in the initial search configuration parameter set by using a configuration evaluation strategy to obtain a training set which is ordered from large to small according to the performance influence of a Spark cluster of a distributed memory computing framework;
(5d) taking the first m configurations from the training set to form the iterative search configuration parameter set, wherein m represents the total number of configurations searched in each iterative search process, as specified by the user;
(5e) inputting the training set into a random forest model to train the model;
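Steps (5c) and (5d) order the evaluated configurations by their performance influence and keep the top m for iterative search. A small sketch, assuming each training example is a (configuration, performance impact) pair as produced by the configuration evaluation strategy; the model training of step (5e) would then feed these pairs to a random forest regressor (e.g. scikit-learn's `RandomForestRegressor`):

```python
def build_training_set(evaluated):
    # Step (5c): sort (configuration, performance impact) pairs in
    # descending order of performance impact on the Spark cluster.
    return sorted(evaluated, key=lambda pair: pair[1], reverse=True)

def iterative_search_set(training_set, m):
    # Step (5d): the first m configurations of the ordered training set
    # form the iterative search configuration parameter set.
    return [config for config, _ in training_set[:m]]
```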
(6) screening an optimal configuration:
(6a) generating a configuration parameter set by using the uniform sampling strategy and randomly taking configurations out of it; evaluating each such configuration with the configuration evaluation strategy; if a configuration's influence on the performance of the distributed memory computing framework Spark cluster is greater than that of the first configuration in the training set, creating an ordered configuration parameter set, sorted in descending order of performance influence on the distributed memory computing framework Spark cluster, and placing the configuration into it; adding each configuration evaluation result to the training set;
(6b) for each actual configuration in the iterative search configuration parameter set, reducing a search space according to a range approximation strategy, and generating a configuration parameter set by using a uniform sampling strategy; inputting each configuration in the configuration parameter set into a random forest model, predicting the performance influence of the configuration on a distributed memory computing frame Spark cluster, and obtaining the predicted configuration with the maximum performance influence in the prediction result;
(6c) obtaining the performance influence of the predicted configuration on the distributed memory computing frame Spark cluster by using a configuration evaluation strategy, forming a sequence by the predicted configuration and the performance influence of the configuration on the distributed memory computing frame Spark cluster, adding the sequence into a training set, and replacing the actual configuration according to two situations in a configuration replacement strategy; if the actual configuration is not replaced, the next search does not adopt a range approximation strategy for the actual configuration;
(6d) subtracting the initial time of the searching process from the time when the configuration replacement is completed to obtain the time of the searching process;
(6e) judging whether the time of the searching process is less than the searching time specified by the user, if so, executing the step (6a), otherwise, executing the step (6 f);
(6f) extracting the configuration with the maximum influence on the performance of the distributed memory computing framework Spark cluster in the training set as the optimal configuration;
(7) verifying the configuration effect:
(7a) restoring the scaled memory configuration and the scaled data to be processed by using the distributed memory computing framework Spark cluster restore strategy to obtain the configuration to be verified and the actual data to be processed;
(7b) evaluating, by using the configuration evaluation strategy, the performance influence of the configuration to be verified and of the default configuration on the distributed memory computing framework Spark cluster respectively; if the configuration to be verified has the greater performance influence on the distributed memory computing framework Spark cluster, taking it as the automatically tuned configuration parameters of the distributed memory computing framework Spark cluster.
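Steps (5a) and (6d)-(6f) wrap the search in a wall-clock budget: record the start time, keep refining until the user-specified search time has elapsed, then return the configuration with the largest performance influence found. A skeleton under the assumption that `evaluate` and `refine` are caller-supplied placeholders standing in for the configuration evaluation strategy and for steps (6a)-(6c):

```python
import time

def timed_search(search_seconds, initial_candidates, evaluate, refine):
    """Skeleton of the search loop of step (6): refine until the
    user-specified search time is exhausted (steps 6d-6e), then return
    the best configuration found (step 6f)."""
    start = time.monotonic()                 # step (5a): record start time
    training = [(c, evaluate(c)) for c in initial_candidates]
    training.sort(key=lambda pair: pair[1], reverse=True)
    while time.monotonic() - start < search_seconds:   # steps (6d)-(6e)
        candidate = refine(training)         # steps (6a)-(6c), caller-supplied
        training.append((candidate, evaluate(candidate)))
        training.sort(key=lambda pair: pair[1], reverse=True)
    return training[0][0]                    # step (6f): largest influence
```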
2. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the steps of the uniform sampling strategy in the steps (5b), (6a) and (6b) are as follows:
firstly, dividing each dimension of the search space into k intervals of equal size, wherein k is the total number of configuration parameter sets to search in the initial search, as specified by the user;
secondly, randomly selecting a floating point number within each interval;
thirdly, combining, for each dimension, the k selected floating point numbers into a k-dimensional sequence and randomly shuffling its order, obtaining a shuffled k-dimensional sequence per dimension;
and fourthly, forming a sequence from the floating point numbers at the same position of the shuffled k-dimensional sequences of all dimensions, each such sequence serving as one configuration, thereby obtaining k configurations.
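The uniform sampling strategy of claim 2 is essentially Latin hypercube sampling: k equal intervals per dimension, one random value per interval, an independent shuffle per dimension, and a column-wise combination into k configurations. A sketch; representing the search space as a `bounds` list of per-dimension (low, high) ranges is an assumption:

```python
import random

def uniform_sample(bounds, k, rng=random):
    """Uniform sampling strategy of claim 2: split each dimension into k
    equal intervals, draw one float per interval, shuffle each dimension
    independently, then combine values at the same position into k
    configurations."""
    columns = []
    for low, high in bounds:
        width = (high - low) / k
        column = [rng.uniform(low + i * width, low + (i + 1) * width)
                  for i in range(k)]
        rng.shuffle(column)
        columns.append(column)
    # One configuration = the values at the same index across all dimensions.
    return [tuple(col[i] for col in columns) for i in range(k)]
```

Because each dimension contributes exactly one value per interval, every dimension of the k configurations covers its whole range evenly.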
3. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the configuration evaluation strategy in steps (5c), (6a) and (6c) is to run the distributed memory computing framework Spark cluster under the configuration to be evaluated, analyze the data to be processed with the analysis method specified by the user, record the time the analysis requires, take the reciprocal of that time as the performance influence on the distributed memory computing framework Spark cluster, and form a sequence of the configuration and its performance influence on the distributed memory computing framework Spark cluster; the analysis method specified by the user is any data processing method selected by the user from the fields of statistical analysis, machine learning and web-page retrieval.
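The configuration evaluation strategy of claim 3 can be sketched as timing one run of the user-specified analysis and taking the reciprocal of the elapsed time as the performance influence. Here `run_analysis` is a placeholder (an assumption) for actually submitting the job, e.g. via spark-submit, to the distributed memory computing framework Spark cluster:

```python
import time

def evaluate_configuration(config, run_analysis):
    """Configuration evaluation strategy of claim 3: run the user-specified
    analysis under `config`, record the elapsed time, and use its
    reciprocal as the performance influence. Returns the (configuration,
    performance influence) sequence that is added to the training set."""
    start = time.monotonic()
    run_analysis(config)                     # stand-in for the Spark job
    elapsed = time.monotonic() - start
    return (config, 1.0 / elapsed)           # faster run = larger influence
```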
4. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the range approximation strategy in the steps (6b) and (6c) comprises the following steps:
firstly, in each dimension, among the values of all configurations in the training set within the search space, extracting the value closest to and greater than the value of the configuration to be processed as the upper boundary, and the value closest to and smaller than it as the lower boundary;
and secondly, taking the upper and lower boundaries of each dimension as the value range of the dimension, and forming a reduced search space by the value ranges of all the dimensions.
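The range approximation strategy of claim 4 shrinks the search space around a target configuration to its nearest training-set neighbours, dimension by dimension. A sketch; keeping the target's own value when no neighbour exists on one side is an assumption, since the claim does not state that case:

```python
def approximate_range(target, training_configs):
    """Range approximation strategy of claim 4: per dimension, the new
    bounds are the nearest training-set values above and below the target
    configuration's value; all dimensions' bounds form the reduced space."""
    shrunk = []
    for dim, value in enumerate(target):
        above = [c[dim] for c in training_configs if c[dim] > value]
        below = [c[dim] for c in training_configs if c[dim] < value]
        upper = min(above) if above else value   # nearest value above
        lower = max(below) if below else value   # nearest value below
        shrunk.append((lower, upper))
    return shrunk
```

The reduced per-dimension ranges can then be fed back into the uniform sampling strategy of step (6b).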
5. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the step (6c) of replacing the actual configuration according to two situations in the configuration replacement policy includes:
A. for the case that the predicted configuration performance impact is greater than the actual configuration, replacing the actual configuration with the predicted configuration;
B. for the case where the ordered set of configuration parameters is not empty, a first configuration is extracted from the ordered set of configuration parameters in place of the actual configuration.
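The two cases of the configuration replacement policy of claim 5 can be sketched as a single function; the returned flag, which step (6c) uses to decide whether the next search skips the range approximation strategy, is an assumed representation:

```python
def replace_actual(actual, actual_impact, predicted, predicted_impact, ordered_set):
    """Configuration replacement policy of claim 5. Case A: the predicted
    configuration has the larger performance influence, so it replaces the
    actual one. Case B: the ordered configuration parameter set is not
    empty, so its first configuration replaces the actual one.
    Returns (new_actual, replaced_flag)."""
    if predicted_impact > actual_impact:          # case A
        return predicted, True
    if ordered_set:                               # case B
        return ordered_set.pop(0), True
    return actual, False   # not replaced: skip range approximation next round
```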
6. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the steps of the distributed memory computing framework Spark cluster restore strategy described in step (7a) are as follows:
step one, calculating the restored memory configuration according to the following formula:
C=(m-300)×R+300
wherein, C represents the memory configuration after reduction;
secondly, calculating the reduced data to be processed according to the following formula:
D=d×R
wherein D represents the restored data to be processed and d represents the data to be processed after scaling.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810110273.XA CN108491226B (en) | 2018-02-05 | 2018-02-05 | Spark configuration parameter automatic tuning method based on cluster scaling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108491226A CN108491226A (en) | 2018-09-04 |
CN108491226B true CN108491226B (en) | 2021-03-23 |
Family
ID=63344582
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109388565B (en) * | 2018-09-27 | 2021-08-06 | 西安电子科技大学 | Software system performance optimization method based on generating type countermeasure network |
CN110134665B (en) * | 2019-04-17 | 2021-05-25 | 北京百度网讯科技有限公司 | Database self-learning optimization method and device based on flow mirror image |
CN111259933B (en) * | 2020-01-09 | 2023-06-13 | 中国科学院计算技术研究所 | High-dimensional characteristic data classification method and system based on distributed parallel decision tree |
CN111629048B (en) * | 2020-05-22 | 2023-04-07 | 浪潮电子信息产业股份有限公司 | spark cluster optimal configuration parameter determination method, device and equipment |
CN112418311A (en) * | 2020-11-21 | 2021-02-26 | 安徽理工大学 | Distributed random forest method for risk assessment of communication network |
CN114565001A (en) * | 2020-11-27 | 2022-05-31 | 深圳先进技术研究院 | Automatic tuning method for graph data processing framework based on random forest |
CN113032367A (en) * | 2021-03-24 | 2021-06-25 | 安徽大学 | Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103327118A (en) * | 2013-07-09 | 2013-09-25 | 南京大学 | Intelligent virtual machine cluster scaling method and system for web application in cloud computing |
CN105868019A (en) * | 2016-02-01 | 2016-08-17 | 中国科学院大学 | Automatic optimization method for performance of Spark platform |
CN106648654A (en) * | 2016-12-20 | 2017-05-10 | 深圳先进技术研究院 | Data sensing-based Spark configuration parameter automatic optimization method |
CN106844673A (en) * | 2017-01-24 | 2017-06-13 | 山东亿海兰特通信科技有限公司 | A kind of method and system based on the public security data acquisition intimate degree of multidimensional personnel |
CN107360026A (en) * | 2017-07-07 | 2017-11-17 | 西安电子科技大学 | Distributed message performance of middle piece is predicted and modeling method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10031747B2 (en) * | 2015-12-15 | 2018-07-24 | Impetus Technologies, Inc. | System and method for registration of a custom component in a distributed computing pipeline |
US10430725B2 (en) * | 2016-06-15 | 2019-10-01 | Akw Analytics Inc. | Petroleum analytics learning machine system with machine learning analytics applications for upstream and midstream oil and gas industry |
- 2018-02-05 CN CN201810110273.XA patent/CN108491226B/en active Active
Non-Patent Citations (4)
Title |
---|
A Fireworks Algorithm for Modern Web Information Retrieval with Visual Results Mining;Hadj Ahmed Bouarara等;《International Journal of Swarm Intelligence Research》;20151231;第6卷(第3期);第1-23页 * |
An Orthogonal Genetic Algorithm for QoS-Aware Service Composition;Bao Liang等;《COMPUTER JOURNAL》;20161231;第59卷(第12期);第1857-1871页 * |
BigDataBench:开源的大数据系统评测基准;詹剑锋等;《计算机学报》;20160131;第39卷(第1期);第196-211页 * |
基于函数式编程的Web服务组合技术研究;鲍亮;《中国博士学位论文全文数据库·信息科技辑》;20101015;第I139-19页 * |
Also Published As
Publication number | Publication date |
---|---|
CN108491226A (en) | 2018-09-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20230602 Address after: Building 1, Science and Technology Innovation Service Center, No. 856 Zhongshan East Road, High tech Zone, Shijiazhuang City, Hebei Province, 050035 Patentee after: Hegang Digital Technology Co.,Ltd. Address before: 710071 Taibai South Road, Yanta District, Xi'an, Shaanxi Province, No. 2 Patentee before: XIDIAN University |