CN108491226B - Spark configuration parameter automatic tuning method based on cluster scaling - Google Patents

Spark configuration parameter automatic tuning method based on cluster scaling

Info

Publication number
CN108491226B
CN108491226B (application CN201810110273.XA)
Authority
CN
China
Prior art keywords
configuration
distributed memory
cluster
spark
memory computing
Prior art date
Legal status: Active
Application number
CN201810110273.XA
Other languages
Chinese (zh)
Other versions
CN108491226A
Inventor
鲍亮
陈炜昭
卜晓璇
Current Assignee
Hegang Digital Technology Co ltd
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University
Priority to CN201810110273.XA
Publication of CN108491226A
Application granted
Publication of CN108491226B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers


Abstract

The invention discloses a Spark configuration parameter automatic tuning method based on cluster scaling, which comprises the following steps: (1) building a cluster; (2) selecting a configuration parameter set; (3) determining the value types and ranges of the configuration parameters; (4) scaling the cluster; (5) training a random forest model; (6) screening the optimal configuration; (7) verifying the configuration effect. The invention can be applied to the technical field of mass data processing. By scaling the value ranges of the memory configuration parameters of the distributed memory computing framework Spark and the volume of data to be processed, it shortens the time needed to evaluate each configuration; a random forest model establishes the relationship between configurations and the performance of the distributed memory computing framework Spark cluster; and the method searches for the configuration under which a Spark cluster composed of several computers with identical hardware configuration performs best.

Description

Spark configuration parameter automatic tuning method based on cluster scaling
Technical Field
The invention belongs to the technical field of computers, and further relates to a Spark configuration parameter automatic tuning method based on cluster scaling in the technical field of mass data processing. By scaling the distributed memory computing framework Spark cluster and training a random forest model, the invention obtains a configuration under which the Spark cluster performs better than it does under the default configuration.
Background
The distributed memory computing framework Spark is a big data parallel computing framework based on in-memory computing. Because it computes in memory, Spark improves the real-time performance of data processing in big data environments while guaranteeing high fault tolerance and high scalability, and it allows users to deploy clusters on large numbers of inexpensive machines. Spark is currently used by many major companies, including Amazon, eBay, and Yahoo!, and many organizations run it on clusters with thousands of nodes. Configuration parameter optimization has long been a research hotspot for Spark: the configuration parameters are numerous (more than 100), performance is strongly affected by them, and the default configuration is far from the best achievable performance. Automatic optimization of the configuration parameters of the distributed memory computing framework Spark is therefore an urgent problem.
The patent document "An automatic optimization method for data-aware Spark configuration parameters" (application number 201611182310.5, filing date 2016.12.20, publication number CN106648654A), filed by the Shenzhen Institutes of Advanced Technology, discloses an automatic optimization method for data-aware Spark configuration parameters. The method selects a Spark application program, determines the parameters that influence the application's performance, and determines their value ranges; randomly generates parameter values within those ranges, generates configuration files for Spark, runs the configured application, and collects data; builds row vectors from the collected Spark running times, input data sets, and configuration parameter values, assembles a training set from many such vectors, and models the training set with a random forest algorithm; and then searches for the optimal configuration parameters with a genetic algorithm over the constructed performance model. The drawback of this method is that the influence of every configuration on the performance of the distributed memory computing framework Spark cluster must be evaluated in the actual environment to serve as the training set of the random forest model, which wastes a large amount of time.
The patent document "A Spark platform performance automatic optimization method" (application number 201610068611.9, filing date 2016.02.01, publication number CN105868019A), filed by the University of Chinese Academy of Sciences, discloses an automatic Spark platform performance optimization method. It creates a Spark application performance model from the execution mechanism of the Spark platform; for a given Spark application, it runs part of the application's data load on the Spark platform and collects performance data from the run; it inputs the collected performance data into the Spark application performance model to determine the values of the model's parameters; and it computes the platform's performance (total application execution time) under different configuration parameter combinations to obtain the combination for which performance is optimal. The drawback of this method is that building the Spark application performance model requires understanding the execution mechanism of the distributed memory computing framework Spark, and the modeling process is complex and difficult.
Disclosure of Invention
The invention aims to provide a Spark configuration parameter automatic optimization method based on cluster scaling, addressing the high time cost and complex model creation process of prior-art methods for automatically optimizing the configuration parameters of the distributed memory computing framework Spark.
The idea for realizing this purpose is to scale the value ranges of the Spark memory configuration parameters and the input data volume according to the cluster scaling, shortening the time needed to evaluate the influence of each configuration on cluster performance, so that a sufficient training set can be obtained in less time and a more accurate random forest model can be trained. A random forest model and an optimal-configuration screening method are then used to search for the configuration under which a Spark cluster composed of several computers with identical hardware configuration performs best.
The method comprises the following specific steps:
(1) building a cluster:
building a cluster consisting of several computers with the same hardware configuration, each installed with the distributed memory computing framework Spark;
(2) selecting a configuration parameter set:
selecting the configuration parameters recommended to be modified in the optimization standard from all the configuration parameters to be modified of the Spark cluster of the distributed memory computing framework to form a configuration parameter set to be optimized;
(3) determining the value type and range of the configuration parameters:
setting the value type and range of each parameter in a configuration parameter set to be optimized in a Spark cluster of a distributed memory computing framework according to a parameter description standard, extracting a default value from the value range of each parameter, and forming all default values into default configuration;
(4) scaling the cluster:
scaling the value ranges of the memory configuration parameters in the configuration parameter set to be optimized, and the data to be processed, by using the scaling strategy of the distributed memory computing framework Spark cluster;
(5) training a random forest model:
(5a) recording the starting time of the searching process;
(5b) forming a multi-dimensional space by using the configuration parameter sets to be optimized as a search space, and sampling the search space by using a uniform sampling strategy to obtain configuration parameter sets uniformly distributed in the search space as an initial search configuration parameter set;
(5c) evaluating all configurations in the initial search configuration parameter set by using a configuration evaluation strategy to obtain a training set which is ordered from large to small according to the performance influence of a Spark cluster of a distributed memory computing framework;
(5d) taking the top configurations from the training set to form an iterative search configuration parameter set (the number taken is given by an equation that appears in the source only as an image), wherein m represents the total number of configurations searched in each iterative search process, as specified by the user;
(5e) inputting the training set into a random forest model to train the model;
(6) screening an optimal configuration:
(6a) generating a configuration parameter set by using a uniform sampling strategy and randomly taking a number of configurations from it (the count is given by an equation that appears in the source only as an image); evaluating each configuration with the configuration evaluation strategy; if a configuration's influence on the performance of the distributed memory computing framework Spark cluster is larger than that of the first configuration in the training set, creating an ordered configuration parameter set and putting the configuration into it, sorted in descending order of performance influence; and adding each configuration evaluation result to the training set;
(6b) for each actual configuration in the iterative search configuration parameter set, reducing a search space according to a range approximation strategy, and generating a configuration parameter set by using a uniform sampling strategy; inputting each configuration in the configuration parameter set into a random forest model, predicting the performance influence of the configuration on a distributed memory computing frame Spark cluster, and obtaining the predicted configuration with the maximum performance influence in the prediction result;
(6c) obtaining the performance influence of the predicted configuration on the distributed memory computing frame Spark cluster by using a configuration evaluation strategy, forming a sequence by the predicted configuration and the performance influence of the configuration on the distributed memory computing frame Spark cluster, adding the sequence into a training set, and replacing the actual configuration according to two situations in a configuration replacement strategy; if the actual configuration is not replaced, the next search does not adopt a range approximation strategy for the actual configuration;
(6d) subtracting the initial time of the searching process from the time when the configuration replacement is completed to obtain the time of the searching process;
(6e) judging whether the time of the searching process is less than the searching time specified by the user, if so, executing the step (6a), otherwise, executing the step (6 f);
(6f) extracting the configuration with the maximum influence on the performance of the distributed memory computing framework Spark cluster in the training set as the optimal configuration;
(7) verifying the configuration effect:
(7a) restoring the values of the scaled memory configuration and the scaled data to be processed by using the distributed memory computing framework Spark cluster restoring strategy, obtaining the configuration to be verified and the actual data to be processed;
(7b) evaluating, with the configuration evaluation strategy, the performance influence of the configuration to be verified and of the default configuration on the distributed memory computing framework Spark cluster, and, if the configuration to be verified has the greater performance influence, taking it as the automatically tuned configuration parameters of the Spark cluster.
Compared with the prior art, the invention has the following advantages:
First, the invention scales the value ranges of the memory configuration parameters in the configuration parameter set to be optimized, and the data to be processed, using the scaling strategy of the distributed memory computing framework Spark cluster. This shortens the time needed to evaluate the influence of each configuration on cluster performance, solving the prior-art problem that every configuration must be evaluated in the actual environment to build the random forest training set, and thereby reduces the time cost of obtaining the model training set.
Second, the training set is input into a random forest model, which learns the relationship between configurations and cluster performance directly from the data rather than from a model of Spark's execution mechanism. This solves the prior-art problems that building a Spark application performance model requires understanding the execution mechanism of the distributed memory computing framework Spark and that the modeling process is complex and difficult, thereby lowering the threshold for users to optimize a Spark cluster.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a simulation experiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1.
Step 1, building a cluster.
A cluster consisting of several computers with the same hardware configuration, each installed with the distributed memory computing framework Spark, is built.
Step 2, selecting a configuration parameter set.
The configuration parameters recommended for modification by the tuning guidance are selected from all modifiable configuration parameters of the distributed memory computing framework Spark cluster to form the configuration parameter set to be optimized.
The tuning page of the official Spark documentation specifies the configuration parameters that should be optimized.
Step 3, determining the value types and ranges of the configuration parameters.
The value type and range of each parameter in the configuration parameter set to be optimized are set according to the parameter description in the official documentation; a default value is extracted from each parameter's value range, and all default values together form the default configuration.
The configuration page of the official Spark documentation specifies in detail the role, default value, and value range of each configuration parameter.
Step 4, scaling the cluster.
The value ranges of the memory configuration parameters in the configuration parameter set to be optimized, and the data to be processed, are scaled using the distributed memory computing framework Spark cluster scaling strategy.
The steps of the distributed memory computing framework Spark cluster scaling strategy are as follows:
Step 1, calculate the scaling ratio of the distributed memory computing framework Spark cluster. The formula appears in the source only as an image; it is expressed in terms of ⌊log₂ M⌋, wherein R represents the scaling ratio of the distributed memory computing framework Spark cluster, ⌊ ⌋ represents the rounding-down operation, log₂ represents the base-2 logarithm, and M represents the memory size of each computer in megabytes.
Step 2, calculate the value range of the scaled memory configuration parameters. The formula appears in the source only as an image; consistent with the restoring formula C = (m - 300) × R + 300 in step 7, the inferred form, applied to both endpoints of the value range, is

m = (C - 300) / R + 300

wherein m represents the scaled memory configuration parameter, C represents the corresponding value before scaling, and ∈ (the membership symbol mentioned in the source) denotes that m ranges over the scaled interval.
Step 3, calculate the scaled data to be processed. The formula appears in the source only as an image; consistent with the restoring formula D = d × R in step 7, the inferred form is

d = D / R

wherein d represents the data to be processed after scaling and D represents the data to be processed before scaling.
Step 5, training a random forest model.
The starting moment of the search process is recorded.
The configuration parameter set to be optimized spans a multi-dimensional search space; this space is sampled with the uniform sampling strategy to obtain configurations uniformly distributed over the space, which serve as the initial search configuration parameter set.
The steps of the uniform sampling strategy are as follows:
Step 1, divide each dimension of the search space into k equal intervals, wherein k is the user-specified total number of configurations to be searched in the initial search.
Step 2, randomly select a floating-point number within each interval.
Step 3, for each dimension, collect the k selected floating-point numbers into a k-dimensional sequence and randomly shuffle its order, obtaining a shuffled k-dimensional sequence.
Step 4, across all dimensions, combine the floating-point numbers at the same position of the shuffled sequences into one configuration, obtaining k configurations.
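The four steps above amount to Latin-hypercube-style sampling. A minimal Python sketch (function and parameter names are illustrative, not from the patent):

```python
import random

def uniform_sample(bounds, k, seed=None):
    """Uniform sampling strategy: one value per interval per dimension,
    columns shuffled independently, then recombined by position.

    bounds: list of (low, high) value ranges, one per configuration parameter.
    k: number of configurations to generate.
    Returns k configurations, each a list with one value per parameter.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        step = (high - low) / k
        # step 1 + step 2: one floating-point number from each of the k intervals
        col = [rng.uniform(low + i * step, low + (i + 1) * step) for i in range(k)]
        rng.shuffle(col)  # step 3: shuffle this dimension's sequence
        columns.append(col)
    # step 4: the i-th configuration takes the i-th value from every dimension
    return [[col[i] for col in columns] for i in range(k)]
```

Because every interval of every dimension contributes exactly one value, the k configurations cover the search space evenly rather than clustering.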
All configurations in the initial search configuration parameter set are evaluated with the configuration evaluation strategy, yielding a training set sorted in descending order of performance influence on the distributed memory computing framework Spark cluster.
The configuration evaluation strategy is as follows: run the distributed memory computing framework Spark cluster under the configuration to be evaluated; analyze the data to be processed with the user-specified analysis method; record the time the analysis takes; take the reciprocal of that time as the configuration's performance influence on the cluster; and combine the configuration and its performance influence into a pair. The user-specified analysis method is any data processing method the user selects from the fields of statistical analysis, machine learning, and web page retrieval.
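The configuration evaluation strategy can be sketched as follows; `run_workload` is a stand-in for actually submitting the user-specified analysis job to the Spark cluster under the given configuration (an illustrative assumption, since the patent runs a real Spark job):

```python
import time

def evaluate_configuration(config, run_workload):
    """Configuration evaluation strategy: performance influence = 1 / running time.

    config: the configuration under which the cluster would be run.
    run_workload: caller-supplied function that runs the user-specified
    analysis job under `config` and returns when it finishes.
    """
    start = time.perf_counter()
    run_workload(config)
    elapsed = time.perf_counter() - start  # time taken by the analysis
    performance = 1.0 / elapsed            # reciprocal = performance influence
    return (config, performance)           # the (configuration, influence) pair
```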
The top configurations are then taken from the training set (the count is given by an equation that appears in the source only as an image) to form the iterative search configuration parameter set, wherein m represents the user-specified total number of configurations searched in each iterative search process.
The training set is input into a random forest model to train it.
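The patent does not name a library for the random forest; a minimal sketch with scikit-learn's `RandomForestRegressor` (all names here are illustrative), where each training pair maps a configuration vector to its measured performance influence:

```python
from sklearn.ensemble import RandomForestRegressor

def train_performance_model(training_set, seed=0):
    """Fit a random forest mapping configurations to performance influence.

    training_set: list of (configuration_vector, performance_influence) pairs,
    as produced by the configuration evaluation strategy.
    """
    X = [list(cfg) for cfg, _ in training_set]
    y = [perf for _, perf in training_set]
    model = RandomForestRegressor(n_estimators=100, random_state=seed)
    model.fit(X, y)
    return model
```

The trained model is what step 6 queries to predict, without running the cluster, which candidate configuration is likely to have the greatest performance influence.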
Step 6, screening the optimal configuration.
A configuration parameter set is generated with the uniform sampling strategy, and a number of configurations (the count is given by an equation that appears in the source only as an image) are randomly taken from it. Each is evaluated with the configuration evaluation strategy; if a configuration's influence on the performance of the distributed memory computing framework Spark cluster is greater than that of the first configuration in the training set, an ordered configuration parameter set is created and the configuration is put into it, sorted in descending order of performance influence. Every evaluation result is also added to the training set.
The uniform sampling strategy and the configuration evaluation strategy are the same as described in step 5.
For each actual configuration in the iterative search configuration parameter set, the search space is reduced according to the range approximation strategy and a configuration parameter set is generated with the uniform sampling strategy; each configuration in this set is input into the random forest model to predict its performance influence on the distributed memory computing framework Spark cluster, and the configuration with the greatest predicted performance influence is selected as the predicted configuration.
The uniform sampling strategy is the same as described in step 5.
The steps of the range approximation strategy are as follows:
Step 1, in each dimension, among all configurations in the training set, find the value closest to the configuration being processed from above and take it as the upper boundary, and the value closest to it from below and take it as the lower boundary.
Step 2, take the upper and lower boundaries of each dimension as that dimension's value range; the value ranges of all dimensions together form the reduced search space.
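The two steps above can be sketched in Python (names are illustrative; the fallback to the target's own value when no training value lies on one side is an assumption, since the source does not cover that case):

```python
def approximate_range(training_configs, target):
    """Range approximation strategy: shrink each dimension to the nearest
    training-set values bracketing the configuration being processed.

    training_configs: list of configuration vectors from the training set.
    target: the configuration being processed.
    Returns a list of (lower, upper) bounds, one per dimension.
    """
    bounds = []
    for dim, t in enumerate(target):
        above = [c[dim] for c in training_configs if c[dim] > t]
        below = [c[dim] for c in training_configs if c[dim] < t]
        upper = min(above) if above else t  # closest value from above
        lower = max(below) if below else t  # closest value from below
        bounds.append((lower, upper))
    return bounds
```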
Obtaining the performance influence of the predicted configuration on the distributed memory computing frame Spark cluster by using a configuration evaluation strategy, forming a sequence by the predicted configuration and the performance influence of the configuration on the distributed memory computing frame Spark cluster, adding the sequence into a training set, and replacing the actual configuration according to two situations in a configuration replacement strategy; if the actual configuration is not replaced, the next search does not employ a range approximation strategy for the actual configuration.
The configuration evaluation strategy is the same as described in step 5.
The two cases in which the configuration replacement policy replaces the actual configuration are:
A. If the predicted configuration's performance influence is greater than the actual configuration's, the actual configuration is replaced with the predicted configuration.
B. If the ordered configuration parameter set is not empty, the first configuration is extracted from it and replaces the actual configuration.
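The replacement policy can be sketched as follows. The precedence between cases A and B is not specified in the source; this sketch checks case B first, and all names are illustrative:

```python
def replace_actual(actual, actual_perf, predicted, predicted_perf, ordered_set):
    """Configuration replacement policy (cases A and B above).

    ordered_set: configurations sorted by descending performance influence.
    Returns (new_actual, replaced). When replaced is False, the caller skips
    the range approximation strategy for this configuration in the next round.
    """
    if ordered_set:                    # case B: take the best sampled config
        return ordered_set.pop(0), True
    if predicted_perf > actual_perf:   # case A: the model's pick did better
        return predicted, True
    return actual, False
```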
The range approximation strategy is as described above.
The starting moment of the search process is subtracted from the moment the configuration replacement finishes, giving the elapsed search time.
If the elapsed search time is less than the user-specified search time, step 6 is executed again; otherwise, the configuration in the training set with the greatest performance influence on the distributed memory computing framework Spark cluster is extracted as the optimal configuration.
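The timed outer loop of step 6 can be sketched in Python (names are illustrative; `iterate_once` stands for one round of steps (6a) through (6c), which appends newly evaluated pairs to the training set):

```python
import time

def search_optimal(training_set, search_minutes, iterate_once):
    """Repeat the iterative search until the user-specified search time is
    exhausted, then return the best configuration seen.

    training_set: list of (configuration, performance_influence) pairs,
    updated in place by iterate_once.
    """
    start = time.monotonic()
    while (time.monotonic() - start) / 60.0 < search_minutes:
        iterate_once(training_set)
    # the optimal configuration has the greatest performance influence
    return max(training_set, key=lambda pair: pair[1])[0]
```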
Step 7, verifying the configuration effect.
The values of the scaled memory configuration and the scaled data to be processed are restored with the distributed memory computing framework Spark cluster restoring strategy, yielding the configuration to be verified and the actual data to be processed.
The steps of the distributed memory computing framework Spark cluster restoring strategy are as follows:
Step 1, calculate the restored memory configuration according to the following formula:
C = (m - 300) × R + 300
wherein C represents the restored memory configuration, m the scaled memory configuration parameter from step 4, and R the scaling ratio from step 4.
Step 2, calculate the restored data to be processed according to the following formula:
D = d × R
wherein D represents the restored data to be processed (equal to the data before scaling) and d the scaled data.
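The restoring formulas, together with the corresponding inverse scaling used in step 4, can be sketched as follows (the forward scaling form is an inference from the restoring formulas; the source shows the scaling equations only as images):

```python
RESERVED_MB = 300  # the constant appearing in the restoring formulas above

def restore(m_scaled_mb, d_scaled, R):
    """Restoring strategy: C = (m - 300) * R + 300 and D = d * R."""
    return (m_scaled_mb - RESERVED_MB) * R + RESERVED_MB, d_scaled * R

def scale(memory_mb, data, R):
    """Inverse mapping assumed for step 4, so that restore(scale(x)) == x."""
    return (memory_mb - RESERVED_MB) / R + RESERVED_MB, data / R
```

With this pairing, a configuration tuned on the scaled-down cluster maps back exactly to a full-size memory setting and data volume for verification.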
The performance influence of the configuration to be verified and of the default configuration on the distributed memory computing framework Spark cluster are each evaluated with the configuration evaluation strategy; if the configuration to be verified has the greater performance influence, it is taken as the automatically tuned configuration parameters of the Spark cluster.
The effect of the invention is further verified below with a simulation experiment.
1. Simulation conditions:
The simulation environment consists of 6 computers with the same hardware configuration on Alibaba Cloud, each installed with the distributed memory computing framework Spark and built into a Spark cluster. The specification of each computer in the simulation is shown in Table 1.
Table 1. Computer specification
Operating system: CentOS 6.8
Processor cores: 4
Memory: 32 GB
Hard disk: 250 GB
2. Simulation content:
Simulation experiments were performed with the cluster-scaling-based automatic tuning method for distributed memory computing framework Spark configuration parameters under three different user inputs, verifying that the performance of the distributed memory computing framework Spark cluster under the searched configuration is superior to that under the default configuration. Table 2 lists, for each experiment, the serial number, the user-specified data to be processed, the analysis method, the search time, the total number k of the configuration parameter set searched initially, and the total number m of configurations searched in each iteration.
Table 2. Simulation parameters
No.  Data to be processed  Analysis method                         Search time  k    m
1    506.9 MB              PageRank (web-page retrieval)           485 minutes  317  20
2    7.5 GB                Logistic regression (machine learning)  360 minutes  163  20
3    76.5 GB               WordCount (statistical analysis)        320 minutes  211  20
3. Simulation result analysis:
The simulation results are described with reference to Fig. 2. The abscissa of Fig. 2 is the serial number of each user input, and the ordinate is the time, in seconds, the distributed memory computing framework Spark cluster takes to analyze the data to be processed. Diagonal columns represent the default configuration and solid columns the optimized configuration. Fig. 2 records, for the three user inputs, the time to complete the analysis with the user-specified analysis method under the optimized configuration and under the default configuration of the distributed memory computing framework Spark cluster. For every serial number the solid column is lower than the diagonal column: under each of the three optimized configurations, the cluster analyzes the data to be processed faster than under the default configuration. This shows that the optimized configurations outperform the default configuration and verifies the effectiveness of the cluster-scaling-based automatic tuning method for Spark configuration parameters.
In summary, the invention discloses a cluster-scaling-based automatic tuning method for Spark configuration parameters, which addresses the high time cost and the complex model-creation process of prior automatic tuning methods for distributed computing framework configuration parameters. The method comprises the following steps: (1) building a cluster; (2) selecting a configuration parameter set; (3) determining the value types and ranges of the configuration parameters; (4) scaling the cluster; (5) training a random forest model; (6) screening the optimal configuration; (7) verifying the configuration effect. Training the random forest model and screening the optimal configuration while the distributed memory computing framework Spark cluster is scaled is the innovation of this work: scaling the cluster reduces the time cost of obtaining the training set, while training the random forest model and screening the optimal configuration set avoids a complex model-creation process and yields an optimal configuration whose performance on the distributed memory computing framework Spark cluster exceeds that of the default configuration. The invention can be applied to the field of mass data processing: by scaling the value range of the distributed memory computing framework Spark memory configuration parameters and the input data volume according to the cluster scale, it searches for the configuration under which a distributed memory computing framework Spark cluster composed of several computers with the same hardware configuration performs best.
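The random-forest surrogate at the heart of steps (5) and (6) can be sketched as follows. scikit-learn's RandomForestRegressor stands in for the patent's unspecified random forest implementation, and the training data here is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(64, 4))   # 64 sampled configurations, 4 parameters each
y = 1.0 / (1.0 + X.sum(axis=1))           # performance influence = 1 / runtime (synthetic)

# Step (5e): fit the surrogate on (configuration, performance) pairs.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Step (6b): predict the performance influence of fresh candidates and
# keep the one the surrogate rates highest, for real evaluation in (6c).
candidates = rng.uniform(0.0, 1.0, size=(32, 4))
best = candidates[np.argmax(model.predict(candidates))]
```

Only the surrogate-selected candidate is run on the real cluster, which is what keeps the number of expensive evaluations low.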

Claims (6)

1. A cluster-scaling-based automatic tuning method for distributed memory computing framework Spark configuration parameters, characterized in that the value range of the distributed memory computing framework Spark configuration parameters and the input data volume are scaled according to the cluster scale, and the configuration under which a distributed memory computing framework Spark cluster composed of a plurality of computers with the same hardware configuration performs best is searched, wherein the method comprises the following specific steps:
(1) building a cluster:
building a cluster consisting of a plurality of computers with the same hardware configuration, each provided with the distributed memory computing framework Spark;
(2) selecting a configuration parameter set:
selecting, from all modifiable configuration parameters of the distributed memory computing framework Spark cluster, the configuration parameters recommended for modification in the optimization standard to form the configuration parameter set to be optimized;
(3) determining the value type and range of the configuration parameters:
setting the value type and range of each parameter in a configuration parameter set to be optimized in a Spark cluster of a distributed memory computing framework according to a parameter description standard, extracting a default value from the value range of each parameter, and forming all default values into default configuration;
(4) scaling the clusters:
scaling the value range of the memory configuration parameters in the configuration parameter set to be optimized and the data to be processed by using the distributed memory computing framework Spark cluster scaling strategy;
the steps of the distributed memory computing framework Spark cluster scaling strategy are as follows:
firstly, calculating the scale of the distributed memory computing framework Spark cluster according to the following formula:

R = 2^⌊log₂(M/1024)⌋

wherein R represents the scale of the distributed memory computing framework Spark cluster, ⌊·⌋ represents the rounding-down operation, log₂ represents the logarithm with base 2, and M represents the memory size of each computer in megabytes;
secondly, calculating the value range of the scaled memory configuration parameter according to the following formula:

m ∈ [300, (M − 300)/R + 300]

wherein m represents the scaled memory configuration parameter and ∈ represents the set-membership symbol;
thirdly, calculating the scaled data to be processed according to the following formula:

d = D/R

wherein d represents the data to be processed after scaling and D represents the data to be processed before scaling;
(5) training a random forest model:
(5a) recording the starting time of the searching process;
(5b) forming a multi-dimensional search space from the configuration parameter set to be optimized, and sampling the search space with a uniform sampling strategy to obtain configuration parameters uniformly distributed in the search space as the initial search configuration parameter set;
(5c) evaluating all configurations in the initial search configuration parameter set by using a configuration evaluation strategy to obtain a training set which is ordered from large to small according to the performance influence of a Spark cluster of a distributed memory computing framework;
(5d) taking the top ⌈m/2⌉ configurations from the training set to form the iterative search configuration parameter set, wherein m represents the user-specified total number of configurations searched in each iterative search process;
(5e) inputting the training set into a random forest model to train the model;
(6) screening an optimal configuration:
(6a) generating a configuration parameter set with the uniform sampling strategy and randomly taking ⌊m/2⌋ configurations from it; evaluating each taken configuration with the configuration evaluation strategy; if a configuration's influence on the performance of the distributed memory computing framework Spark cluster is greater than that of the first configuration in the training set, creating an ordered configuration parameter set and inserting the configuration into it, the set being sorted in descending order of performance influence on the distributed memory computing framework Spark cluster; adding every configuration evaluation result to the training set;
(6b) for each actual configuration in the iterative search configuration parameter set, reducing a search space according to a range approximation strategy, and generating a configuration parameter set by using a uniform sampling strategy; inputting each configuration in the configuration parameter set into a random forest model, predicting the performance influence of the configuration on a distributed memory computing frame Spark cluster, and obtaining the predicted configuration with the maximum performance influence in the prediction result;
(6c) obtaining the performance influence of the predicted configuration on the distributed memory computing frame Spark cluster by using a configuration evaluation strategy, forming a sequence by the predicted configuration and the performance influence of the configuration on the distributed memory computing frame Spark cluster, adding the sequence into a training set, and replacing the actual configuration according to two situations in a configuration replacement strategy; if the actual configuration is not replaced, the next search does not adopt a range approximation strategy for the actual configuration;
(6d) subtracting the initial time of the searching process from the time when the configuration replacement is completed to obtain the time of the searching process;
(6e) judging whether the time of the search process is less than the user-specified search time; if so, executing step (6a), otherwise executing step (6f);
(6f) extracting the configuration with the maximum influence on the performance of the distributed memory computing framework Spark cluster in the training set as the optimal configuration;
(7) and (3) verifying configuration effect:
(7a) reducing the values of the reduced memory configuration and the data to be processed by using a distributed memory computing framework Spark cluster reduction strategy to obtain the configuration to be verified and the actual data to be processed;
(7b) evaluating the performance influence of the configuration to be verified and of the default configuration on the distributed memory computing framework Spark cluster respectively with the configuration evaluation strategy, and taking the configuration to be verified as the automatically tuned configuration parameters of the distributed memory computing framework Spark cluster if its performance influence exceeds that of the default configuration.
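Steps (5a)-(6f) above amount to a time-budgeted search loop. A skeleton, where `propose` and `evaluate` are hypothetical stand-ins for the surrogate-guided candidate generation and the configuration evaluation strategy:

```python
import time

def timed_search(budget_seconds, propose, evaluate, training_set):
    """Iterate until the user-specified search time is exhausted (6e),
    then return the configuration with the largest performance influence (6f).
    `training_set` is a list of (configuration, performance influence) pairs."""
    start = time.monotonic()                          # (5a) record the start time
    while time.monotonic() - start < budget_seconds:  # (6e) time check
        candidate = propose()                         # (6a)/(6b) generate a candidate
        training_set.append((candidate, evaluate(candidate)))  # (6c) evaluate, record
    return max(training_set, key=lambda pair: pair[1])[0]      # (6f) best configuration
```

The real method only evaluates the surrogate's top-rated candidate per iteration; this skeleton shows the budget and bookkeeping, not the full model-guided selection.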
2. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the steps of the uniform sampling strategy in the steps (5b), (6a) and (6b) are as follows:
the method comprises the following steps: firstly, dividing each dimension of the search space into k intervals of equal size, wherein k is the user-specified total number of configurations in the initial search configuration parameter set;
secondly, randomly selecting a floating point number in each interval;
thirdly, combining, for each dimension, the floating point numbers selected in its k intervals into a k-dimensional sequence, and randomly shuffling the order of the floating point numbers to obtain a shuffled k-dimensional sequence;
and fourthly, forming a configuration from the floating point numbers at the same position of the shuffled k-dimensional sequences of all dimensions, thereby obtaining k configurations.
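The four steps of claim 2 describe Latin-hypercube-style sampling. A sketch over continuous dimensions (the parameter bounds and the seed are illustrative):

```python
import random

def uniform_sample(bounds, k, seed=None):
    """Divide each dimension into k equal intervals, draw one random float
    per interval, shuffle each dimension's column independently, then read
    the i-th float of every dimension off as the i-th configuration."""
    rng = random.Random(seed)
    columns = []
    for lo, hi in bounds:                    # one shuffled column per dimension
        step = (hi - lo) / k
        col = [rng.uniform(lo + i * step, lo + (i + 1) * step) for i in range(k)]
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(k)]
```

The shuffle is what decouples the dimensions: each configuration lands in a distinct interval of every dimension, but the interval indices are paired at random.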
3. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the configuration evaluation strategy in steps (5c), (6a) and (6c) is to run the distributed memory computing framework Spark cluster under the configuration to be evaluated, analyze the data to be processed with the user-specified analysis method, record the time the analysis takes, and take the reciprocal of that time as the performance influence on the distributed memory computing framework Spark cluster; the configuration and its performance influence on the distributed memory computing framework Spark cluster form a sequence; the user-specified analysis method is any data processing method the user selects from the fields of statistical analysis, machine learning and web-page retrieval.
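Claim 3's evaluation strategy reduces to timing a run and inverting it. A sketch, where `run_workload` is a hypothetical stand-in for launching the user's Spark analysis job under the given configuration:

```python
import time

def evaluate_configuration(config, run_workload):
    """Run the analysis job under `config`, time it, and report the pair
    (config, 1 / runtime); the reciprocal is the performance influence,
    so faster runs score higher."""
    start = time.monotonic()
    run_workload(config)
    elapsed = time.monotonic() - start
    return config, 1.0 / elapsed
```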
4. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the range approximation strategy in the steps (6b) and (6c) comprises the following steps:
firstly, in each dimension, among the values taken by the other configurations of the training set within the search space, extracting the value closest to the configuration to be processed from those greater than its value as the upper boundary, and the value closest to the configuration to be processed from those smaller than its value as the lower boundary;
and secondly, taking the upper and lower boundaries of each dimension as the value range of the dimension, and forming a reduced search space by the value ranges of all the dimensions.
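A sketch of claim 4's range approximation, taking configurations as numeric tuples; when a dimension has no neighbor on one side, this sketch falls back to the pivot's own value (an assumption the claim does not spell out):

```python
def approximate_range(pivot, training_configs):
    """Shrink the search space around `pivot`: per dimension, the new
    bounds are the nearest training-set values below and above the
    pivot's value in that dimension."""
    shrunk = []
    for i, v in enumerate(pivot):
        below = [c[i] for c in training_configs if c[i] < v]
        above = [c[i] for c in training_configs if c[i] > v]
        shrunk.append((max(below) if below else v, min(above) if above else v))
    return shrunk
```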
5. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the step (6c) of replacing the actual configuration according to two situations in the configuration replacement policy includes:
A. for the case that the predicted configuration performance impact is greater than the actual configuration, replacing the actual configuration with the predicted configuration;
B. for the case where the ordered set of configuration parameters is not empty, a first configuration is extracted from the ordered set of configuration parameters in place of the actual configuration.
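The two replacement cases of claim 5, sketched as a single function returning the configuration to use and whether a replacement happened (the boolean drives the "skip range approximation next round" rule of step (6c)):

```python
def replace_actual(actual, actual_score, predicted, predicted_score, ordered_set):
    """Case A: the predicted configuration's performance influence beats
    the actual one -> take the predicted configuration.
    Case B: the ordered configuration set is non-empty -> take its first entry.
    Otherwise keep the actual configuration unchanged."""
    if predicted_score > actual_score:   # case A
        return predicted, True
    if ordered_set:                      # case B
        return ordered_set.pop(0), True
    return actual, False
```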
6. The cluster scaling-based automatic optimization method for the Spark configuration parameters of the distributed memory computing framework according to claim 1, wherein: the steps of the distributed memory computing framework Spark cluster restore strategy described in step (7a) are as follows:
step one, calculating the restored memory configuration according to the following formula:
C=(m-300)×R+300
wherein C represents the restored memory configuration, m represents the scaled memory configuration, and R represents the cluster scale;
secondly, calculating the restored data to be processed according to the following formula:
D=d×R
wherein D represents the restored data to be processed and d represents the scaled data to be processed.
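The scaling of claim 1 step (4) and the restore of claim 6 are inverses of each other. A round-trip sketch, assuming the scale formula R = 2^⌊log₂(M/1024)⌋ (the original equation is rendered as an image, so this exact form is an assumption):

```python
import math

def cluster_scale(memory_mb):
    """Cluster scale R; assumed form 2**floor(log2(M / 1024))."""
    return 2 ** math.floor(math.log2(memory_mb / 1024))

def scale_memory(c, R):
    """Scale a memory value down; inverse of C = (m - 300) * R + 300."""
    return (c - 300) / R + 300

def restore_memory(m, R):
    """Claim 6 restore: C = (m - 300) * R + 300."""
    return (m - 300) * R + 300
```

Scaling a value down and restoring it should reproduce the original, which is easy to check numerically.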
CN201810110273.XA 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling Active CN108491226B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810110273.XA CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling


Publications (2)

Publication Number Publication Date
CN108491226A CN108491226A (en) 2018-09-04
CN108491226B true CN108491226B (en) 2021-03-23

Family

ID=63344582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810110273.XA Active CN108491226B (en) 2018-02-05 2018-02-05 Spark configuration parameter automatic tuning method based on cluster scaling

Country Status (1)

Country Link
CN (1) CN108491226B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388565B (en) * 2018-09-27 2021-08-06 西安电子科技大学 Software system performance optimization method based on generating type countermeasure network
CN110134665B (en) * 2019-04-17 2021-05-25 北京百度网讯科技有限公司 Database self-learning optimization method and device based on flow mirror image
CN111259933B (en) * 2020-01-09 2023-06-13 中国科学院计算技术研究所 High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN111629048B (en) * 2020-05-22 2023-04-07 浪潮电子信息产业股份有限公司 spark cluster optimal configuration parameter determination method, device and equipment
CN112418311A (en) * 2020-11-21 2021-02-26 安徽理工大学 Distributed random forest method for risk assessment of communication network
CN114565001A (en) * 2020-11-27 2022-05-31 深圳先进技术研究院 Automatic tuning method for graph data processing framework based on random forest
CN113032367A (en) * 2021-03-24 2021-06-25 安徽大学 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system

Citations (5)

Publication number Priority date Publication date Assignee Title
CN103327118A (en) * 2013-07-09 2013-09-25 南京大学 Intelligent virtual machine cluster scaling method and system for web application in cloud computing
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN106844673A (en) * 2017-01-24 2017-06-13 山东亿海兰特通信科技有限公司 A kind of method and system based on the public security data acquisition intimate degree of multidimensional personnel
CN107360026A (en) * 2017-07-07 2017-11-17 西安电子科技大学 Distributed message performance of middle piece is predicted and modeling method

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10031747B2 (en) * 2015-12-15 2018-07-24 Impetus Technologies, Inc. System and method for registration of a custom component in a distributed computing pipeline
US10430725B2 (en) * 2016-06-15 2019-10-01 Akw Analytics Inc. Petroleum analytics learning machine system with machine learning analytics applications for upstream and midstream oil and gas industry


Non-Patent Citations (4)

Title
A Fireworks Algorithm for Modern Web Information Retrieval with Visual Results Mining;Hadj Ahmed Bouarara等;《International Journal of Swarm Intelligence Research》;20151231;第6卷(第3期);第1-23页 *
An Orthogonal Genetic Algorithm for QoS-Aware Service Composition;Bao Liang等;《COMPUTER JOURNAL》;20161231;第59卷(第12期);第1857-1871页 *
BigDataBench: an Open-source Big Data Benchmark Suite; Zhan Jianfeng et al.; Chinese Journal of Computers; 20160131; Vol. 39, No. 1; pp. 196-211 *
Research on Web Service Composition Technology Based on Functional Programming; Bao Liang; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20101015; p. I139-19 *


Similar Documents

Publication Publication Date Title
CN108491226B (en) Spark configuration parameter automatic tuning method based on cluster scaling
Hu et al. Sensitivity-guided metaheuristics for accurate discrete gate sizing
CN106648654A (en) Data sensing-based Spark configuration parameter automatic optimization method
JP2023522567A (en) Generation of integrated circuit layouts using neural networks
CN109388565B (en) Software system performance optimization method based on generating type countermeasure network
CN110647995A (en) Rule training method, device, equipment and storage medium
US11841839B1 (en) Preprocessing and imputing method for structural data
US11531831B2 (en) Managing machine learning features
CN110968564A (en) Data processing method and training method of data state prediction model
CN117236278B (en) Chip production simulation method and system based on digital twin technology
Jingbiao et al. Research and improvement of clustering algorithm in data mining
KR102352036B1 (en) Device and method for variable selection using stochastic gradient descent
CN113743453A (en) Population quantity prediction method based on random forest
CN112445746B (en) Automatic cluster configuration optimization method and system based on machine learning
Trushkowsky et al. Getting it all from the crowd
Faricha et al. The comparative study for predicting disease outbreak
CN105824976A (en) Method and device for optimizing word segmentation banks
US8666986B2 (en) Grid-based data clustering method
CN107491417A (en) A kind of document structure tree method under topic model based on particular division
CN111259117B (en) Short text batch matching method and device
Lu et al. On the auto-tuning of elastic-search based on machine learning
CN112507181B (en) Search request classification method, device, electronic equipment and storage medium
JP2006072820A (en) Device for analyzing combination optimization problem
Luo et al. Research on the anonymous customer segmentation model of telecom
CN117041073B (en) Network behavior prediction method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230602

Address after: Building 1, Science and Technology Innovation Service Center, No. 856 Zhongshan East Road, High tech Zone, Shijiazhuang City, Hebei Province, 050035

Patentee after: Hegang Digital Technology Co.,Ltd.

Address before: 710071 Taibai South Road, Yanta District, Xi'an, Shaanxi Province, No. 2

Patentee before: XIDIAN University