CN110048886B - Efficient cloud configuration selection algorithm for big data analysis task - Google Patents


Info

Publication number: CN110048886B (application number CN201910294273.4A)
Authority: CN (China)
Prior art keywords: task, data, input data, cloud configuration, experimental
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN110048886A
Inventors: 陈艳姣 (Chen Yanjiao), 林龙 (Lin Long)
Original and current assignee: Wuhan University (WHU)
Application filed by Wuhan University; priority to CN201910294273.4A; publication of application CN110048886A, grant published as CN110048886B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/08: Configuration management of networks or network elements
    • H04L 41/0803: Configuration setting
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an efficient cloud configuration selection algorithm for big data analysis tasks. A small-scale cluster experiment is run on a selected portion of the input data, a performance prediction model is built from the results, the model is used to estimate the task's performance on a large-scale cluster, and the optimal cloud configuration is determined from the prediction results. The algorithm effectively helps the user find the optimal cloud configuration with low model training time and cost. Selecting the optimal cloud configuration for large-scale data analysis tasks deployed on a cloud computing platform markedly improves operating efficiency and reduces operating cost.

Description

Efficient cloud configuration selection algorithm for big data analysis task
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to an efficient cloud configuration selection algorithm for big data analysis tasks.
Background
Large-scale data analysis tasks are growing in number and complexity, often involving machine learning, natural language processing, and image processing. Such tasks are typically data-intensive and computation-intensive, requiring longer computation times and higher computation costs than traditional computing tasks. Therefore, the huge computing power of cloud computing is usually leveraged to complete large-scale data analysis tasks. Selecting the optimal cloud configuration for a large-scale analysis task improves the task's operating efficiency and reduces the user's computing cost.
To meet different computing requirements, existing cloud service providers offer users hundreds of instance types with different resource configurations (e.g., Amazon's EC2, Microsoft Azure, and Google's Compute Engine). While most cloud service providers only allow users to select an instance type from a pool of available instance types, Google's Compute Engine allows users to custom-configure virtual machines (configurable vCPUs and memory), which makes selecting the correct cloud configuration even more challenging. In addition, the large cloud service providers also offer serverless cloud architectures (e.g., Amazon Lambda, Google Cloud Functions, and Microsoft Azure Functions), which let users run tasks as serverless functions without booting instances with pre-specified configurations. However, a serverless architecture may require an application to restructure its code, and serverless cloud providers cannot help users minimize task completion time or reduce computational costs.
The choice of cloud configuration, i.e. the choice of instance type and number of instances, directly affects the completion time and monetary cost of the task. A properly selected cloud configuration can achieve the same performance goal at lower cost. Because large-scale data analysis tasks have long runtimes, discovering potential cost savings is all the more important. Due to the diversity of tasks and the combinations of instance types and cluster sizes, the search space for cloud configurations is huge.
In such a huge search space, exhaustively searching for the optimal cloud configuration is neither practical nor scalable. To limit the search space, the CherryPick algorithm selects the best cloud configuration using limited task information. CherryPick is optimized for cost minimization, but cannot be used to optimize other goals, such as minimizing job completion time under a cost budget. In addition, Ernest and PARIS use a performance modeling approach to select cloud configurations. With such performance prediction models, a user may select different cloud configurations for tasks with different optimization objectives, e.g., the cheapest or the fastest configuration. However, Ernest requires training a prediction model for each instance type, and PARIS only selects the best instance type among multiple public clouds and cannot give the cluster size.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an efficient cloud configuration selection algorithm for a big data analysis task.
The technical scheme of the invention is an efficient cloud configuration selection algorithm for a big data analysis task, which comprises the following steps:
step 1: the training data collection phase, implemented as follows,
the training data collector will only perform specific instance type experiments on a small portion of the input data, which will be used to predict the performance of task execution over the entire input data. Training data collection includes experimental selection and experimental execution.
Experiment selection: in experiment selection, two important experimental parameters need to be determined: (1) the proportion, i.e. the ratio of the experimental data to the total input data; (2) the number of cloud server instances used when the task is executed. The invention adopts statistical techniques to select a subset of experimental parameter settings, favoring those that generate as much information as possible about the task's runtime performance, thereby ensuring high prediction accuracy. Let E_i = (x_i, y_i) denote an experimental parameter setting, where x_i is the number of instances and y_i is the input data ratio. Let M denote the total number of experimental parameter settings obtained by enumerating all possible proportions and instance counts. From E_i, a K-dimensional feature vector F_i can be computed, where each entry corresponds to a term in the prediction model; in this way, M feature vectors are obtained for all experimental settings. Following D-optimality, the experimental parameters are chosen to maximize the determinant of the weighted information matrix:

    max over α of det( sum_{i=1..M} α_i F_i F_i^T )
    subject to 0 ≤ α_i ≤ 1 for i ∈ [1, M], and sum_{i=1..M} α_i c_i ≤ B,

where α_i is the probability of selecting the i-th experimental setting and the budget constraint term B caps the total cost of the experiments; c_i is the cost of running experiment E_i under the cloud platform's pricing model, with runtime roughly proportional to y_i/x_i. After solving the above optimization problem, the M experimental settings are sorted by probability α_i in non-increasing order and the top settings are selected as training data. The first 10 settings are selected as training data in the present invention.
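The selection step above can be sketched in Python. This is a minimal sketch under stated assumptions: the candidate grid, the feature terms, the cost model, and the greedy approximation of the D-optimal objective are all illustrative (the invention solves for the probabilities α_i directly; greedy determinant maximization is a common practical surrogate).

```python
import numpy as np

def feature_vector(x, y):
    # Assumed candidate fitting terms: constant, y/x, sqrt(y)/x, log(x)
    return np.array([1.0, y / x, np.sqrt(y) / x, np.log(x)])

def select_experiments(candidates, budget, cost_fn, n_pick=10, ridge=1e-6):
    """Greedy D-optimal selection: repeatedly add the affordable candidate
    that most increases det(sum of F_i F_i^T)."""
    K = feature_vector(*candidates[0]).size
    A = ridge * np.eye(K)           # regularised information matrix
    chosen, spent = [], 0.0
    remaining = list(candidates)
    while remaining and len(chosen) < n_pick:
        best, best_gain = None, -np.inf
        for e in remaining:
            if spent + cost_fn(*e) > budget:
                continue            # budget constraint term B
            F = feature_vector(*e)
            gain = np.linalg.slogdet(A + np.outer(F, F))[1]
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break
        Fb = feature_vector(*best)
        A += np.outer(Fb, Fb)
        chosen.append(best)
        spent += cost_fn(*best)
        remaining.remove(best)
    return chosen

# Enumerate (instances, data ratio) settings as in the embodiment:
# 1 to 8 instances, 1% to 8% of the input data.
grid = [(x, y / 100.0) for x in range(1, 9) for y in range(1, 9)]
cost = lambda x, y: 0.1 * x * y     # hypothetical pricing model
picked = select_experiments(grid, budget=1.0, cost_fn=cost)
```

The 10 settings returned play the role of the top-ranked experiments selected as training data.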
Experiment execution: after the experimental settings are selected, it must be determined which data samples from the entire input data set compose the experimental data set so as to meet the specified proportion. The present invention uses random sampling to select data samples from the entire input data set, because random sampling avoids being trapped in isolated regions of the data set. After obtaining the small data set, a specified number of instances are deployed using the selected experimental setting and the task is started; afterwards, the experimental parameters and the task completion time are used as training data for building the prediction model.
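The random-sampling step can be sketched as follows; the data set stand-in and the seed are illustrative assumptions.

```python
import random

def sample_fraction(dataset, fraction, seed=7):
    """Uniform random sample (without replacement) of round(len * fraction)
    records, as the random-sampling step prescribes."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

records = list(range(100_000))            # stand-in for the full input data set
subset = sample_fraction(records, 0.02)   # a 2% experimental data set
```

Sampling uniformly over the whole data set (rather than taking a contiguous slice) is what avoids confining the experiment to an isolated region of the data.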
Step 2: the model construction phase, which is implemented as follows,
the model constructor is composed of a model constructor and a model converter. With the training data collected for a particular instance type, the model builder may build a base prediction model. Then, the model converter converts and derives the prediction models of the other example types according to the basic prediction model.
The model builder: when running experiments with subsets of the input data set on a specific instance type, T_base(x, y) denotes the task runtime given the number of instances x and the data set proportion y. Large-scale analysis tasks typically run in successive steps (i.e., iterations) until a termination condition is met. Each step consists essentially of two stages: concurrent computation and data communication. The computation time of task execution is related to the data set size, and there are several representative communication patterns in large-scale analysis tasks. Thus, the runtime of a large-scale analysis task can be inferred by resolving the computation time and the communication time. The main objective of the invention is to obtain the performance prediction function T_base(x, y) of a given task by analyzing its computation and communication patterns and designing fitting terms related to x and y.
Computation time: the user-defined iterative algorithm incurs a time cost for operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fitting terms, depending on the characteristics of the data set (e.g., dense or sparse) and the algorithm. Thus, the computation time is a function of the number of instances and the size of the data set.
Communication time: the time cost of transmitting data over the network to the target nodes. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Despite differences in programming models and execution mechanisms, these common communication patterns represent most communication scenarios in cluster applications. The communication time is mainly a function of the number of instances, and the fitting term of the function can be deduced from the task's communication pattern. For example, when the data size per instance is constant, the communication time grows linearly with the number of instances under the partition-aggregate communication pattern, but quadratically under the shuffle communication pattern.
Given all candidate fitting terms of T_base(x, y), mutual information is used as the selection criterion, excluding redundant terms and selecting only good predictors as fitting terms. Let F = {f_1, ..., f_K} denote the set of all candidate terms, where each f_k is a function of x and y determined by the computation and communication patterns. Given m training data samples collected at different instance counts and data sizes, the K-dimensional feature vector F_i = (f_{1,i}, ..., f_{K,i}) is first computed for each experimental setting, e.g. f_{k,i} = y_i/x_i. Then, the mutual information between each term and the runtime is computed, and the terms whose mutual information with the runtime is above a threshold are selected. From the m training runtime samples, the base prediction model is obtained by fitting

    T_base(x, y) = sum_{k=1..K} β_k w_k f_k(x, y),

where w_k is the weight of term f_k and β_k ∈ {0, 1} indicates whether fitting term f_k is selected (β_k = 1 means the term is selected).
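The fitting step can be sketched with SciPy's non-negative least-squares solver (the embodiment later names NNLS for this purpose). The candidate terms and the synthetic runtimes are illustrative assumptions, and the mutual-information filter is omitted for brevity.

```python
import numpy as np
from scipy.optimize import nnls

# Assumed candidate fitting terms f_k(x, y)
TERMS = [
    lambda x, y: 1.0,              # serial computation (constant)
    lambda x, y: y / x,            # computation linear in data size
    lambda x, y: np.sqrt(y) / x,   # sparse-data term
    lambda x, y: np.log(x),        # partition-aggregate communication
]

def fit_base_model(samples):
    """samples: list of (instances x, data ratio y, runtime)."""
    F = np.array([[t(x, y) for t in TERMS] for x, y, _ in samples])
    r = np.array([rt for _, _, rt in samples])
    w, _ = nnls(F, r)              # non-negative least squares
    return lambda x, y: sum(wk * t(x, y) for wk, t in zip(w, TERMS))

# Synthetic training runtimes drawn from a known model 5 + 20*y/x + 2*log(x)
settings = [(x, y / 100.0) for x in (1, 2, 4, 8) for y in (1, 2, 4, 8)]
data = [(x, y, 5 + 20 * y / x + 2 * np.log(x)) for x, y in settings]
T_base = fit_base_model(data)
```

Since the synthetic runtimes lie in the span of the candidate terms with non-negative weights, the fit recovers the generating model.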
A model converter: cloud providers typically offer a variety of instance families with different combinations of CPU, memory, hard disk and network capacity to meet the needs of different jobs, such as general purpose and compute/storage optimization. Given a task and a fixed data set, it is known through a large number of experiments that the runtime of one instance type can be converted to a different instance type according to a simple mapping. Therefore, the prediction model is constructed without the need to experiment for each instance type to obtain training data, which greatly reduces training time and training costs.
The converter Φ is the mapping Φ: T_base(x, y) → T_target(x, y) from the base prediction model to the target prediction model. Comparing the runtimes of different instance types on the same task and data set scale shows that the classes of fitting terms in the prediction function are similar. In other words, for the same task and data set size, if f_k is contained in T_base(x, y), then T_target(x, y) is likely to include f_k as well. This is mainly because the computation and communication patterns of the task remain essentially unchanged for the same application configuration and number of instances. However, under different instance types the weight of each term differs, so attention must be paid to the weight mapping from the base prediction model to the target prediction model. The invention adopts a simple and effective mapping method. Let E* denote the lowest-cost setting among the experiments selected by the training data collector, with running time t_base on the base instance type. Experiment E* is then run on the target instance type to obtain the running time t_target. The model converter derives the prediction model of the target instance type as

    T_target(x, y) = θ · T_base(x, y), where θ = t_target / t_base.
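The converter reduces to a single scaling factor, which can be sketched as follows; the toy base model and the two measured runtimes of E* are hypothetical numbers.

```python
def convert_model(T_base, t_base, t_target):
    """Map the base-instance-type model to a target instance type by
    scaling with theta = t_target / t_base, per the converter above."""
    theta = t_target / t_base
    return lambda x, y: theta * T_base(x, y)

# Toy base model; the runtimes below are hypothetical measurements of E*.
base = lambda x, y: 100.0 * y / x + 30.0
c5_model = convert_model(base, t_base=50.0, t_target=40.0)  # theta = 0.8
```

Only one extra experiment per target instance type is needed, which is what keeps the training cost low.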
Step 3: the selector is constructed as follows.
The runtime prediction models for all instance types are integrated into a single runtime predictor T(x, y), where x is the cloud configuration vector consisting of the instance type and the number of instances. For a given task and input data set, the goal is to enable the user to find the preferred cloud configuration that meets certain runtime and cost constraints. Let P(x) be the price per unit time of cloud configuration x, i.e. the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can be expressed as

    x* = argmin over x of C(x, y) subject to R(y), where C(x, y) = P(x) × T(x, y) and 0 ≤ y ≤ 1,

where C(x, y) is the total cost of running the task under cloud configuration x, and R(y) is a user-added constraint such as a maximum tolerated runtime or a maximum tolerated cost. The selector S is determined by the user and selects the best cloud configuration x* that meets the desired performance or cost.
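Because the candidate configurations are a finite grid of instance types and counts, the selection can be sketched as a direct enumeration; the toy predictors, prices, and runtime bound are illustrative assumptions.

```python
def select_config(predictors, prices, n_range, max_runtime):
    """predictors: {instance_type: T(x, y)} runtime models (seconds);
    prices: $/hour per single instance. Return the cheapest (type, count)
    whose predicted full-data runtime meets the user constraint."""
    best, best_cost = None, float("inf")
    for itype, T in predictors.items():
        for n in n_range:
            t = T(n, 1.0)                       # full input data, y = 1
            cost = prices[itype] * n * t / 3600.0
            if t <= max_runtime and cost < best_cost:
                best, best_cost = (itype, n), cost
    return best, best_cost

# Hypothetical runtime models for two instance types
predictors = {
    "m4.large": lambda n, y: 7200.0 * y / n,
    "c5.large": lambda n, y: 9000.0 * y / n,
}
prices = {"m4.large": 0.1, "c5.large": 0.085}
choice, total_cost = select_config(predictors, prices, range(1, 9),
                                   max_runtime=3600.0)
```

With these toy models the cheapest configuration meeting the one-hour bound is two m4.large instances; swapping the objective and constraint (fastest within a budget) is a one-line change to the comparison.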
Drawings
FIG. 1 is a diagram of the communication patterns of the present invention.
FIG. 2 is the overall design block diagram of the present invention.
FIG. 3 compares the effectiveness of the present invention.
FIG. 4 shows the prediction accuracy on Spark of the present invention.
FIG. 5 shows the prediction accuracy on Hadoop of the present invention.
FIG. 6 compares the total task time and the model training time of the present invention.
FIG. 7 shows the prediction accuracy of TeraSort for different data set sizes.
FIG. 8 shows the cost of WordCount on different instance types.
FIG. 9 shows the completion time of TeraSort and WordCount on different cluster sizes.
Detailed Description
The invention provides an efficient cloud configuration selection framework for big data analysis tasks based on their computation and communication patterns, so that a user can find the cloud configuration suited to a given task, greatly reducing the computing cost of large-scale data analysis. The framework builds the prediction model through a small number of experiments using little input data and small-scale clusters, can convert the prediction model of one instance type into that of another with few extra experiments, and thereby lets a cloud computing user determine the optimal cloud configuration at low cost.
Referring to Fig. 2, the embodiment explains the process of the present invention by taking a cloud configuration selection algorithm (named Silhouette) for a big data analysis task implemented on Amazon Web Services (AWS) as an example, as follows:
step 1: the training data collection phase, implemented as follows,
the training data collector will only perform specific instance type experiments on a small portion of the input data, which will be used to predict the performance of task execution over the entire input data. Training data collection includes experimental selection and experimental execution.
Experiment selection: in experiment selection, two important experimental parameters need to be determined: (1) the proportion, i.e. the ratio of the experimental data to the total input data; (2) the number of cloud server instances used when the task is executed. In the embodiment, statistical techniques are used to select a subset of experimental parameter settings, favoring those that generate as much information as possible about the task's runtime performance, thereby ensuring high prediction accuracy. Let E_i = (x_i, y_i) denote an experimental parameter setting, where x_i is the number of instances and y_i is the input data ratio. Let M denote the total number of experimental parameter settings obtained by enumerating all possible proportions and instance counts. From E_i, we can compute the K-dimensional feature vector F_i, where each entry corresponds to a term in the prediction model; in this way we obtain M feature vectors for all experimental settings. Following D-optimality, we choose the experimental parameters that maximize the determinant of the weighted information matrix:

    max over α of det( sum_{i=1..M} α_i F_i F_i^T )
    subject to 0 ≤ α_i ≤ 1 for i ∈ [1, M], and sum_{i=1..M} α_i c_i ≤ B,

where α_i is the probability of selecting the i-th experimental setting and the budget constraint term B caps the total cost of the experiments; c_i is the cost of running experiment E_i under the cloud platform's pricing model, with runtime roughly proportional to y_i/x_i. After solving the above optimization problem, the M experimental settings are ranked by probability α_i in non-increasing order to select the experiments.
Experiment execution: after the experimental settings are selected, it must be determined which data samples from the entire input data set compose the experimental data set so as to meet the specified proportion. The present invention employs random sampling to select data samples from the entire input data set, because random sampling avoids being trapped in isolated regions of the data set. After obtaining the small data set, a specified number of instances are deployed using the selected experimental setting and the task is started; afterwards, the experimental parameters and the task completion time are used as training data for building the prediction model.
The specific implementation of the examples is illustrated below:
The large-scale data analysis processing engines used in the examples are Spark and Hadoop. On Spark, we run three Spark MLlib-based machine learning tasks: classification, regression, and clustering. The classification algorithm uses the text classification benchmark data set rcv1 with 44,000 features; the regression and clustering algorithms use a synthetic data set of 1 million samples with 44,000 features. On Hadoop, the TeraSort and WordCount algorithms are run separately. TeraSort is a common benchmark application for large-scale data analysis whose main task is to sort randomly generated records, here using a data set of 200 million samples; WordCount is used to count the frequency of words in 55 million entries from Wikipedia articles.
From the EC2 instance type pool of AWS, m4.large (general purpose), c5.large (compute optimized), r4.large (memory optimized) and i3.large (storage optimized) were selected; each instance type has 2 vCPUs and a pre-installed Linux system. The data analysis processing engines used in the experiments are Apache Spark 2.2 and Hadoop 2.8, respectively. Table 1 lists the configuration and price of each instance type.
TABLE 1

Instance type | Memory (GiB) | Instance storage | Price (USD/hour)
m4.large      | 8            | EBS              | 0.1
c5.large      | 4            | EBS              | 0.085
r4.large      | 15.25        | EBS              | 0.133
i3.large      | 15.25        | SSD              | 0.156
First, the data sizes for the modeling experiments are set to 1% to 8% of the input data, and the experimental cluster size is limited to 1 to 8 instances; in the example, the 10 settings with the largest probabilities α_i are selected as experiments. When selecting the input data samples, an initial seed sample is randomly chosen from the input data set; then, at each sampling step, another sample is drawn at random; this process is repeated until the number of selected samples meets the proportion required by the experimental parameters. In the embodiment, m4.large is used as the base instance type, so the randomly sampled data set is finally run on an m4.large cluster of the scale specified in the experimental parameters, and the runtime is recorded.
Step 2: the model construction phase, which is implemented as follows,
the model constructor is composed of a model constructor and a model converter. With the training data collected for a particular instance type, the model builder may build a base prediction model. Then, the model converter converts and derives the prediction models of the other example types according to the basic prediction model.
The model builder: when running experiments with subsets of the input data set on a specific instance type, T_base(x, y) denotes the task runtime given the number of instances x and the data set proportion y. Large-scale analysis tasks typically run in successive steps (i.e., iterations) until a termination condition is met. Each step consists essentially of two stages: concurrent computation and data communication. The computation time of task execution is related to the data set size, and there are several representative communication patterns in large-scale analysis tasks. Thus, the runtime of a large-scale analysis task can be inferred by resolving the computation time and the communication time. The main goal of this embodiment is to obtain the performance prediction function T_base(x, y) of a given task by analyzing its computation and communication patterns and designing fitting terms related to x and y.
Computation time: the user-defined iterative algorithm incurs a time cost for operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fitting terms, depending on the characteristics of the data set (e.g., dense or sparse) and the algorithm. Thus, the computation time is a function of the number of instances and the size of the data set. To determine the exact fitting terms of the function, specific domain knowledge needs to be incorporated.
Communication time: the time cost of transmitting data over the network to the target nodes. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Despite differences in programming models and execution mechanisms, these common communication patterns represent most communication scenarios in cluster applications. The communication time is mainly a function of the number of instances, and the fitting term of the function can be deduced from the task's communication pattern. For example, when the data size per instance is constant, the communication time grows linearly with the number of instances under the partition-aggregate communication pattern, but quadratically under the shuffle communication pattern.
Given all candidate fitting terms of T_base(x, y), we use mutual information as the selection criterion, excluding redundant terms and selecting only good predictors as fitting terms. Let F = {f_1, ..., f_K} denote the set of all candidate terms, where each f_k is a function of x and y determined by the computation and communication patterns. Given m training data samples collected at different instance counts and data sizes, the K-dimensional feature vector F_i = (f_{1,i}, ..., f_{K,i}) is first computed for each experimental setting, e.g. f_{k,i} = y_i/x_i. Then, we compute the mutual information between each term and the runtime, and select the terms whose mutual information with the runtime is above a threshold. From the m training runtime samples, the base prediction model is obtained by fitting

    T_base(x, y) = sum_{k=1..K} β_k w_k f_k(x, y),

where w_k is the weight of term f_k and β_k ∈ {0, 1} indicates whether fitting term f_k is selected (β_k = 1 means the term is selected).
A model converter: cloud providers typically offer a variety of instance families with different combinations of CPU, memory, hard disk and network capacity to meet the needs of different jobs, such as general purpose and compute/storage optimization. Through a number of experiments, we have found that given a task and a fixed data set, one instance type's runtime can be transformed to a different instance type according to a simple mapping. Therefore, the prediction model is constructed without the need to experiment for each instance type to obtain training data, which greatly reduces training time and training costs.
The converter Φ is the mapping Φ: T_base(x, y) → T_target(x, y) from the base prediction model to the target prediction model. By comparing the runtimes of different instance types on the same task and data set scale, it is known from the above that the classes of fitting terms in the prediction function are similar. In other words, for the same task and data set size, if f_k is contained in T_base(x, y), then T_target(x, y) is likely to include f_k as well. This is mainly because the computation and communication patterns of the task remain essentially unchanged for the same application configuration and number of instances. However, under different instance types the weight of each term differs, so we need to focus on the weight mapping from the base prediction model to the target prediction model. We adopt a simple and effective mapping method. Let E* denote the lowest-cost setting among the experiments selected by the training data collector, with running time t_base on the base instance type. We run experiment E* on the target instance type to obtain the running time t_target. The model converter then derives the prediction model of the target instance type as

    T_target(x, y) = θ · T_base(x, y), where θ = t_target / t_base.
The specific implementation of the example is as follows:
In the embodiment, the fitting terms added to the prediction function are: a constant term, the linear term y/x, and a term combining the square root of the data size with the number of instances, √y/x. The fixed constant represents the time spent in serial computation; for algorithms whose computation time is linear in the data set size, the fitting term y/x (data proportion over number of instances) is added; for sparse data sets, the fitting term √y/x is added.
TABLE 2

Communication mode   | Structure       | Fitting term
Parallel read/write  | Many One-to-One | x
Partition-aggregate  | Many-to-One     | log x
Broadcast            | One-to-Many     | x
Collect              | Many-to-One     | x
Shuffle              | Many-to-Many    | x^2
Global communication | All-to-All      | x^2
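The mapping from communication mode to its fitting term in the number of instances x can be transcribed directly from Table 2; the dictionary keys are illustrative names.

```python
import math

# Communication mode -> fitting term in the number of instances x (Table 2)
COMM_FIT_TERMS = {
    "parallel_read_write":  lambda x: x,            # Many One-to-One
    "partition_aggregate":  lambda x: math.log(x),  # Many-to-One
    "broadcast":            lambda x: x,            # One-to-Many
    "collect":              lambda x: x,            # Many-to-One
    "shuffle":              lambda x: x ** 2,       # Many-to-Many
    "global_communication": lambda x: x ** 2,       # All-to-All
}
```

Identifying a task's dominant communication mode thus decides which of x, log x, or x^2 enters the candidate term set.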
In the embodiment, according to the communication patterns of the different tasks, the communication fitting terms shown in Table 2 are used, namely x, log x, and x^2. After all terms are selected, the base prediction model is computed using a non-negative least squares (NNLS) solver. Thereafter, the lowest-cost experimental setting of the base experiments is selected and the task is run on the target instance type with the same setting. Finally, the prediction models of all instance types are derived as T_target(x, y) = θ · T_base(x, y).
And step 3: the selector is constructed in the following way,
all the fruits are put togetherThe runtime prediction model for instance types is integrated into a single runtime predictor T (x, y), where x is a cloud configuration vector consisting of the type and number of instances. For a given input data set of tasks, the goal is to enable the user to find the most preferred cloud configuration that meets certain runtime and cost constraints. Let p (x) be the price per unit time of the cloud configuration x, i.e. the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can be expressed as x*(x, y), c (x), Ry, where Cx (Px × Tx), y,0 ≦ y ≦ 1
where C(x) is the total cost of running the task under cloud configuration x, and R(y) is a user-added constraint such as a maximum tolerated runtime or a maximum tolerated cost. The selector S is specified by the user and selects the best cloud configuration x* that meets the desired performance or cost.
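The selection problem above can be sketched as an exhaustive search over the candidate configurations. The prices, the toy predictor, and the runtime bound below are hypothetical stand-ins for P(x), T(x, y), and R(y):

```python
def select_config(candidates, price_per_hour, predict_runtime, y, max_runtime):
    """Pick the cheapest cloud configuration meeting a runtime constraint.

    candidates: iterable of (instance_type, count) pairs.
    price_per_hour: dict mapping instance_type to unit price P.
    predict_runtime: function (instance_type, count, y) -> predicted hours T.
    """
    best, best_cost = None, float("inf")
    for itype, count in candidates:
        t = predict_runtime(itype, count, y)
        if t > max_runtime:                         # constraint R(y)
            continue
        cost = price_per_hour[itype] * count * t    # C = P(x) * T(x, y)
        if cost < best_cost:
            best, best_cost = (itype, count), cost
    return best, best_cost

# Hypothetical usage: a toy predictor whose runtime halves per doubling of count.
prices = {"m4.large": 0.10, "c5.large": 0.085}
pred = lambda itype, n, y: (8.0 if itype == "m4.large" else 6.0) * y / n
cfg, cost = select_config([(t, n) for t in prices for n in (1, 2, 4)],
                          prices, pred, 1.0, 4.0)
```

With these toy numbers the single-instance configurations violate the 4-hour bound, and the cheaper compute-optimized type is chosen among the remaining candidates.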
A specific implementation of the embodiment is as follows:
After prediction models for all tasks on all candidate instance types have been obtained, the optimal cloud configuration with the lowest running cost must be found; the chosen configuration should allow the task to complete in the shortest time under a given cost budget. For the examples, the algorithm was evaluated on four criteria: effectiveness, prediction accuracy, training cost, and application extensibility.
Effectiveness: the performances of SILHOUETTE and Ernest in 5 tasks were compared. Fig. 3(a) shows that the prediction accuracy of silent is comparable to that of Ernest, and fig. 3(b) shows that the training time and training cost of silent is much lower than Ernest. When we build the predictive model for 2 cases, silent can save 25% of training time and 30% of cost. As can be seen from fig. 3(c), when there are more candidate instance types, the training time and training cost of silent is much lower than Ernest, and when there are 5 candidate instance types, the training time of silent and Ernest is 25 minutes and 83 minutes, respectively. When there are more candidate instance types, it is expected that silent performs better.
Prediction accuracy: Figs. 4 and 5 show that both the basic prediction model for m4.large and the transformed prediction model for c5.large achieve high accuracy, confirming the effectiveness of the model converter in SILHOUETTE.
Training cost: SILHOUETTE aims to find the best cloud configuration with low overhead. The time spent collecting training data to build the basic prediction model is therefore compared with the completion time of the entire task. Fig. 6 shows that for all applications except TeraSort, the training time of SILHOUETTE is less than 20% of the total completion time.
Application extensibility: SILHOUETTE uses the same experimental settings to construct the basic and transformed prediction models and evaluates their prediction accuracy on data sets of different sizes. Fig. 7 shows that the prediction error stays below 15% when the data set is scaled to 1.5, 2, 2.5, and 3 times its original size, indicating that the prediction models built by SILHOUETTE maintain high accuracy even as the data set size changes.
In this embodiment SILHOUETTE is used to select the best cloud configuration for WordCount. Considering the four instance types in Table 1, assume the selector's optimization objective is: given a maximum task completion time, minimize the overall cost. Fig. 8 shows the total time and cost of running the task over the data set with each instance type. The total time of the compute-optimized instance type c5.large is comparable to that of the storage-optimized instance type i3.large, and SILHOUETTE chooses the former, which costs less.
SILHOUETTE can then be used to determine the best number of instances for a given instance type. Consider two tasks, TeraSort and WordCount. Fig. 9 shows the run times of the two tasks at different cluster sizes; the run time predicted by SILHOUETTE is very close to the actual run time, so a specific cluster size can be selected.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications, additions, or substitutions to the described embodiments without departing from the spirit of the invention or the scope defined by the appended claims.

Claims (5)

1. The efficient cloud configuration selection algorithm for the big data analysis task is characterized by comprising the following steps of:
step 1: collecting training data: selecting several input data proportions and the numbers of cloud server instances used when the tasks corresponding to those proportions are executed, and recording each group of test parameters together with the task completion time, wherein a proportion refers to the fraction of the input data used in the experiment, the proportion range being specifically 1%-10% of the input data;
step 2: constructing a model: using the test parameters and task completion times from step 1, designing a fitting polynomial in the input data proportion and the number of instances, and determining the basic prediction model

T_base(x, y) = Σ_k β_k w_k f_k(x, y),  β_k ∈ {0, 1},  w_k ≥ 0,

wherein β_k indicates whether the fitting term f_k is selected (β_k = 1 denotes selecting this term), and f_k represents a candidate function of x and y determined by the computation and communication modes;
model conversion: running, on the target instance type, the experimental setting from step 1 that consumes the least time, the obtained running time being t_target; by mapping, the prediction model of the target instance type is derived as

T_target(x, y) = μ · T_base(x, y),  wherein μ = t_target / t_base,

and t_base is the running time of the same experimental setting on the base instance type;
step 3: constructing the selector: for a given input data set of the task, computing, using the prediction models obtained in step 2, the preferred cloud configuration that meets the given runtime and cost constraints.
2. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 1, wherein:
the specific process in step 1 of selecting several input data proportions and the numbers of cloud server instances used when the tasks corresponding to those proportions are executed is as follows:
firstly, input data proportions within a certain range and cloud server instance numbers within a certain range are selected, and, according to D-optimality, the experimental parameters maximizing the weighted sum of covariance matrices Σ_{i=1}^{M} α_i F_i F_i^T are selected, i.e.

max_α log det( Σ_{i=1}^{M} α_i F_i F_i^T )

subject to 0 ≤ α_i ≤ 1, i ∈ [1, M], and Σ_{i=1}^{M} α_i (y_i / x_i) ≤ B,

wherein α_i denotes the probability of selecting experimental setting i, x_i is the number of instances, y_i is the proportion of input data, M represents the total number of experimental parameter settings obtained by enumerating all possible proportions and instance numbers, and F_i represents a feature vector;
the budget constraint term B bounds the total cost of the experiments, y_i/x_i being the cost of running experiment E_i under the pricing model of the cloud platform;
the M experimental settings are sorted in non-increasing order of the probability α_i, and the top-ranked experimental parameter groups in the ordering are selected as training data;
wherein the range of the number of cloud server instances is 1-10.
3. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 2, wherein: the experimental parameter groups ranked in the top 10 of the non-increasing ordering are selected as training data.
4. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 1, wherein:
the input data of a certain proportion described in step 1 is selected from the whole input data set by random sampling.
5. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 1, wherein: fitting terms in the model construction involve computation and communication time.
CN201910294273.4A 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task Active CN110048886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910294273.4A CN110048886B (en) 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task


Publications (2)

Publication Number Publication Date
CN110048886A CN110048886A (en) 2019-07-23
CN110048886B (en) 2020-05-12

Family

ID=67277094


Country Status (1)

Country Link
CN (1) CN110048886B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301067A (en) * 2020-04-01 2021-08-24 阿里巴巴集团控股有限公司 Cloud configuration recommendation method and device for machine learning application
CN115118592B (en) * 2022-06-15 2023-08-08 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator feature analysis

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108053026A (en) * 2017-12-08 2018-05-18 武汉大学 A kind of mobile application background request adaptive scheduling algorithm

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN103220337B (en) * 2013-03-22 2015-10-21 合肥工业大学 Based on the cloud computing resources Optimal Configuration Method of self adaptation controller perturbation
US10043194B2 (en) * 2014-04-04 2018-08-07 International Business Machines Corporation Network demand forecasting
CN109088747A (en) * 2018-07-10 2018-12-25 郑州云海信息技术有限公司 The management method and device of resource in cloud computing system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant