CN110048886B - Efficient cloud configuration selection algorithm for big data analysis task - Google Patents


Info

Publication number: CN110048886B (application number CN201910294273.4A)
Authority: CN (China)
Prior art keywords: task, data, input data, cloud configuration, experimental
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN110048886A
Inventors: 陈艳姣 (Chen Yanjiao), 林龙 (Lin Long)
Original and current assignee: Wuhan University (WHU)
Application filed by Wuhan University; priority to CN201910294273.4A; publication of application CN110048886A, grant published as CN110048886B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/08: Configuration management of networks or network elements
    • H04L 41/0803: Configuration setting
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an efficient cloud configuration selection algorithm for big data analysis tasks. A small-scale cluster experiment is run on a selected portion of the input data, a performance prediction model is built from the results, the model is used to estimate the task's performance on a large-scale cluster, and the optimal cloud configuration is determined from the prediction results. The algorithm effectively helps the user find the optimal cloud configuration with low model training time and cost. Selecting the optimal cloud configuration for large-scale data analysis tasks deployed on a cloud computing platform markedly improves operating efficiency and reduces operating cost.

Description

Efficient cloud configuration selection algorithm for big data analysis task
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to an efficient cloud configuration selection algorithm for big data analysis tasks.
Background
Large-scale data analysis tasks are growing in number and complexity, often involving machine learning, natural language processing, and image processing. Such tasks are typically data-intensive and computation-intensive, requiring longer computation times and higher computation costs than traditional computing tasks. Therefore, the huge computing power of cloud computing is usually leveraged to complete large-scale data analysis tasks. Selecting the optimal cloud configuration for a large-scale analysis task improves the task's operating efficiency and reduces the user's computing cost.
To meet different computing requirements, existing cloud service providers offer users hundreds of instance types with different resource configurations (e.g., Amazon's EC2, Microsoft Azure, and Google's Compute Engine). While most cloud service providers only allow users to select an instance type from a pool of available instance types, Google's Compute Engine allows users to custom-configure virtual machines (configurable vCPUs and memory), which makes selecting the correct cloud configuration even more challenging. In addition, the large cloud service providers also offer serverless cloud architectures (e.g., Amazon Lambda, Google Cloud Functions, and Microsoft Azure Functions), which let users run tasks as serverless functions without booting instances with pre-specified configurations. However, a serverless architecture may require an application to restructure its code, and serverless cloud providers cannot help users minimize task completion time or reduce computational costs.
The choice of cloud configuration, i.e. the choice of instance type and number of instances, directly affects the completion time and monetary cost of the task. A properly selected cloud configuration can achieve the same performance goal at lower cost. Because large-scale data analysis tasks have long runtimes, discovering potential cost savings is all the more important. Due to the diversity of tasks and the combinations of instance types and cluster sizes, the search space for cloud configurations is huge.
In such a huge search space, exhaustively searching for the optimal cloud configuration is neither practical nor scalable. To limit the search space, the CherryPick algorithm selects the best cloud configuration using limited task information. CherryPick is optimized for cost minimization, but cannot be used to optimize other goals, such as minimizing job completion time under a cost budget. In addition, Ernest and PARIS use a performance modeling approach to select cloud configurations. With such performance prediction models, a user may select different cloud configurations for tasks with different optimization objectives, e.g., the cheapest or the fastest configuration. However, Ernest requires training a prediction model for each instance type, and PARIS only selects the best instance type among multiple public clouds and cannot give the cluster size.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an efficient cloud configuration selection algorithm for a big data analysis task.
The technical scheme of the invention is an efficient cloud configuration selection algorithm for a big data analysis task, which comprises the following steps:
step 1: the training data collection phase, implemented as follows,
the training data collector will only perform specific instance type experiments on a small portion of the input data, which will be used to predict the performance of task execution over the entire input data. Training data collection includes experimental selection and experimental execution.
Experiment selection: in experiment selection, two important experimental parameters need to be determined: (1) the proportion, i.e. the ratio of the experimental data to the total input data; (2) the number of cloud server instances used when the task is executed. The invention adopts statistical techniques to select a subset of experimental parameter settings, favoring those that generate as much information as possible about the task's runtime performance, thereby ensuring high prediction accuracy. Let E_i = (x_i, y_i) denote an experimental parameter setting, where x_i is the number of instances and y_i is the input data ratio. Let M denote the total number of experimental parameter settings obtained by enumerating all possible proportions and instance counts. From E_i, a K-dimensional feature vector F_i can be computed, where each entry corresponds to a term in the prediction model; in this way, M feature vectors are obtained for all experimental settings. Following D-optimality, the experimental parameters are chosen to maximize the determinant of the weighted information matrix:

    max over α of det( sum_{i=1..M} α_i F_i F_i^T )
    subject to 0 ≤ α_i ≤ 1 for i ∈ [1, M], and sum_{i=1..M} α_i c_i ≤ B,

where α_i is the probability of selecting the i-th experimental setting and the budget constraint term B caps the total cost of the experiments; c_i is the cost of running experiment E_i under the cloud platform's pricing model, with runtime roughly proportional to y_i/x_i. After solving the above optimization problem, the M experimental settings are sorted by probability α_i in non-increasing order and the top settings are selected as training data. The first 10 settings are selected as training data in the present invention.
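The selection step above can be sketched in Python. This is a minimal sketch under stated assumptions: the candidate grid, the feature terms, the cost model, and the greedy approximation of the D-optimal objective are all illustrative (the invention solves for the probabilities α_i directly; greedy determinant maximization is a common practical surrogate).

```python
import numpy as np

def feature_vector(x, y):
    # Assumed candidate fitting terms: constant, y/x, sqrt(y)/x, log(x)
    return np.array([1.0, y / x, np.sqrt(y) / x, np.log(x)])

def select_experiments(candidates, budget, cost_fn, n_pick=10, ridge=1e-6):
    """Greedy D-optimal selection: repeatedly add the affordable candidate
    that most increases det(sum of F_i F_i^T)."""
    K = feature_vector(*candidates[0]).size
    A = ridge * np.eye(K)           # regularised information matrix
    chosen, spent = [], 0.0
    remaining = list(candidates)
    while remaining and len(chosen) < n_pick:
        best, best_gain = None, -np.inf
        for e in remaining:
            if spent + cost_fn(*e) > budget:
                continue            # budget constraint term B
            F = feature_vector(*e)
            gain = np.linalg.slogdet(A + np.outer(F, F))[1]
            if gain > best_gain:
                best, best_gain = e, gain
        if best is None:
            break
        Fb = feature_vector(*best)
        A += np.outer(Fb, Fb)
        chosen.append(best)
        spent += cost_fn(*best)
        remaining.remove(best)
    return chosen

# Enumerate (instances, data ratio) settings as in the embodiment:
# 1 to 8 instances, 1% to 8% of the input data.
grid = [(x, y / 100.0) for x in range(1, 9) for y in range(1, 9)]
cost = lambda x, y: 0.1 * x * y     # hypothetical pricing model
picked = select_experiments(grid, budget=1.0, cost_fn=cost)
```

The 10 settings returned play the role of the top-ranked experiments selected as training data.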
Experiment execution: after the experimental settings are selected, it must be determined which data samples from the entire input data set compose the experimental data set so as to meet the specified proportion. The present invention uses random sampling to select data samples from the entire input data set, because random sampling avoids being trapped in isolated regions of the data set. After obtaining the small data set, a specified number of instances are deployed using the selected experimental setting and the task is started; afterwards, the experimental parameters and the task completion time are used as training data for building the prediction model.
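The random-sampling step can be sketched as follows; the data set stand-in and the seed are illustrative assumptions.

```python
import random

def sample_fraction(dataset, fraction, seed=7):
    """Uniform random sample (without replacement) of round(len * fraction)
    records, as the random-sampling step prescribes."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * fraction))
    return rng.sample(dataset, k)

records = list(range(100_000))            # stand-in for the full input data set
subset = sample_fraction(records, 0.02)   # a 2% experimental data set
```

Sampling uniformly over the whole data set (rather than taking a contiguous slice) is what avoids confining the experiment to an isolated region of the data.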
Step 2: the model construction phase, which is implemented as follows,
the model constructor is composed of a model constructor and a model converter. With the training data collected for a particular instance type, the model builder may build a base prediction model. Then, the model converter converts and derives the prediction models of the other example types according to the basic prediction model.
The model builder: when running experiments with subsets of the input data set on a specific instance type, T_base(x, y) denotes the task runtime given the number of instances x and the data set proportion y. Large-scale analysis tasks typically run in successive steps (i.e., iterations) until a termination condition is met. Each step consists essentially of two stages: concurrent computation and data communication. The computation time of task execution is related to the data set size, and there are several representative communication patterns in large-scale analysis tasks. Thus, the runtime of a large-scale analysis task can be inferred by resolving the computation time and the communication time. The main objective of the invention is to obtain the performance prediction function T_base(x, y) of a given task by analyzing its computation and communication patterns and designing fitting terms related to x and y.
Computation time: the user-defined iterative algorithm incurs a time cost for operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fitting terms, depending on the characteristics of the data set (e.g., dense or sparse) and the algorithm. Thus, the computation time is a function of the number of instances and the size of the data set.
Communication time: the time cost of transmitting data over the network to the target nodes. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Despite differences in programming models and execution mechanisms, these common communication patterns represent most communication scenarios in cluster applications. The communication time is mainly a function of the number of instances, and the fitting term of the function can be deduced from the task's communication pattern. For example, when the data size per instance is constant, the communication time grows linearly with the number of instances under the partition-aggregate communication pattern, but quadratically under the shuffle communication pattern.
Given all candidate fitting terms of T_base(x, y), mutual information is used as the selection criterion, excluding redundant terms and selecting only good predictors as fitting terms. Let F = {f_1, ..., f_K} denote the set of all candidate terms, where each f_k is a function of x and y determined by the computation and communication patterns. Given m training data samples collected at different instance counts and data sizes, the K-dimensional feature vector F_i = (f_{1,i}, ..., f_{K,i}) is first computed for each experimental setting, e.g. f_{k,i} = y_i/x_i. Then, the mutual information between each term and the runtime is computed, and the terms whose mutual information with the runtime is above a threshold are selected. From the m training runtime samples, the base prediction model is obtained by fitting

    T_base(x, y) = sum_{k=1..K} β_k w_k f_k(x, y),

where w_k is the weight of term f_k and β_k ∈ {0, 1} indicates whether fitting term f_k is selected (β_k = 1 means the term is selected).
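The fitting step can be sketched with SciPy's non-negative least-squares solver (the embodiment later names NNLS for this purpose). The candidate terms and the synthetic runtimes are illustrative assumptions, and the mutual-information filter is omitted for brevity.

```python
import numpy as np
from scipy.optimize import nnls

# Assumed candidate fitting terms f_k(x, y)
TERMS = [
    lambda x, y: 1.0,              # serial computation (constant)
    lambda x, y: y / x,            # computation linear in data size
    lambda x, y: np.sqrt(y) / x,   # sparse-data term
    lambda x, y: np.log(x),        # partition-aggregate communication
]

def fit_base_model(samples):
    """samples: list of (instances x, data ratio y, runtime)."""
    F = np.array([[t(x, y) for t in TERMS] for x, y, _ in samples])
    r = np.array([rt for _, _, rt in samples])
    w, _ = nnls(F, r)              # non-negative least squares
    return lambda x, y: sum(wk * t(x, y) for wk, t in zip(w, TERMS))

# Synthetic training runtimes drawn from a known model 5 + 20*y/x + 2*log(x)
settings = [(x, y / 100.0) for x in (1, 2, 4, 8) for y in (1, 2, 4, 8)]
data = [(x, y, 5 + 20 * y / x + 2 * np.log(x)) for x, y in settings]
T_base = fit_base_model(data)
```

Since the synthetic runtimes lie in the span of the candidate terms with non-negative weights, the fit recovers the generating model.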
A model converter: cloud providers typically offer a variety of instance families with different combinations of CPU, memory, hard disk and network capacity to meet the needs of different jobs, such as general purpose and compute/storage optimization. Given a task and a fixed data set, it is known through a large number of experiments that the runtime of one instance type can be converted to a different instance type according to a simple mapping. Therefore, the prediction model is constructed without the need to experiment for each instance type to obtain training data, which greatly reduces training time and training costs.
The converter Φ is the mapping Φ: T_base(x, y) → T_target(x, y) from the base prediction model to the target prediction model. Comparing the runtimes of different instance types on the same task and data set scale shows that the classes of fitting terms in the prediction function are similar. In other words, for the same task and data set size, if f_k is contained in T_base(x, y), then T_target(x, y) is likely to include f_k as well. This is mainly because the computation and communication patterns of the task remain essentially unchanged for the same application configuration and number of instances. However, under different instance types the weight of each term differs, so attention must be paid to the weight mapping from the base prediction model to the target prediction model. The invention adopts a simple and effective mapping method. Let E* denote the lowest-cost setting among the experiments selected by the training data collector, with running time t_base on the base instance type. Experiment E* is then run on the target instance type to obtain the running time t_target. The model converter derives the prediction model of the target instance type as

    T_target(x, y) = θ · T_base(x, y), where θ = t_target / t_base.
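The converter reduces to a single scaling factor, which can be sketched as follows; the toy base model and the two measured runtimes of E* are hypothetical numbers.

```python
def convert_model(T_base, t_base, t_target):
    """Map the base-instance-type model to a target instance type by
    scaling with theta = t_target / t_base, per the converter above."""
    theta = t_target / t_base
    return lambda x, y: theta * T_base(x, y)

# Toy base model; the runtimes below are hypothetical measurements of E*.
base = lambda x, y: 100.0 * y / x + 30.0
c5_model = convert_model(base, t_base=50.0, t_target=40.0)  # theta = 0.8
```

Only one extra experiment per target instance type is needed, which is what keeps the training cost low.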
Step 3: the selector is constructed as follows.
The runtime prediction models for all instance types are integrated into a single runtime predictor T(x, y), where x is the cloud configuration vector consisting of the instance type and the number of instances. For a given task and input data set, the goal is to enable the user to find the preferred cloud configuration that meets certain runtime and cost constraints. Let P(x) be the price per unit time of cloud configuration x, i.e. the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can be expressed as

    x* = argmin over x of C(x, y) subject to R(y), where C(x, y) = P(x) × T(x, y) and 0 ≤ y ≤ 1,

where C(x, y) is the total cost of running the task under cloud configuration x, and R(y) is a user-added constraint such as a maximum tolerated runtime or a maximum tolerated cost. The selector S is determined by the user and selects the best cloud configuration x* that meets the desired performance or cost.
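Because the candidate configurations are a finite grid of instance types and counts, the selection can be sketched as a direct enumeration; the toy predictors, prices, and runtime bound are illustrative assumptions.

```python
def select_config(predictors, prices, n_range, max_runtime):
    """predictors: {instance_type: T(x, y)} runtime models (seconds);
    prices: $/hour per single instance. Return the cheapest (type, count)
    whose predicted full-data runtime meets the user constraint."""
    best, best_cost = None, float("inf")
    for itype, T in predictors.items():
        for n in n_range:
            t = T(n, 1.0)                       # full input data, y = 1
            cost = prices[itype] * n * t / 3600.0
            if t <= max_runtime and cost < best_cost:
                best, best_cost = (itype, n), cost
    return best, best_cost

# Hypothetical runtime models for two instance types
predictors = {
    "m4.large": lambda n, y: 7200.0 * y / n,
    "c5.large": lambda n, y: 9000.0 * y / n,
}
prices = {"m4.large": 0.1, "c5.large": 0.085}
choice, total_cost = select_config(predictors, prices, range(1, 9),
                                   max_runtime=3600.0)
```

With these toy models the cheapest configuration meeting the one-hour bound is two m4.large instances; swapping the objective and constraint (fastest within a budget) is a one-line change to the comparison.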
Drawings
FIG. 1 is a diagram of the communication patterns of the present invention.
FIG. 2 is the overall design block diagram of the present invention.
FIG. 3 compares the effectiveness of the present invention.
FIG. 4 shows the prediction accuracy on Spark of the present invention.
FIG. 5 shows the prediction accuracy on Hadoop of the present invention.
FIG. 6 compares the total task time and the model training time of the present invention.
FIG. 7 shows the prediction accuracy of TeraSort for different data set sizes.
FIG. 8 shows the cost of WordCount on different instance types.
FIG. 9 shows the completion time of TeraSort and WordCount on different cluster sizes.
Detailed Description
The invention provides an efficient cloud configuration selection framework for big data analysis tasks based on their computation and communication patterns, so that a user can find the cloud configuration suited to a given task, greatly reducing the computing cost of large-scale data analysis. The framework builds the prediction model through a small number of experiments using little input data and small-scale clusters, can convert the prediction model of one instance type into that of another with few extra experiments, and thereby lets a cloud computing user determine the optimal cloud configuration at low cost.
Referring to Fig. 2, the embodiment explains the process of the present invention by taking a cloud configuration selection algorithm (named Silhouette) for a big data analysis task implemented on Amazon Web Services (AWS) as an example, as follows:
step 1: the training data collection phase, implemented as follows,
the training data collector will only perform specific instance type experiments on a small portion of the input data, which will be used to predict the performance of task execution over the entire input data. Training data collection includes experimental selection and experimental execution.
Experiment selection: in experiment selection, two important experimental parameters need to be determined: (1) the proportion, i.e. the ratio of the experimental data to the total input data; (2) the number of cloud server instances used when the task is executed. In the embodiment, statistical techniques are used to select a subset of experimental parameter settings, favoring those that generate as much information as possible about the task's runtime performance, thereby ensuring high prediction accuracy. Let E_i = (x_i, y_i) denote an experimental parameter setting, where x_i is the number of instances and y_i is the input data ratio. Let M denote the total number of experimental parameter settings obtained by enumerating all possible proportions and instance counts. From E_i, we can compute the K-dimensional feature vector F_i, where each entry corresponds to a term in the prediction model; in this way we obtain M feature vectors for all experimental settings. Following D-optimality, we choose the experimental parameters that maximize the determinant of the weighted information matrix:

    max over α of det( sum_{i=1..M} α_i F_i F_i^T )
    subject to 0 ≤ α_i ≤ 1 for i ∈ [1, M], and sum_{i=1..M} α_i c_i ≤ B,

where α_i is the probability of selecting the i-th experimental setting and the budget constraint term B caps the total cost of the experiments; c_i is the cost of running experiment E_i under the cloud platform's pricing model, with runtime roughly proportional to y_i/x_i. After solving the above optimization problem, the M experimental settings are ranked by probability α_i in non-increasing order to select the experiments.
Experiment execution: after the experimental settings are selected, it must be determined which data samples from the entire input data set compose the experimental data set so as to meet the specified proportion. The present invention employs random sampling to select data samples from the entire input data set, because random sampling avoids being trapped in isolated regions of the data set. After obtaining the small data set, a specified number of instances are deployed using the selected experimental setting and the task is started; afterwards, the experimental parameters and the task completion time are used as training data for building the prediction model.
The specific implementation of the examples is illustrated below:
The large-scale data analysis processing engines used in the examples are Spark and Hadoop. On Spark, we run three Spark MLlib-based machine learning tasks: classification, regression, and clustering. The classification algorithm uses the text classification benchmark data set rcv1 with 44,000 features; the regression and clustering algorithms use a synthetic data set of 1 million samples with 44,000 features. On Hadoop, the TeraSort and WordCount algorithms are run separately. TeraSort is a common benchmark application for large-scale data analysis whose main task is to sort randomly generated records, here using a data set of 200 million samples; WordCount is used to count the frequency of words in 55 million entries from Wikipedia articles.
From the EC2 instance type pool of AWS, m4.large (general purpose), c5.large (compute optimized), r4.large (memory optimized) and i3.large (storage optimized) were selected; each instance type has 2 vCPUs and a pre-installed Linux system. The data analysis processing engines used in the experiments are Apache Spark 2.2 and Hadoop 2.8, respectively. Table 1 lists the configuration and price of each instance type.
TABLE 1

Instance type | Memory (GiB) | Instance storage | Price (USD/hour)
m4.large      | 8            | EBS              | 0.1
c5.large      | 4            | EBS              | 0.085
r4.large      | 15.25        | EBS              | 0.133
i3.large      | 15.25        | SSD              | 0.156
First, the data sizes for the modeling experiments are set to 1% to 8% of the input data, and the experimental cluster size is limited to 1 to 8 instances; in the example, the 10 settings with the largest probabilities α_i are selected as experiments. When selecting the input data samples, an initial seed sample is randomly chosen from the input data set; then, at each sampling step, another sample is drawn at random; this process is repeated until the number of selected samples meets the proportion required by the experimental parameters. In the embodiment, m4.large is used as the base instance type, so the randomly sampled data set is finally run on an m4.large cluster of the scale specified in the experimental parameters, and the runtime is recorded.
Step 2: the model construction phase, which is implemented as follows,
the model constructor is composed of a model constructor and a model converter. With the training data collected for a particular instance type, the model builder may build a base prediction model. Then, the model converter converts and derives the prediction models of the other example types according to the basic prediction model.
The model builder: when running experiments with subsets of the input data set on a specific instance type, T_base(x, y) denotes the task runtime given the number of instances x and the data set proportion y. Large-scale analysis tasks typically run in successive steps (i.e., iterations) until a termination condition is met. Each step consists essentially of two stages: concurrent computation and data communication. The computation time of task execution is related to the data set size, and there are several representative communication patterns in large-scale analysis tasks. Thus, the runtime of a large-scale analysis task can be inferred by resolving the computation time and the communication time. The main goal of this embodiment is to obtain the performance prediction function T_base(x, y) of a given task by analyzing its computation and communication patterns and designing fitting terms related to x and y.
Computation time: the user-defined iterative algorithm incurs a time cost for operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fitting terms, depending on the characteristics of the data set (e.g., dense or sparse) and the algorithm. Thus, the computation time is a function of the number of instances and the size of the data set. To determine the exact fitting terms of the function, specific domain knowledge needs to be incorporated.
Communication time: the time cost of transmitting data over the network to the target nodes. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Despite differences in programming models and execution mechanisms, these common communication patterns represent most communication scenarios in cluster applications. The communication time is mainly a function of the number of instances, and the fitting term of the function can be deduced from the task's communication pattern. For example, when the data size per instance is constant, the communication time grows linearly with the number of instances under the partition-aggregate communication pattern, but quadratically under the shuffle communication pattern.
Given all candidate fitting terms of T_base(x, y), we use mutual information as the selection criterion, excluding redundant terms and selecting only good predictors as fitting terms. Let F = {f_1, ..., f_K} denote the set of all candidate terms, where each f_k is a function of x and y determined by the computation and communication patterns. Given m training data samples collected at different instance counts and data sizes, the K-dimensional feature vector F_i = (f_{1,i}, ..., f_{K,i}) is first computed for each experimental setting, e.g. f_{k,i} = y_i/x_i. Then, we compute the mutual information between each term and the runtime, and select the terms whose mutual information with the runtime is above a threshold. From the m training runtime samples, the base prediction model is obtained by fitting

    T_base(x, y) = sum_{k=1..K} β_k w_k f_k(x, y),

where w_k is the weight of term f_k and β_k ∈ {0, 1} indicates whether fitting term f_k is selected (β_k = 1 means the term is selected).
A model converter: cloud providers typically offer a variety of instance families with different combinations of CPU, memory, hard disk and network capacity to meet the needs of different jobs, such as general purpose and compute/storage optimization. Through a number of experiments, we have found that given a task and a fixed data set, one instance type's runtime can be transformed to a different instance type according to a simple mapping. Therefore, the prediction model is constructed without the need to experiment for each instance type to obtain training data, which greatly reduces training time and training costs.
The converter Φ is the mapping Φ: T_base(x, y) → T_target(x, y) from the base prediction model to the target prediction model. By comparing the runtimes of different instance types on the same task and data set scale, it is known from the above that the classes of fitting terms in the prediction function are similar. In other words, for the same task and data set size, if f_k is contained in T_base(x, y), then T_target(x, y) is likely to include f_k as well. This is mainly because the computation and communication patterns of the task remain essentially unchanged for the same application configuration and number of instances. However, under different instance types the weight of each term differs, so we need to focus on the weight mapping from the base prediction model to the target prediction model. We adopt a simple and effective mapping method. Let E* denote the lowest-cost setting among the experiments selected by the training data collector, with running time t_base on the base instance type. We run experiment E* on the target instance type to obtain the running time t_target. The model converter then derives the prediction model of the target instance type as

    T_target(x, y) = θ · T_base(x, y), where θ = t_target / t_base.
The specific implementation of the example is as follows:
In the embodiment, the fitting terms added to the prediction function are: a constant term, the linear term y/x, and a term combining the square root of the data size with the number of instances, √y/x. The fixed constant represents the time spent in serial computation; for algorithms whose computation time is linear in the data set size, the fitting term y/x (data proportion over number of instances) is added; for sparse data sets, the fitting term √y/x is added.
TABLE 2

Communication mode   | Structure       | Fitting term
Parallel read/write  | Many One-to-One | x
Partition-aggregate  | Many-to-One     | log x
Broadcast            | One-to-Many     | x
Collect              | Many-to-One     | x
Shuffle              | Many-to-Many    | x^2
Global communication | All-to-All      | x^2
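The mapping from communication mode to its fitting term in the number of instances x can be transcribed directly from Table 2; the dictionary keys are illustrative names.

```python
import math

# Communication mode -> fitting term in the number of instances x (Table 2)
COMM_FIT_TERMS = {
    "parallel_read_write":  lambda x: x,            # Many One-to-One
    "partition_aggregate":  lambda x: math.log(x),  # Many-to-One
    "broadcast":            lambda x: x,            # One-to-Many
    "collect":              lambda x: x,            # Many-to-One
    "shuffle":              lambda x: x ** 2,       # Many-to-Many
    "global_communication": lambda x: x ** 2,       # All-to-All
}
```

Identifying a task's dominant communication mode thus decides which of x, log x, or x^2 enters the candidate term set.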
In the embodiment, according to the communication patterns of the different tasks, the communication fitting terms shown in Table 2 are used, namely x, log x, and x^2. After all terms are selected, the base prediction model is computed using a non-negative least squares (NNLS) solver. Thereafter, the lowest-cost experimental setting of the base experiments is selected and the task is run on the target instance type with the same setting. Finally, the prediction models of all instance types are derived as T_target(x, y) = θ · T_base(x, y).
And step 3: the selector is constructed in the following way,
all the fruits are put togetherThe runtime prediction model for instance types is integrated into a single runtime predictor T (x, y), where x is a cloud configuration vector consisting of the type and number of instances. For a given input data set of tasks, the goal is to enable the user to find the most preferred cloud configuration that meets certain runtime and cost constraints. Let p (x) be the price per unit time of the cloud configuration x, i.e. the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can be expressed as x*(x, y), c (x), Ry, where Cx (Px × Tx), y,0 ≦ y ≦ 1
where C(x) is the total cost of running the task under cloud configuration x, and R(y) is a user-added constraint such as a maximum tolerated runtime or a maximum tolerated cost. The selector S is specified by the user and selects the best cloud configuration x* that meets the desired performance or cost.
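The selection problem above can be sketched as an exhaustive search over the candidate configurations. The prices, the toy predictor, and the runtime bound below are hypothetical stand-ins for P(x), T(x, y), and R(y):

```python
def select_config(candidates, price_per_hour, predict_runtime, y, max_runtime):
    """Pick the cheapest cloud configuration meeting a runtime constraint.

    candidates: iterable of (instance_type, count) pairs.
    price_per_hour: dict mapping instance_type to unit price P.
    predict_runtime: function (instance_type, count, y) -> predicted hours T.
    """
    best, best_cost = None, float("inf")
    for itype, count in candidates:
        t = predict_runtime(itype, count, y)
        if t > max_runtime:                         # constraint R(y)
            continue
        cost = price_per_hour[itype] * count * t    # C = P(x) * T(x, y)
        if cost < best_cost:
            best, best_cost = (itype, count), cost
    return best, best_cost

# Hypothetical usage: a toy predictor whose runtime halves per doubling of count.
prices = {"m4.large": 0.10, "c5.large": 0.085}
pred = lambda itype, n, y: (8.0 if itype == "m4.large" else 6.0) * y / n
cfg, cost = select_config([(t, n) for t in prices for n in (1, 2, 4)],
                          prices, pred, 1.0, 4.0)
```

With these toy numbers the single-instance configurations violate the 4-hour bound, and the cheaper compute-optimized type is chosen among the remaining candidates.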
A specific implementation of the embodiment is as follows:
After prediction models for all tasks on all candidate instance types have been obtained, the optimal cloud configuration with the lowest running cost must be found; the chosen configuration should allow the task to complete in the shortest time under a given cost budget. For the examples, the algorithm was evaluated on four criteria: effectiveness, prediction accuracy, training cost, and application extensibility.
Effectiveness: the performances of SILHOUETTE and Ernest in 5 tasks were compared. Fig. 3(a) shows that the prediction accuracy of silent is comparable to that of Ernest, and fig. 3(b) shows that the training time and training cost of silent is much lower than Ernest. When we build the predictive model for 2 cases, silent can save 25% of training time and 30% of cost. As can be seen from fig. 3(c), when there are more candidate instance types, the training time and training cost of silent is much lower than Ernest, and when there are 5 candidate instance types, the training time of silent and Ernest is 25 minutes and 83 minutes, respectively. When there are more candidate instance types, it is expected that silent performs better.
Prediction accuracy: Figs. 4 and 5 show that both the basic prediction model for m4.large and the transformed prediction model for c5.large achieve high accuracy, confirming the effectiveness of the model converter in SILHOUETTE.
Training cost: SILHOUETTE aims to find the best cloud configuration with low overhead. The time spent collecting training data to build the basic prediction model is therefore compared with the completion time of the entire task. Fig. 6 shows that for all applications except TeraSort, the training time of SILHOUETTE is less than 20% of the total completion time.
Application extensibility: SILHOUETTE uses the same experimental settings to construct the basic and transformed prediction models and evaluates their prediction accuracy on data sets of different sizes. Fig. 7 shows that the prediction error stays below 15% when the data set is scaled to 1.5, 2, 2.5, and 3 times its original size, indicating that the prediction models built by SILHOUETTE maintain high accuracy even as the data set size changes.
In this embodiment SILHOUETTE is used to select the best cloud configuration for WordCount. Considering the four instance types in Table 1, assume the selector's optimization objective is: given a maximum task completion time, minimize the overall cost. Fig. 8 shows the total time and cost of running the task over the data set with each instance type. The total time of the compute-optimized instance type c5.large is comparable to that of the storage-optimized instance type i3.large, and SILHOUETTE chooses the former, which costs less.
SILHOUETTE can then be used to determine the best number of instances for a given instance type. Consider two tasks, TeraSort and WordCount. Fig. 9 shows the run times of the two tasks at different cluster sizes; the run time predicted by SILHOUETTE is very close to the actual run time, so a specific cluster size can be selected.
The specific embodiments described herein merely illustrate the spirit of the invention. Those skilled in the art may make various modifications, additions, or substitutions to the described embodiments without departing from the spirit of the invention or the scope defined by the appended claims.

Claims (5)

1. The efficient cloud configuration selection algorithm for the big data analysis task is characterized by comprising the following steps of:
step 1: collecting training data: selecting several input data proportions and the numbers of cloud server instances used when the tasks corresponding to those proportions are executed, and recording each group of test parameters together with the task completion time, wherein a proportion refers to the fraction of the input data used in the experiment, the proportion range being specifically 1%-10% of the input data;
step 2: constructing a model: using the test parameters and task completion times from step 1, designing a fitting polynomial in the input data proportion and the number of instances, and determining the basic prediction model

T_base(x, y) = Σ_k β_k w_k f_k(x, y),  β_k ∈ {0, 1},  w_k ≥ 0,

wherein β_k indicates whether the fitting term f_k is selected (β_k = 1 denotes selecting this term), and f_k represents a candidate function of x and y determined by the computation and communication modes;
model conversion: running, on the target instance type, the experimental setting from step 1 that consumes the least time, the obtained running time being t_target; by mapping, the prediction model of the target instance type is derived as

T_target(x, y) = μ · T_base(x, y),  wherein μ = t_target / t_base,

and t_base is the running time of the same experimental setting on the base instance type;
step 3: constructing the selector: for a given input data set of the task, computing, using the prediction models obtained in step 2, the preferred cloud configuration that meets the given runtime and cost constraints.
2. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 1, wherein:
the specific process in step 1 of selecting several input data proportions and the numbers of cloud server instances used when the tasks corresponding to those proportions are executed is as follows:
firstly, input data proportions within a certain range and cloud server instance numbers within a certain range are selected, and, according to D-optimality, the experimental parameters maximizing the weighted sum of covariance matrices Σ_{i=1}^{M} α_i F_i F_i^T are selected, i.e.

max_α log det( Σ_{i=1}^{M} α_i F_i F_i^T )

subject to 0 ≤ α_i ≤ 1, i ∈ [1, M], and Σ_{i=1}^{M} α_i (y_i / x_i) ≤ B,

wherein α_i denotes the probability of selecting experimental setting i, x_i is the number of instances, y_i is the proportion of input data, M represents the total number of experimental parameter settings obtained by enumerating all possible proportions and instance numbers, and F_i represents a feature vector;
the budget constraint term B bounds the total cost of the experiments, y_i/x_i being the cost of running experiment E_i under the pricing model of the cloud platform;
the M experimental settings are sorted in non-increasing order of the probability α_i, and the top-ranked experimental parameter groups in the ordering are selected as training data;
wherein the range of the number of cloud server instances is 1-10.
3. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 2, wherein: the experimental parameter groups ranked in the top 10 of the non-increasing ordering are selected as training data.
4. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 1, wherein:
the input data of a certain proportion described in step 1 is selected from the whole input data set by random sampling.
5. The efficient cloud configuration selection algorithm for big data analytics tasks as claimed in claim 1, wherein: fitting terms in the model construction involve computation and communication time.
CN201910294273.4A 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task Active CN110048886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910294273.4A CN110048886B (en) 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task


Publications (2)

Publication Number Publication Date
CN110048886A CN110048886A (en) 2019-07-23
CN110048886B (en) 2020-05-12

Family

ID=67277094


Country Status (1)

Country Link
CN (1) CN110048886B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301067A (en) * 2020-04-01 2021-08-24 阿里巴巴集团控股有限公司 Cloud configuration recommendation method and device for machine learning application
CN115118592B (en) * 2022-06-15 2023-08-08 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator feature analysis

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108053026A (en) * 2017-12-08 2018-05-18 武汉大学 A kind of mobile application background request adaptive scheduling algorithm

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN103220337B (en) * 2013-03-22 2015-10-21 合肥工业大学 Based on the cloud computing resources Optimal Configuration Method of self adaptation controller perturbation
US10043194B2 (en) * 2014-04-04 2018-08-07 International Business Machines Corporation Network demand forecasting
CN109088747A (en) * 2018-07-10 2018-12-25 郑州云海信息技术有限公司 The management method and device of resource in cloud computing system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant