CN110048886A - An efficient cloud configuration selection algorithm for big data analysis tasks - Google Patents


Info

Publication number: CN110048886A
Application number: CN201910294273.4A
Authority: CN (China)
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN110048886B
Inventors: 陈艳姣, 林龙
Current assignee: Wuhan University (WHU)
Original assignee: Wuhan University (WHU)
Application filed by Wuhan University (WHU); priority to CN201910294273.4A
Publication of CN110048886A; application granted; publication of CN110048886B

Classifications

    • G — Physics
    • G06 — Computing; calculating or counting
    • G06N — Computing arrangements based on specific computational models
    • G06N20/00 — Machine learning
    • H — Electricity
    • H04 — Electric communication technique
    • H04L — Transmission of digital information, e.g. telegraphic communication
    • H04L41/00 — Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 — Configuration management of networks or network elements
    • H04L41/0803 — Configuration setting
    • H04L67/00 — Network arrangements or protocols for supporting network services or applications
    • H04L67/01 — Protocols
    • H04L67/10 — Protocols in which an application is distributed across nodes in the network

Abstract

The invention proposes an efficient cloud configuration selection algorithm for big data analysis tasks. Small-scale cluster experiments are run on a selected portion of the input data to construct a performance prediction model; the prediction model then estimates the task's performance on a large-scale cluster, and the prediction results determine the optimal cloud configuration. With this algorithm, users can find the optimal cloud configuration effectively at low model-training time and cost. Selecting the optimal cloud configuration for a large-scale data analysis task to be deployed on a cloud computing platform can significantly improve its operating efficiency and reduce its operating cost.

Description

An Efficient Cloud Configuration Selection Algorithm for Big Data Analysis Tasks

Technical Field

The invention belongs to the field of cloud computing, and in particular relates to an efficient cloud configuration selection algorithm for big data analysis tasks.

Background Art

Large-scale data analysis tasks are growing in both number and complexity, frequently involving machine learning, natural language processing, and image processing. Compared with traditional computing tasks, such tasks are usually both data-intensive and compute-intensive, requiring longer computation times at higher cost. To complete large-scale data analysis tasks, the enormous computing power of cloud platforms is therefore typically used. Selecting the best cloud configuration for a large-scale analysis task improves its operating efficiency and reduces the user's computing cost.

To meet different computing requirements, existing cloud service providers offer users hundreds of instance types with different resource configurations (e.g., Amazon's EC2, Microsoft's Azure, and Google's Compute Engine). While most cloud providers only let users pick from a pool of available instance types, Google's Compute Engine also lets users custom-configure virtual machines (choosing vCPUs and memory), which makes choosing the right cloud configuration even more challenging. In addition, major cloud providers offer serverless architectures (e.g., Amazon Lambda, Google Cloud Functions, and Microsoft Azure Functions), which let users run tasks as serverless functions without launching instances with pre-specified configurations. However, serverless architectures may require applications to refactor their code, and serverless providers do not help users minimize task completion time or reduce computing cost.

The choice of cloud configuration, i.e., the instance type and the number of instances, directly affects a task's completion time and monetary cost. A properly chosen cloud configuration can achieve the same performance goal at lower cost. Because large-scale data analysis tasks run longer, uncovering potential cost savings matters even more. The diversity of tasks, combined with the many possible combinations of instance type and cluster size, makes the search space of cloud configurations enormous.

In such a large search space, exhaustive search for the best cloud configuration is neither practical nor scalable. The CherryPick algorithm restricts the search space using limited task information to select the best cloud configuration. CherryPick is optimized for cost minimization, but cannot optimize other objectives, such as minimizing job completion time under a cost budget. Ernest and PARIS instead use performance modeling to select cloud configurations. With such performance prediction models, users can select different cloud configurations for tasks with different optimization objectives, e.g., the cheapest or the fastest configuration. However, Ernest must train a prediction model for every instance type, while PARIS only selects the best instance type across multiple public clouds and does not determine the cluster size.

Summary of the Invention

Aiming at the deficiencies of the prior art, the present invention proposes an efficient cloud configuration selection algorithm for big data analysis tasks.

The technical solution of the present invention is an efficient cloud configuration selection algorithm for big data analysis tasks, comprising the following steps:

Step 1: training data collection stage, implemented as follows.

The training data collector runs experiments of a specific instance type on only a small portion of the input data, which is then used to predict task performance on the entire input data. Training data collection consists of experiment selection and experiment execution.

Experiment selection: two important experiment parameters must be determined: (1) the scale, i.e., the fraction of the total input data used in the experiment; and (2) the number of cloud server instances used to execute the task. The invention uses statistical techniques to select a subset of experiment parameters, preferring parameters that yield as much information as possible for predicting task runtime, thereby ensuring high prediction accuracy. Let E_i = (x_i, y_i) denote an experiment parameter setting, where x_i is the number of instances and y_i is the input data fraction, and let M denote the total number of settings obtained by enumerating all possible fractions and instance counts. From E_i, a K-dimensional feature vector F_i can be computed, each component corresponding to one term of the prediction model; in this way, M feature vectors are obtained over all experiment settings. Following D-optimality, the experiment parameters are chosen to maximize the determinant of the weighted information (covariance) matrix:

maximize det( Σ_{i=1}^{M} α_i · F_i · F_i^T )

subject to Σ_{i=1}^{M} α_i · (y_i / x_i) ≤ B and 0 ≤ α_i ≤ 1, i ∈ [1, M],

where α_i is the probability of selecting experiment setting i, the budget term B caps the total cost of the experiments, and y_i / x_i is the cost of running experiment E_i under the cloud platform's pricing model. After solving this optimization problem, the M experiment settings are sorted in non-increasing order of α_i and the top settings are selected as training experiments; in the present invention, the top 10 settings are selected as training data.
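The experiment-selection step above can be approximated greedily. The sketch below is a minimal illustration under stated assumptions, not the patent's actual solver: the candidate feature terms and the greedy relaxation of the D-optimal design are both illustrative choices; it repeatedly adds the setting that most increases the determinant of the information matrix while respecting the budget Σ y_i/x_i ≤ B.

```python
import math

def features(x, y):
    # Candidate fitting terms for the feature vector F_i; this specific
    # term set is an illustrative assumption, not the patent's exact choice.
    return [1.0, y / x, math.log(x)]

def det(m):
    # Determinant via Gaussian elimination with partial pivoting.
    m = [row[:] for row in m]
    n = len(m)
    d = 1.0
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[p][i]) < 1e-12:
            return 0.0
        if p != i:
            m[i], m[p] = m[p], m[i]
            d = -d
        d *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return d

def select_experiments(settings, n_pick, budget):
    # Greedy D-optimal design: repeatedly add the setting E_i = (x_i, y_i)
    # that most increases det(sum_i F_i F_i^T), subject to the total
    # experiment cost sum_i y_i / x_i <= budget.
    k = len(features(1, 0.01))
    info = [[1e-9 * (a == b) for b in range(k)] for a in range(k)]  # tiny ridge
    chosen, cost, pool = [], 0.0, list(settings)
    for _ in range(n_pick):
        best, best_gain, best_info = None, -1.0, None
        for (x, y) in pool:
            if cost + y / x > budget:
                continue
            f = features(x, y)
            cand = [[info[a][b] + f[a] * f[b] for b in range(k)] for a in range(k)]
            g = det(cand)
            if g > best_gain:
                best, best_gain, best_info = (x, y), g, cand
        if best is None:
            break
        chosen.append(best)
        cost += best[1] / best[0]
        info = best_info
        pool.remove(best)
    return chosen

# Enumerate 1-8 instances and 1%-8% data fractions (as in the embodiment).
settings = [(x, y / 100.0) for x in range(1, 9) for y in range(1, 9)]
picked = select_experiments(settings, n_pick=10, budget=1.0)
```

A convex solver over the α_i (as the text's formulation suggests) would replace the greedy loop in a full implementation.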

Experiment execution: after the experiment settings are selected, it must be determined which data samples from the entire input dataset form the experimental dataset so as to satisfy the specified fraction. The invention uses random sampling to select data samples from the entire input dataset, because random sampling avoids getting stuck in an isolated region of the dataset. After the small dataset is obtained, the specified number of instances is deployed with the selected experiment settings and the task is run; the experiment parameters and the task completion time then serve as training data for building the prediction model.
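The sampling step can be sketched as follows; `sample_fraction` is a hypothetical helper name, and uniform sampling without replacement is one simple way to realize the random sampling described above.

```python
import random

def sample_fraction(dataset, fraction, seed=None):
    # Uniformly sample `fraction` of the records without replacement;
    # random sampling avoids concentrating on an isolated region of the
    # dataset, as a contiguous slice might.
    rng = random.Random(seed)
    n = max(1, round(len(dataset) * fraction))
    return rng.sample(dataset, n)

# E.g., build the 2% experiment dataset for one selected setting.
data = list(range(10000))
subset = sample_fraction(data, 0.02, seed=42)
```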

Step 2: model construction stage, implemented as follows.

The model constructor consists of a model builder and a model transformer. Using the training data collected on a specific instance type, the model builder builds a base prediction model. The model transformer then derives prediction models for the remaining instance types from the base model.

Model builder: when experiments are run on subsets of the input dataset on a specific instance type, let T_base(x, y) denote the task runtime given x instances and a data fraction y. Large-scale analysis tasks usually run in successive steps (i.e., iterations) until a termination condition is met. Each step mainly consists of two phases: concurrent computation and data communication. The computation time of a task scales with the dataset size, and large-scale analysis tasks exhibit a few representative communication patterns. The runtime of a large-scale analysis task can therefore be inferred by analyzing its computation time and communication time. The main objective of the invention is to obtain the performance prediction function T_base(x, y) for a given task from its computation and communication patterns, by designing fitting terms involving x and y.

Computation time is the time cost incurred by the user-defined iterative algorithm operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fitting terms, depending on the characteristics of the dataset (e.g., dense or sparse) and the algorithm. Thus, the computation time is a function of the number of instances and the dataset size.

Communication time is the time cost of transmitting data over the network to the target nodes. Figure 1 abstracts the representative communication patterns of large-scale data analysis tasks. Despite differences in programming models and execution mechanisms, these common patterns cover most communication in cluster applications. Communication time is mainly a function of the number of instances, and the fitting terms can be inferred from the task's communication pattern. For example, when the data size per instance is fixed, communication time grows linearly with the number of instances under the partition-aggregate pattern, but quadratically under the shuffle pattern.
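As a concrete illustration of how patterns map to fitting terms: the shapes below follow the linear and quadratic growth stated above; representing them directly as x and x² terms (per-instance data size fixed) is an assumption.

```python
# Candidate communication-time fitting terms per pattern, for a fixed
# per-instance data size: linear in the number of instances x for
# partition-aggregate, quadratic in x for shuffle.
COMM_TERMS = {
    "partition-aggregate": lambda x, y: float(x),
    "shuffle": lambda x, y: float(x * x),
}

linear = COMM_TERMS["partition-aggregate"](8, 1.0)  # 8 instances -> 8.0
quadratic = COMM_TERMS["shuffle"](8, 1.0)           # 8 instances -> 64.0
```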

Given all candidate fitting terms of T_base(x, y), mutual information is used as the selection criterion, excluding redundant terms and keeping only good predictors. Let {f_1, …, f_K} denote the set of all candidate terms, where each term f_k is a function of x and y determined by the computation and communication patterns. Given m training samples collected at different instance counts and data fractions, first compute the K-dimensional feature vector F_i = (f_{1,i}, …, f_{K,i}) for each experiment setting, e.g., f_{k,i} = y_i / x_i. Then compute the mutual information between each term and the runtime, and keep the terms whose mutual information with the runtime exceeds a threshold. From the m training runtime samples, fit the values of w_k in the base prediction model

T_base(x, y) = Σ_{k=1}^{K} β_k · w_k · f_k(x, y),

where β_k indicates whether the fitting term f_k was selected (β_k = 1 means selected).
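A minimal sketch of the term selection and fit under stated assumptions: the histogram-based mutual-information estimator, the threshold value, and the candidate term set are all illustrative stand-ins (the patent does not specify an estimator), and the weights w_k are fitted by ordinary least squares via the normal equations.

```python
import math
from collections import Counter

def binned_mi(a, b, bins=4):
    # Crude histogram-based mutual-information estimate (in nats)
    # between two equal-length numeric sequences.
    def digitize(v):
        lo, hi = min(v), max(v)
        w = (hi - lo) / bins or 1.0
        return [min(int((x - lo) / w), bins - 1) for x in v]
    da, db = digitize(a), digitize(b)
    n = len(a)
    pa, pb, pab = Counter(da), Counter(db), Counter(zip(da, db))
    return sum(c / n * math.log(c * n / (pa[i] * pb[j]))
               for (i, j), c in pab.items())

def solve(A, b):
    # Solve the linear system A w = b by Gauss-Jordan elimination.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(n):
            if r != i:
                f = M[r][i] / M[i][i]
                for c in range(n + 1):
                    M[r][c] -= f * M[i][c]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_base_model(samples, terms, mi_threshold=0.02):
    # samples: list of ((x, y), runtime); terms: candidate f_k(x, y).
    # Keep terms whose mutual information with the runtime exceeds the
    # threshold (constant columns are kept unconditionally), then fit
    # the weights w_k by least squares: solve (A^T A) w = A^T t.
    t = [r for _, r in samples]
    cols = [[f(x, y) for (x, y), _ in samples] for f in terms]
    keep = [k for k, col in enumerate(cols)
            if len(set(col)) == 1 or binned_mi(col, t) > mi_threshold]
    A = [[cols[k][i] for k in keep] for i in range(len(samples))]
    K = len(keep)
    AtA = [[sum(row[a] * row[b] for row in A) for b in range(K)] for a in range(K)]
    Atb = [sum(A[i][a] * t[i] for i in range(len(A))) for a in range(K)]
    w = solve(AtA, Atb)
    return lambda x, y: sum(w[j] * terms[keep[j]](x, y) for j in range(K))

# Illustrative use: fit on the experiment grid (the "true" runtime below
# is synthetic, just to exercise the fit).
terms = [lambda x, y: 1.0, lambda x, y: y / x, lambda x, y: math.log(x)]
samples = [((x, y / 100.0), 1.0 + 50.0 * (y / 100.0) / x + 2.0 * math.log(x))
           for x in range(1, 9) for y in range(1, 9)]
predict_base = fit_base_model(samples, terms, mi_threshold=0.0)
```

A production version would use a proper continuous mutual-information estimator rather than the fixed-bin histogram above.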

Model transformer: cloud providers usually offer a variety of instance families with different combinations of CPU, memory, disk, and network capacity to suit different jobs, e.g., general purpose and compute-/memory-/storage-optimized. Extensive experiments show that, for a given task and a fixed dataset, the runtime on one instance type can be converted to another instance type via a simple mapping. Thus, there is no need to run experiments on every instance type to collect training data before building its prediction model, which greatly reduces training time and cost.

The transformer Φ is a mapping from the base prediction model to the target prediction model, Φ: T_base(x, y) → T_target(x, y). Comparing the runtimes of different instance types under the same task and dataset size shows that the categories of fitting terms in the prediction functions are similar. In other words, under the same task and dataset size, if f_k appears in T_base(x, y), it is likely that T_target(x, y) should also contain f_k. This is mainly because, under the same application configuration and number of instances, the computation and communication patterns of a task remain essentially unchanged. Under different instance types, however, the weight of each term differs, so the focus is the weight mapping from the base model to the target model. The invention adopts a simple and effective mapping. Let E* denote the lowest-cost experiment setting chosen by the training data collector, with runtime t_base. The experiment is rerun on the target instance type to obtain the runtime t_target. The model transformer then derives the target instance type's prediction model as

T_target(x, y) = (t_target / t_base) · T_base(x, y).
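The weight mapping reduces to a single scale factor. A minimal sketch, assuming (per the simple mapping above) that the base-to-target runtime ratio measured on one probe experiment applies uniformly:

```python
def transform_model(predict_base, t_base, t_target):
    # Derive T_target(x, y) = (t_target / t_base) * T_base(x, y) from one
    # probe run of the cheapest selected experiment on the target type.
    scale = t_target / t_base
    return lambda x, y: scale * predict_base(x, y)

# E.g., the base model predicts 120 s for some (x, y); the probe took
# 100 s on the base type and 80 s on the target type.
predict_target = transform_model(lambda x, y: 120.0, t_base=100.0, t_target=80.0)
runtime = predict_target(4, 1.0)  # 120 * 80/100 = 96.0
```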

Step 3: selector construction stage, implemented as follows.

The runtime prediction models of all instance types are integrated into a single runtime predictor T(x, y), where x is the cloud configuration vector consisting of the instance type and the number of instances. For a given input dataset, the goal is to let the user find the most preferred cloud configuration satisfying specific runtime and cost constraints. Let P(x) be the unit-time price of cloud configuration x, i.e., the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can then be formulated as

x* = S(T(x, y), C(x), R(y)), where C(x) = P(x) × T(x, y), 0 ≤ y ≤ 1,

where C(x) is the total cost of running the task under cloud configuration x, and R(y) denotes user-supplied constraints, such as a maximum tolerable runtime or a maximum tolerable cost. The selector S(·) is specified by the user and selects the best cloud configuration x* meeting the desired performance or cost.
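One concrete selector is exhaustive enumeration over the (now cheap-to-evaluate) predictor. The sketch below picks the cheapest configuration under a runtime cap; the instance names and prices follow Table 1 of the embodiment, while the toy runtime predictor and the "cheapest under a deadline" objective are illustrative assumptions.

```python
def choose_config(configs, predict, prices, max_runtime=None, budget=None):
    # Return the configuration x* minimizing total cost C(x) = P(x) * T(x, y)
    # over the full dataset (y = 1), subject to the user constraints R(y).
    best, best_cost = None, float("inf")
    for itype, n in configs:
        t = predict(itype, n)          # T(x, y = 1), in hours
        cost = prices[itype] * n * t   # C(x) = P(x) * T(x, y)
        if max_runtime is not None and t > max_runtime:
            continue
        if budget is not None and cost > budget:
            continue
        if cost < best_cost:
            best, best_cost = (itype, n), cost
    return best, best_cost

# Toy predictor: perfectly parallel job, c5.large 25% faster than m4.large.
prices = {"m4.large": 0.1, "c5.large": 0.085}
predict = lambda itype, n: (8.0 if itype == "m4.large" else 6.0) / n
configs = [(t, n) for t in prices for n in (1, 2, 4, 8)]
best, cost = choose_config(configs, predict, prices, max_runtime=2.0)
```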

Description of Drawings

Figure 1 is an overview of the communication patterns of the present invention.

Figure 2 is the overall design structure of the present invention.

Figure 3 compares the effectiveness of the present invention.

Figure 4 shows the prediction accuracy of the present invention on Spark.

Figure 5 shows the prediction accuracy of the present invention on Hadoop.

Figure 6 compares the total task time and the model training time of the present invention.

Figure 7 shows the prediction accuracy of TeraSort under different dataset sizes.

Figure 8 shows the cost of WordCount on different instance types.

Figure 9 shows the completion times of TeraSort and WordCount at different cluster sizes.

Detailed Description of the Embodiments

Based on the computation and communication patterns of big data analysis tasks, the present invention proposes an efficient cloud configuration selection framework that lets users find a cloud configuration suitable for a given big data analysis task, greatly reducing the computational cost of large-scale data analysis. The framework builds the prediction model from a small number of experiments, using very little input data and small clusters, and can convert the prediction model of one instance type into that of another with minimal additional experiments. With this framework, cloud computing users can determine the best cloud configuration at lower cost.

Referring to Figure 2, the embodiment takes the cloud configuration selection algorithm (named Silhouette) for big data analysis tasks, implemented on Amazon Web Services (AWS), as an example to describe the process of the present invention in detail, as follows.

Step 1: training data collection stage, implemented as follows.

The training data collector runs experiments of a specific instance type on only a small portion of the input data, which is then used to predict task performance on the entire input data. Training data collection consists of experiment selection and experiment execution.

Experiment selection: two important experiment parameters must be determined: (1) the scale, i.e., the fraction of the total input data used in the experiment; and (2) the number of cloud server instances used to execute the task. This embodiment uses statistical techniques to select a subset of experiment parameters, preferring parameters that yield as much information as possible for predicting task runtime, thereby ensuring high prediction accuracy. Let E_i = (x_i, y_i) denote an experiment parameter setting, where x_i is the number of instances and y_i is the input data fraction, and let M denote the total number of settings obtained by enumerating all possible fractions and instance counts. From E_i, a K-dimensional feature vector F_i can be computed, each component corresponding to one term of the prediction model, giving M feature vectors over all experiment settings. Following D-optimality, the experiment parameters are chosen to maximize the determinant of the weighted information (covariance) matrix, maximize det( Σ_{i=1}^{M} α_i · F_i · F_i^T ), subject to Σ_{i=1}^{M} α_i · (y_i / x_i) ≤ B and 0 ≤ α_i ≤ 1, i ∈ [1, M], where α_i is the probability of selecting experiment setting i, the budget term B caps the total experiment cost, and y_i / x_i is the cost of running experiment E_i under the cloud platform's pricing model. After solving this optimization problem, the M experiment settings are sorted in non-increasing order of α_i and the experiments are selected accordingly.

Experiment execution: after the experiment settings are selected, it must be determined which data samples from the entire input dataset form the experimental dataset so as to satisfy the specified fraction. The invention uses random sampling to select data samples from the entire input dataset, because random sampling avoids getting stuck in an isolated region of the dataset. After the small dataset is obtained, the specified number of instances is deployed with the selected experiment settings and the task is run; the experiment parameters and the task completion time then serve as training data for building the prediction model.

The specific implementation process of the embodiment is described as follows:

The large-scale data analysis engines used in the embodiment are Spark and Hadoop. On Spark, three kinds of Spark ML-based machine learning tasks were run: classification, regression, and clustering. The classification algorithm uses the text classification benchmark dataset rcv1 with 44,000 features; the regression and clustering algorithms use a synthetic dataset of 1 million samples with 44,000 features. On Hadoop, the TeraSort and WordCount algorithms were run. TeraSort is a general-purpose benchmark for large-scale data analysis whose main job is sorting randomly generated records, here on a dataset of 200 million samples; WordCount counts word frequencies in 55 million entries from Wikipedia articles.

From AWS's pool of EC2 instance types, m4.large (general purpose), c5.large (compute-optimized), r4.large (memory-optimized), and i3.large (storage-optimized) were selected; each instance type has 2 vCPUs and comes with Linux preinstalled. The data analysis engines used in the experiments are Apache Spark 2.2 and Hadoop 2.8. Table 1 lists the configuration and price of each instance type.

Table 1

Instance type | Memory (GiB) | Instance storage | Price (USD/hour)
m4.large      | 8            | EBS              | 0.100
c5.large      | 4            | EBS              | 0.085
r4.large      | 15.25        | EBS              | 0.133
i3.large      | 15.25        | SSD              | 0.156

The data fraction for the modeling experiments is set between 1% and 8%, and the experimental cluster size is limited to 1 to 8 instances. In the embodiment, the experiments with the top-10 probabilities α_i are run. When selecting input data samples, a starting seed sample is randomly chosen from the input dataset; then, at each sampling step, output samples are drawn at random; this process repeats until the number of selected samples meets the fraction required by the experiment parameters. The embodiment uses m4.large as the base instance type, so the randomly sampled dataset is finally run on an m4.large cluster of the size specified by the experiment parameters, and the runtime is recorded.

Step 2: model construction stage, implemented as follows.

The model constructor consists of a model builder and a model transformer. Using the training data collected on a specific instance type, the model builder builds a base prediction model. The model transformer then derives prediction models for the remaining instance types from the base model.

Model builder: when experiments are run on subsets of the input dataset on a specific instance type, let T_base(x, y) denote the task runtime given x instances and a data fraction y. Large-scale analysis tasks usually run in successive steps (i.e., iterations) until a termination condition is met. Each step mainly consists of two phases: concurrent computation and data communication. The computation time of a task scales with the dataset size, and large-scale analysis tasks exhibit a few representative communication patterns. The runtime of a large-scale analysis task can therefore be inferred by analyzing its computation time and communication time. The main goal of this embodiment is to obtain the performance prediction function T_base(x, y) for a given task from its computation and communication patterns, by designing fitting terms involving x and y.

Computation time is the time cost incurred by the user-defined iterative algorithm operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fitting terms, depending on the characteristics of the dataset (e.g., dense or sparse) and the algorithm. Thus, the computation time is a function of the number of instances and the dataset size. Determining the exact fitting terms of the function requires domain-specific knowledge.

Communication time: the time cost of transferring data over the network to the target nodes. Figure 1 abstracts the representative communication patterns of large-scale data analysis tasks. Despite differences in programming models and execution mechanisms, these common patterns cover most communication in cluster applications. The communication time is mainly a function of the number of instances, and its fit terms can be inferred from the task's communication patterns. For example, when the data size per instance is fixed, the communication time grows linearly with the number of instances under the partition-aggregate pattern, but quadratically under the shuffle pattern.

Given all candidate fit terms of the function Tbase(x, y), we use mutual information as the selection criterion, excluding redundant terms and keeping only good predictors as fit terms. Let {f1, ..., fK} denote the set of all candidate terms, where each term fk is a function of x and y determined by the computation and communication patterns. Given m training samples collected under different numbers of instances and different data scales, we first compute the K-dimensional feature vector Fi = (f1,i, ..., fK,i) for each experimental setting, e.g., fk,i = yi/xi. We then compute the mutual information between each term and the running time, and select the terms whose mutual information with the running time exceeds a threshold. From the m training runtime samples, we fit the base prediction model Tbase(x, y) = Σk βk·wk·fk(x, y) to obtain the values of wk, where βk indicates whether fit term fk is selected (βk = 1 means selected).
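As an illustrative sketch only (not the patented implementation), the term selection and fitting described above could look as follows; the candidate term set, the mutual-information threshold, and the use of scikit-learn's `mutual_info_regression` together with SciPy's `nnls` are assumptions for this sketch:

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.feature_selection import mutual_info_regression

# Candidate fit terms f_k(x, y): x = number of instances, y = input data fraction.
CANDIDATE_TERMS = {
    "const": lambda x, y: np.ones_like(x, dtype=float),
    "y/x":   lambda x, y: y / x,
    "log x": lambda x, y: np.log(x),
    "x":     lambda x, y: x.astype(float),
    "x^2":   lambda x, y: x.astype(float) ** 2,
}

def build_base_model(x, y, t, mi_threshold=0.05):
    """Select fit terms by mutual information with the runtime t, then fit
    the non-negative weights w_k of T_base(x, y) = sum_k w_k * f_k(x, y)."""
    feats = {name: f(x, y) for name, f in CANDIDATE_TERMS.items()}
    F = np.column_stack(list(feats.values()))
    mi = mutual_info_regression(F, t, random_state=0)
    # beta_k = 1 for terms whose MI exceeds the threshold (constant always kept)
    selected = [n for n, m in zip(feats, mi) if m > mi_threshold or n == "const"]
    F_sel = np.column_stack([feats[n] for n in selected])
    weights, _residual = nnls(F_sel, t)  # NNLS keeps every w_k >= 0
    return selected, weights
```

With training samples (xi, yi, ti) this returns the selected term names and their fitted weights; in the embodiment the candidate set and threshold would be chosen from the task's computation and communication patterns.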

Model converter: cloud providers typically offer a variety of instance families with different combinations of CPU, memory, disk, and network capacity to match different jobs, e.g., general purpose and compute/memory/storage optimized. Through extensive experiments, we found that for a given task and a fixed dataset, the running time on one instance type can be converted to that on a different instance type by a simple mapping. There is therefore no need to run experiments on every instance type to collect training data before building its prediction model, which greatly reduces training time and training cost.

The converter Φ is a mapping from the base prediction model to a target prediction model, Φ: Tbase(x, y) → Ttarget(x, y). Comparing the running times of different instance types under the same task and dataset scale shows that the categories of fit terms in the prediction functions are similar. In other words, for the same task and dataset scale, if fk is contained in Tbase(x, y), then Ttarget(x, y) is very likely to contain fk as well. This is mainly because the computation and communication patterns of a task remain essentially unchanged under the same application configuration and number of instances. Under different instance types, however, the weight of each term differs, so we focus on mapping the weights of the base prediction model to those of the target prediction model. We adopt a simple and effective mapping. Let the lowest-cost experimental setting chosen by the training data collector have running time tbase. We run the same experiment on the target instance type to obtain the running time ttarget. The model converter then derives the prediction model of the target instance type as Ttarget(x, y) = η·Tbase(x, y), where η = ttarget/tbase.
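A minimal sketch of the converter Φ under the scaling assumption above: every weight of the base model is multiplied by η = ttarget/tbase (function and variable names are illustrative):

```python
def convert_model(base_weights, t_base, t_target):
    """Derive the target instance type's weights by scaling each base weight
    by eta = t_target / t_base, both measured under the same lowest-cost setting."""
    eta = t_target / t_base
    return [eta * w for w in base_weights]

def predict(weights, terms, x, y):
    """Evaluate T(x, y) = sum_k w_k * f_k(x, y) for the given fit terms."""
    return sum(w * f(x, y) for w, f in zip(weights, terms))
```

Only one extra measurement (ttarget) per target instance type is needed, instead of a full set of training runs.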

The specific implementation of the embodiment is as follows:

In this embodiment, the fit terms added to the prediction function are: a constant term, a linear term y/x, and a term combining the square root of the data scale with the number of instances. The constant term represents the time spent in serial computation; for algorithms whose computation time is linear in the dataset size, the term in the data fraction over the number of instances, y/x, is added; for sparse datasets, the term combining the square root of the data scale with the number of instances is added.

Table 2

Communication mode      Structure          Fit term
Parallel read/write     Many one-to-one    x
Partition-aggregate     Many-to-one        log x
Broadcast               One-to-many        x
Collect                 Many-to-one        x
Shuffle                 Many-to-many       x²
Global communication    All-to-all         x²

In this embodiment, the communication fit terms shown in Table 2 (x, log x, and x²) are used according to the communication pattern of each task. After all terms are selected, a non-negative least squares (NNLS) solver is used to compute the base prediction model. The lowest-cost setting among the base experiments is then re-run on each target instance type with the same experimental setting. Finally, the prediction models of all instance types are derived as Ttarget(x, y) = η·Tbase(x, y) with η = ttarget/tbase.

Step 3: selector construction stage, implemented as follows.

The running time prediction models of all instance types are integrated into a single runtime predictor T(x, y), where x is a cloud configuration vector consisting of the instance type and the number of instances. For a given input dataset of the task, the goal is to let the user find the most preferred cloud configuration satisfying specific runtime and cost constraints. Let P(x) be the unit-time price of cloud configuration x, i.e., the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can then be formulated as x* = S(T(x, y), C(x), R(y)), where C(x) = P(x) × T(x, y) and 0 ≤ y ≤ 1,

where C(x) is the total cost of running the task under cloud configuration x and R(y) is a user-supplied constraint, such as a maximum tolerable running time or a maximum tolerable cost. The selector S(·) is specified by the user and selects the best cloud configuration x* that satisfies the desired performance or cost.
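A sketch of one possible selector S under the constraint "minimize total cost subject to a maximum running time"; the predictor functions, prices, and candidate instance counts below are illustrative assumptions, not values from the patent:

```python
def select_config(predictors, unit_prices, counts, y, max_runtime):
    """predictors: {instance_type: T(n, y)} runtime models; unit_prices:
    {instance_type: price per instance per unit time}.  Returns the
    (instance_type, count) minimizing C(x) = P(x) * T(x, y) subject to
    T(x, y) <= max_runtime, or (None, inf) if nothing qualifies."""
    best, best_cost = None, float("inf")
    for itype, T in predictors.items():
        for n in counts:
            runtime = T(n, y)
            if runtime > max_runtime:
                continue  # violates the user's runtime constraint R(y)
            cost = unit_prices[itype] * n * runtime  # P(x) = unit price * n
            if cost < best_cost:
                best, best_cost = (itype, n), cost
    return best, best_cost
```

Exhaustive enumeration is feasible here because the search space (a few instance types times a few dozen cluster sizes) is small, and each evaluation is just a call to a fitted prediction function.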

The specific implementation process of the embodiment is described as follows:

After obtaining the prediction models of all tasks on the candidate instance types, the optimal cloud configuration, i.e., the one with the lowest running cost, must be found. The chosen cloud configuration must allow the task to complete in the shortest time within a given cost budget. For this embodiment, the algorithm is evaluated on four criteria: effectiveness, prediction accuracy, training cost, and application scalability.

Effectiveness: we compare SILHOUETTE and Ernest on five tasks. Figure 3(a) shows that the prediction accuracy of SILHOUETTE is comparable to that of Ernest, and Figure 3(b) shows that SILHOUETTE's training time and training cost are far lower than Ernest's. When building prediction models for two instance types, SILHOUETTE saves 25% of the training time and 30% of the cost. Figure 3(c) shows that with more candidate instance types, SILHOUETTE's training time and cost fall even further below Ernest's; with five candidate instance types, the training times of SILHOUETTE and Ernest are 25 minutes and 83 minutes, respectively. With still more candidate instance types, SILHOUETTE can be expected to perform even better.

Prediction accuracy: Figures 4 and 5 show that both the base prediction model on m4.large and the converted prediction model on c5.large achieve high accuracy, confirming the effectiveness of the model converter in SILHOUETTE.

Training cost: SILHOUETTE aims to find the optimal cloud configuration with low overhead, so the completion time of the entire task is compared with the time spent collecting training data for the base prediction model. Figure 6 shows that, except for TeraSort, SILHOUETTE's training time is below 20% of the total completion time for all applications.

Application scalability: SILHOUETTE uses the same experimental settings to build base and converted prediction models on datasets of different sizes and evaluates their prediction accuracy. Figure 7 shows that with 1.5x, 2x, 2.5x, and 3x the dataset size, the prediction error stays below 15%, indicating that the prediction models built by SILHOUETTE remain accurate even when the dataset size changes.

In this embodiment, SILHOUETTE is used to select the best cloud configuration for WordCount. Considering the four instance types in Table 1, assume the selector's optimization objective is: given a maximum task completion time, minimize the total cost. Figure 9 shows the total time and total cost of running the task on the entire dataset under each instance type. The total time of the compute-optimized instance type c5.large is comparable to that of the storage-optimized instance type i3.large, and SILHOUETTE selects the former because of its lower cost.

SILHOUETTE can then be used to determine the optimal number of instances for a given instance type. Consider two tasks, TeraSort and WordCount. Figure 9 shows the running times of the two tasks under different cluster sizes; the running times predicted by SILHOUETTE are very close to the actual running times, so a concrete cluster size can be selected.

The specific embodiments described herein merely illustrate the spirit of the present invention. Those skilled in the art may make various modifications or additions to the described embodiments or substitute them in similar ways without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims (6)

1. An efficient cloud configuration selection algorithm for big data analysis tasks, characterized in that it comprises the following steps:

Step 1: training data collection: select several proportions of the input data and the number of cloud server instances used when executing the task at each proportion, and record each set of experiment parameters and the corresponding task completion time, where a proportion refers to the fraction of the input data used in an experiment;

Step 2: model construction: using the experiment parameters and task completion times from step 1, design a fitting polynomial involving the input data proportion and the number of instances, and determine the values of wk in the base prediction model Tbase(x, y) = Σk βk·wk·fk(x, y), where βk indicates whether fit term fk is selected (βk = 1 means selected);

model conversion: run the least time-consuming experiment setting from step 1 on the target instance type to obtain the running time ttarget, and derive the prediction model of the target instance type by a mapping as Ttarget(x, y) = η·Tbase(x, y), where η = ttarget/tbase;

Step 3: selector construction: for a given input dataset of the task, use the prediction models obtained in step 2 to compute the most preferred cloud configuration satisfying specific runtime and cost constraints.

2. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that the specific process of selecting several proportions of the input data and the corresponding numbers of cloud server instances in step 1 is as follows:

first select input data proportions within a certain range and cloud server instance counts within a certain range; following D-optimality, choose the experiment parameters that maximize the weighted sum of the covariance (information) matrix, subject to the constraints 0 ≤ αi ≤ 1, i ∈ [1, M], where αi is the probability of choosing experiment setting i, xi is the number of instances, yi is the input data proportion, and M is the total number of experiment parameter settings obtained by enumerating all possible proportions and instance counts;

express the total cost of the experiments by adding a budget constraint term B, where yi/xi is the cost of running experiment Ei according to the pricing model of the cloud platform;

sort the M experiment settings in non-increasing order of the probabilities αi, and select the top-ranked experiment parameter groups as training data.

3. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that the top 10 data groups in the non-increasing ordering are selected as training data.

4. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 2, characterized in that the input data proportions in step 1 are specifically 1% to 10% of the data, and the number of cloud server instances is 1 to 10.

5. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that the proportions of input data described in step 1 are selected from the entire input dataset by random sampling.

6. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that the fit terms in the model construction involve computation time and communication time.
CN201910294273.4A 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task Active CN110048886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910294273.4A CN110048886B (en) 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task


Publications (2)

Publication Number Publication Date
CN110048886A true CN110048886A (en) 2019-07-23
CN110048886B CN110048886B (en) 2020-05-12

Family

ID=67277094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910294273.4A Active CN110048886B (en) 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task

Country Status (1)

Country Link
CN (1) CN110048886B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301067A (en) * 2020-04-01 2021-08-24 阿里巴巴集团控股有限公司 Cloud configuration recommendation method and device for machine learning application
CN114996228A (en) * 2022-06-01 2022-09-02 南京大学 Server-unaware-oriented data transmission cost optimization method
CN115118592A (en) * 2022-06-15 2022-09-27 中国科学院软件研究所 Cloud configuration recommendation method and system for deep learning applications based on operator feature analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103220337A (en) * 2013-03-22 2013-07-24 合肥工业大学 Cloud computing resource optimizing collocation method based on self-adaptation elastic control
CN108053026A (en) * 2017-12-08 2018-05-18 武汉大学 A kind of mobile application background request adaptive scheduling algorithm
US20180285903A1 (en) * 2014-04-04 2018-10-04 International Business Machines Corporation Network demand forecasting
CN109088747A (en) * 2018-07-10 2018-12-25 郑州云海信息技术有限公司 The management method and device of resource in cloud computing system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN-CHUN CHEN et al.: "Using Deep Learning to Predict and Optimize Hadoop Data Analytic Service in a Cloud Platform", 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress *
郑万波: "Research on service quality prediction and optimized scheduling of cloud computing systems in low-reliability environments", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301067A (en) * 2020-04-01 2021-08-24 阿里巴巴集团控股有限公司 Cloud configuration recommendation method and device for machine learning application
CN114996228A (en) * 2022-06-01 2022-09-02 南京大学 Server-unaware-oriented data transmission cost optimization method
CN114996228B (en) * 2022-06-01 2025-01-03 南京大学 A server-aware data transmission cost optimization method
CN115118592A (en) * 2022-06-15 2022-09-27 中国科学院软件研究所 Cloud configuration recommendation method and system for deep learning applications based on operator feature analysis
CN115118592B (en) * 2022-06-15 2023-08-08 中国科学院软件研究所 Method and system for deep learning application cloud configuration recommendation based on operator feature analysis

Also Published As

Publication number Publication date
CN110048886B (en) 2020-05-12

Similar Documents

Publication Publication Date Title
Song et al. A hadoop mapreduce performance prediction method
CN111309479A (en) Method, device, equipment and medium for realizing task parallel processing
CN107908536B (en) Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
CN111258767A (en) Intelligent cloud computing resource allocation method and device for complex system simulation application
CN106446134B (en) Local multi-query optimization method based on predicate specification and cost estimation
Osman et al. Towards real-time analytics in the cloud
Hua et al. Hadoop configuration tuning with ensemble modeling and metaheuristic optimization
CN110048886A (en) A kind of efficient cloud configuration selection algorithm of big data analysis task
CN112052081B (en) Task scheduling method and device and electronic equipment
Javanmardi et al. A unit-based, cost-efficient scheduler for heterogeneous Hadoop systems
Naik et al. A review of adaptive approaches to MapReduce scheduling in heterogeneous environments
Scheinert et al. Karasu: A collaborative approach to efficient cluster configuration for big data analytics
Shichkina et al. Applying the list method to the transformation of parallel algorithms into account temporal characteristics of operations
Ying et al. Towards fault tolerance optimization based on checkpoints of in-memory framework spark
CN109918410B (en) Distributed big data functional dependency discovery method based on Spark platform
CN114443236A (en) Task processing method, device, system, equipment and medium
Koch et al. SMiPE: estimating the progress of recurring iterative distributed dataflows
CN105046378A (en) Operation scheduling method based on seismic data
Huang et al. A novel compression algorithm decision method for spark shuffle process
Ouyang et al. Mitigate data skew caused stragglers through ImKP partition in MapReduce
CN115344386A (en) Cloud Simulation Computing Resource Prediction Method, Device and Equipment Based on Sorting Learning
Zhang et al. Parallel Clustering Optimization Algorithm Based on MapReduce in Big Data Mining.
Hu et al. Reloca: Optimize resource allocation for data-parallel jobs using deep learning
Ding et al. An efficient query processing optimization based on ELM in the cloud

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant