CN110048886A - An efficient cloud configuration selection algorithm for big data analysis tasks - Google Patents

An efficient cloud configuration selection algorithm for big data analysis tasks

Info

Publication number
CN110048886A
Authority
CN
China
Prior art keywords
cloud
task
data
experiment
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910294273.4A
Other languages
Chinese (zh)
Other versions
CN110048886B (en)
Inventor
陈艳姣
林龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910294273.4A priority Critical patent/CN110048886B/en
Publication of CN110048886A publication Critical patent/CN110048886A/en
Application granted granted Critical
Publication of CN110048886B publication Critical patent/CN110048886B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention proposes an efficient cloud configuration selection algorithm for big data analysis tasks. Small-scale cluster experiments are run on a selected fraction of the input data to construct a performance prediction model; the prediction model is then used to estimate the performance of the task on large-scale clusters, and the prediction results are used to determine the optimal cloud configuration. With this algorithm, users can be helped to find the optimal cloud configuration effectively, with lower model training time and cost. Selecting the optimal cloud configuration for a large-scale data analysis task to be deployed on a cloud computing platform can significantly improve its operational efficiency and reduce its operating cost.

Description

An efficient cloud configuration selection algorithm for big data analysis tasks
Technical field
The invention belongs to the field of cloud computing, and more particularly relates to an efficient cloud configuration selection algorithm for big data analysis tasks.
Background technique
Large-scale data analysis tasks are growing rapidly, and the tasks involved are increasingly complex, frequently involving machine learning, natural language processing, image processing, and so on. Compared with traditional computing tasks, such tasks are usually data-intensive and computation-intensive and require longer computing time and higher computing cost. Therefore, the huge computing power of cloud computing is usually used to help complete large-scale data analysis tasks. Selecting the optimal cloud configuration for a large-scale analysis task can improve the operational efficiency of the task and reduce the user's computing cost.
In order to meet different computing requirements, existing cloud service providers offer users hundreds of instance types with different resource configurations (such as Amazon's EC2, Microsoft's Azure, and Google's Compute Engine). Although most cloud service providers only allow users to select instance types from the pool of available instance types, Google's Compute Engine allows users to custom-configure virtual machines (configuring vCPUs and memory), which makes choosing the right cloud configuration even more challenging. In addition, major cloud service providers also offer serverless cloud frameworks (such as Amazon Lambda, Google Cloud Functions and Microsoft Azure Functions); such services allow users to run tasks as serverless functions without launching instances with pre-specified configurations. However, a serverless framework may require restructuring the application's code, and serverless cloud providers cannot help users minimize the task completion time or reduce the computing cost.
The selection of the cloud configuration, i.e., the choice of instance type and number of instances, directly affects the completion time of the task and the economic cost it incurs. A correctly selected cloud configuration can achieve the same performance objective at lower cost. Because large-scale data analysis tasks have long running times, uncovering potential cost savings is all the more important. The diversity of tasks, together with the diversity of combinations of instance types and cluster sizes, makes the search space of cloud configurations enormous.
In such a huge search space, exhaustive search for the best cloud configuration is neither practical nor scalable. To limit the search space, the CherryPick algorithm restricts the search space by using limited task information and selects the best cloud configuration on that basis. CherryPick is optimized for cost minimization, but it cannot be used to optimize other objectives, such as minimizing the completion time under a cost budget. In addition, Ernest and PARIS select cloud configurations by means of performance modeling. Using such performance prediction models, users can choose different cloud configurations for tasks with different optimization objectives, for example, the cheapest or the fastest cloud configuration. However, Ernest needs to train a prediction model for each instance type, and PARIS only selects a preferred instance type across multiple public clouds and cannot provide the cluster size.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes an efficient cloud configuration selection algorithm for big data analysis tasks.
The technical scheme of the present invention is an efficient cloud configuration selection algorithm for big data analysis tasks, comprising the following steps:
Step 1: training data collection stage, implemented as follows.
The training data collector runs experiments of a particular instance type on only a small fraction of the input data, which will be used to predict the performance of the task executed on the entire input data. Training data collection consists of experiment selection and experiment execution.
Experiment selection: in experiment selection, two important experiment parameters need to be determined: (1) the ratio, i.e., the proportion of the total input data that the experiment uses; and (2) the number of cloud server instances used in task execution. The present invention uses statistical techniques to select a subset of experiment parameters, mainly using the experiment parameters that can generate as much information as possible to predict the runtime performance of the task, so as to guarantee higher prediction accuracy. Let E_i = (x_i, y_i) denote an experiment parameter setting, where x_i is the instance number and y_i is the input data ratio. Let M denote the total number of experiment parameter settings obtained by enumerating all possible ratios and instance numbers. Then, using E_i, a K-dimensional feature vector F_i can be computed, in which each element corresponds to one term of the prediction model. In this way, M feature vectors are obtained for all experimental setups. According to D-optimality, in experiment parameter selection, the experiment parameters that maximize the weighted covariance matrix (information matrix) Σ_{i=1}^{M} α_i F_i F_i^T are selected, i.e., max Σ_{i=1}^{M} α_i F_i F_i^T, subject to the constraints 0 ≤ α_i ≤ 1, i ∈ [1, M], and Σ_{i=1}^{M} α_i · y_i / x_i ≤ B, where α_i denotes the probability of selecting the i-th experimental setup. The total cost of the experiments is expressed by adding the budget constraint term B, where y_i / x_i is the cost of running experiment E_i according to the pricing model of the cloud platform. After solving the above optimization problem, the M experimental setups are sorted in non-increasing order of the probabilities α_i, and the top-ranked data groups are selected as training data. In the present invention, the first 10 data groups are selected as training data.
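As an illustrative sketch of this experiment-selection step, the D-optimality problem can be relaxed and handed to a general-purpose solver, after which the setups are ranked by α_i. The candidate fit terms, the ratio range, the budget value and the use of SciPy's SLSQP solver below are assumptions made only for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def feature_vector(x, y):
    # Candidate fit terms of the prediction model (illustrative set): constant, y/x, sqrt(y)/x, log(x), x.
    return np.array([1.0, y / x, np.sqrt(y) / x, np.log(x), float(x)])

def select_experiments(max_instances=8, ratios=np.arange(0.01, 0.09, 0.01), budget=0.2, top_k=10):
    # Enumerate all M candidate experiment settings E_i = (x_i, y_i).
    settings = [(x, y) for x in range(1, max_instances + 1) for y in ratios]
    F = np.array([feature_vector(x, y) for x, y in settings])   # M x K feature matrix
    cost = np.array([y / x for x, y in settings])                # experiment cost proxy y_i / x_i
    M, K = F.shape

    def neg_logdet(alpha):
        # D-optimality: maximize log det of the information matrix sum_i alpha_i * F_i F_i^T.
        info = (F * alpha[:, None]).T @ F + 1e-9 * np.eye(K)
        return -np.linalg.slogdet(info)[1]

    res = minimize(
        neg_logdet,
        x0=np.full(M, 0.1),
        bounds=[(0.0, 1.0)] * M,                                              # 0 <= alpha_i <= 1
        constraints=[{"type": "ineq", "fun": lambda a: budget - cost @ a}],   # sum_i alpha_i * y_i/x_i <= B
        method="SLSQP",
    )
    # Sort the setups in non-increasing order of alpha_i and keep the top_k as training experiments.
    order = np.argsort(-res.x)
    return [settings[i] for i in order[:top_k]]

for x, y in select_experiments():
    print(f"instances={x}, data ratio={y:.2f}")
```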
Experiment execution: after the experimental setups have been selected, it must be determined which data samples from the entire input data set are used to form the experimental data set, so as to meet the specified ratio. The present invention uses random sampling to select data samples from the entire input data set, because random sampling avoids falling into an isolated region of the data set. After the small data set is obtained, the specified number of instances is deployed according to the selected experimental setup and the task is run; the experiment parameters and the task completion time are then used as training data for constructing the prediction model.
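A minimal sketch of the sampling used in experiment execution, assuming the input data set is available as an in-memory sequence; the data set, ratio and seed below are illustrative.

```python
import random

def sample_experiment_data(dataset, ratio, seed=None):
    # Uniform random sampling without replacement avoids concentrating on an isolated region of the data set.
    rng = random.Random(seed)
    target = int(len(dataset) * ratio)
    return rng.sample(dataset, target)

# Example: take 2% of the input data for a small-scale experiment on the selected number of instances.
subset = sample_experiment_data(list(range(1_000_000)), ratio=0.02, seed=42)
```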
Step 2: model construction stage, implemented as follows.
The model builder is composed of a model constructor and a model transformer. Using the collected training data of a particular instance type, the model constructor establishes a basic prediction model. Afterwards, the model transformer derives the prediction models of the remaining instance types by converting the basic prediction model.
Model constructor: when running an experiment on a subset of the input data set on a particular instance type, T_base(x, y) denotes the task running time when the instance number is x and the ratio of the data set is y. A large-scale analysis task usually runs in consecutive steps (i.e., iterations) until a termination condition is met. Each step mainly consists of two phases: parallel computation and data communication. The computation time of task execution is correlated with the data set size, and there are several representative communication patterns in large-scale analysis tasks. Therefore, the running time of a large-scale analysis task is inferred by analyzing the computation time and the communication time. The main goal of the present invention here is to design fit terms involving x and y according to the computation and communication patterns of the task, so as to obtain the performance prediction function T_base(x, y) of the given task.
Computation time: the user-defined iterative algorithm incurs a time cost for operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fit terms according to the characteristics of the data set (for example, dense or sparse) and of the algorithm. The computation time can thus be expressed as a function of the number of instances and the scale of the data set.
Communication time: the time cost incurred by transmitting data through the network to the destination node. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Although cluster applications differ in programming model and execution mechanism, the common communication patterns can represent most of their communication situations. The communication time is mainly a function of the number of instances, and the fit terms of the function can be inferred from the different communication patterns of the task. For example, when the data size per instance is constant, the communication time grows linearly with the instance number for the partition-aggregate communication pattern, but is quadratic in the instance number for the shuffle communication pattern.
Given all candidate fit terms of the function T_base(x, y), mutual information is used as the selection criterion for fit terms: redundant terms are excluded and only good predictors are selected as fit terms. Let {f_1, ..., f_K} denote the set of all candidate terms, where each term f_k is a function of x and y determined by the computation and communication patterns. For the m training data samples collected with different numbers of instances and different data scales, the K-dimensional feature vector of each experimental setup is first computed, F_i = (f_{1,i}, ..., f_{K,i}), for example f_{k,i} = y_i / x_i. Then, the mutual information between each term and the running time is computed, and the terms whose mutual information with the running time is above a threshold are selected. From the m training running-time samples, the basic prediction model T_base(x, y) = Σ_{k=1}^{K} β_k w_k f_k(x, y) is fitted, obtaining the values of w_k, where β_k indicates whether the fit term f_k has been selected (β_k = 1 indicates that the term is selected).
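The term selection and model fitting just described can be sketched as follows. The concrete candidate terms, the mutual-information threshold, and the use of scikit-learn's mutual_info_regression and SciPy's non-negative least squares as the estimator and solver are illustrative assumptions, not choices fixed by the invention.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.feature_selection import mutual_info_regression

# Candidate fit terms f_k(x, y); the concrete set is assumed for illustration.
CANDIDATE_TERMS = {
    "const":    lambda x, y: 1.0,
    "y_over_x": lambda x, y: y / x,
    "sqrt":     lambda x, y: np.sqrt(y) / x,
    "log_x":    lambda x, y: np.log(x),
    "x":        lambda x, y: float(x),
}

def fit_base_model(samples, threshold=0.05):
    # samples: list of (x_i, y_i, runtime_i) collected by the training data collector.
    X = np.array([[f(x, y) for f in CANDIDATE_TERMS.values()] for x, y, _ in samples])
    t = np.array([r for _, _, r in samples])

    # Keep only the terms whose mutual information with the running time exceeds the threshold (beta_k = 1).
    mi = mutual_info_regression(X, t)
    selected = [k for k, m in enumerate(mi) if m > threshold]

    # Non-negative least squares gives the weights w_k of the selected terms.
    w, _ = nnls(X[:, selected], t)

    names = list(CANDIDATE_TERMS)
    def T_base(x, y):
        return sum(wk * CANDIDATE_TERMS[names[k]](x, y) for wk, k in zip(w, selected))
    return T_base
```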
Model transformer: cloud providers usually offer various instance families with different combinations of CPU, memory, disk and network capacity to meet the needs of different workloads, for example general-purpose and compute/memory/storage-optimized families. Extensive experiments show that, for a given task and a fixed data set, the running time on one instance type can be converted to that on a different instance type according to a simple mapping. Therefore, there is no need to run experiments on every instance type to obtain training data and then construct a prediction model for it, which greatly reduces the training time and training cost.
The transformer Φ is a mapping φ from the basic prediction model to the target prediction model: T_base(x, y) → T_target(x, y). Comparing the running times of different instance types under the same task and data set scale shows that the fit-term categories in the prediction functions are similar. In other words, under the same task and data set scale, if f_k is included in T_base(x, y), then T_target(x, y) is very likely to also include f_k. This is mainly because, under the same application configuration and instance number, the computation and communication patterns of the task remain basically unchanged. However, the weight of each term will differ across instance types, so the focus is on mapping the weights from the basic prediction model to the target prediction model. A simple and effective mapping method is used in the present invention. Let the experimental setup with the lowest cost among those selected by the training data collector have running time t_base on the basic instance type. This experiment is run on the target instance type to obtain the running time t_target. The model transformer then outputs the prediction model of the target instance type as T_target(x, y) = μ · T_base(x, y), where μ = t_target / t_base.
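A minimal sketch of this transformer, under the reading that the target model is the base model scaled by the ratio of the two measured running times; the function names and the example numbers are illustrative.

```python
def transform_model(T_base, t_base, t_target):
    # One extra experiment on the target instance type yields t_target; t_base is the
    # running time of the same (lowest-cost) experiment on the basic instance type.
    mu = t_target / t_base
    return lambda x, y: mu * T_base(x, y)

# Example: if the cheapest training experiment took 120 s on the basic type and 95 s on the
# target type, the target model is T_target(x, y) = (95 / 120) * T_base(x, y).
```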
Step 3: selector construction stage, implemented as follows.
The running time prediction models of all instance types are integrated into a single running time predictor T(x, y), where x is the cloud configuration vector composed of the type and number of instances. For the given input data set of the task, the goal is to enable users to find the best cloud configuration that meets specific running time and cost constraints. Let P(x) be the unit-time price of cloud configuration x, i.e., the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can be expressed as x* = S(T(x, y), C(x), R(y)), where C(x) = P(x) × T(x, y), 0 ≤ y ≤ 1,
where C(x) is the cost of running the task under cloud configuration x, and R(y) is a constraint added by the user, such as a maximum tolerable running time or a maximum tolerable cost. The selector S(·) is determined by the user and selects the best cloud configuration x* that meets the expected performance or cost.
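A sketch of how such a selector could be realized on top of the per-type predictors; the candidate types, prices, cluster-size range and constraint handling are illustrative assumptions (y = 1 denotes the full input data set).

```python
def select_configuration(predictors, prices, max_instances=32, max_runtime=None, max_cost=None):
    # predictors: {instance_type: T(x, y)} running-time models; prices: {instance_type: price per instance-hour}.
    # Returns the cheapest feasible configuration as (cost, instance_type, instance_count, runtime).
    best = None
    for itype, T in predictors.items():
        for x in range(1, max_instances + 1):
            runtime = T(x, 1.0)                 # predicted hours on the full input data
            cost = prices[itype] * x * runtime  # C(x) = P(x) * T(x, y)
            if max_runtime is not None and runtime > max_runtime:
                continue
            if max_cost is not None and cost > max_cost:
                continue
            if best is None or cost < best[0]:
                best = (cost, itype, x, runtime)
    return best
```

The same loop can just as well rank feasible configurations by predicted running time under a cost budget, which is the objective used in the embodiment below.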
Brief description of the drawings
Fig. 1 is a schematic diagram of the communication patterns of the invention.
Fig. 2 is the overall design structure diagram of the invention.
Fig. 3 is the effectiveness comparison diagram of the invention.
Fig. 4 shows the prediction accuracy of the invention on Spark.
Fig. 5 shows the prediction accuracy of the invention on Hadoop.
Fig. 6 compares the total task time and the model training time of the invention.
Fig. 7 shows the prediction accuracy of the invention for TeraSort under different data set sizes.
Fig. 8 shows the cost of WordCount of the invention on different instance types.
Fig. 9 shows the completion times of TeraSort and WordCount of the invention at different cluster scales.
Specific embodiment
According to the computation patterns and communication patterns of big data analysis tasks, the present invention proposes an efficient cloud configuration selection framework for big data analysis tasks, which allows users to find a cloud configuration suited to a given big data analysis task and thus greatly reduces the computing cost of large-scale data analysis tasks. The framework establishes the prediction model through a small number of experiments, using little input data and small-scale clusters, and it can intelligently convert the prediction model of one instance type into the prediction model of another instance type through a few additional experiments. With the cloud configuration selection framework of the present invention, cloud computing users can determine the best cloud configuration at a lower cost.
Referring to Fig. 2, the embodiment elaborates the process of the present invention by taking the cloud configuration selection algorithm for big data analysis tasks (named Silhouette) implemented on Amazon Web Services (AWS) as an example, as follows:
Step 1: training data collection stage, implemented as follows.
The training data collector runs experiments of a particular instance type on only a small fraction of the input data, which will be used to predict the performance of the task executed on the entire input data. Training data collection consists of experiment selection and experiment execution.
Experiment selection: in experiment selection, two important experiment parameters need to be determined: (1) the ratio, i.e., the proportion of the total input data that the experiment uses; and (2) the number of cloud server instances used in task execution. In this embodiment, statistical techniques are used to select a subset of experiment parameters, mainly using the experiment parameters that can generate as much information as possible to predict the runtime performance of the task, so as to guarantee higher prediction accuracy. Let E_i = (x_i, y_i) denote an experiment parameter setting, where x_i is the instance number and y_i is the input data ratio. Let M denote the total number of experiment parameter settings obtained by enumerating all possible ratios and instance numbers. Then, using E_i, we can compute a K-dimensional feature vector F_i, in which each element corresponds to one term of the prediction model. In this way, we obtain M feature vectors for all experimental setups. According to D-optimality, in experiment parameter selection, we select the experiment parameters that maximize the weighted covariance matrix (information matrix) Σ_{i=1}^{M} α_i F_i F_i^T, i.e., max Σ_{i=1}^{M} α_i F_i F_i^T, subject to the constraints 0 ≤ α_i ≤ 1, i ∈ [1, M], and Σ_{i=1}^{M} α_i · y_i / x_i ≤ B, where α_i denotes the probability of selecting the i-th experimental setup. We express the total cost of the experiments by adding the budget constraint term B, where y_i / x_i is the cost of running experiment E_i according to the pricing model of the cloud platform. After solving the above optimization problem, the M experimental setups are sorted in non-increasing order of the probabilities α_i and the experiments to run are chosen accordingly.
Experiment execution: after the experimental setups have been selected, it is necessary to determine which data samples from the entire input data set are used to form the experimental data set, so as to meet the specified ratio. The present invention uses random sampling to select data samples from the entire input data set, because random sampling avoids falling into an isolated region of the data set. After the small data set is obtained, the specified number of instances is deployed according to the selected experimental setup and the task is run; the experiment parameters and the task completion time are then used as training data for constructing the prediction model.
The specific implementation process of the embodiment is described as follows:
The large-scale data analysis processing engines used in the embodiment are Spark and Hadoop. On Spark, we run three kinds of machine learning tasks based on SparkML: classification, regression and clustering. The classification algorithm uses the text-classification benchmark data set rcv1 with 44,000 features, while the regression and clustering algorithms use a synthetic data set of 1,000,000 samples with 44,000 features. On Hadoop, the TeraSort algorithm and the WordCount algorithm are run separately. TeraSort is a common benchmark application for large-scale data analysis whose main work is to sort randomly generated records; it uses a data set of 200,000,000 samples. WordCount is used to compute the word frequencies in 55,000,000 entries of Wikipedia articles.
In the EC2 instance type pool of AWS, m4.large (general purpose), c5.large (compute optimized), r4.large (memory optimized) and i3.large (storage optimized) are selected; each instance type has 2 vCPUs and comes pre-installed with a Linux system. The data analysis processing engines used in the experiments are Apache Spark 2.2 and Hadoop 2.8, respectively. Table 1 lists the configuration and price of each instance type.
Table 1
Instance type   Memory (GiB)   Instance storage   Price (USD/hour)
m4.large        8              EBS                0.1
c5.large        4              EBS                0.085
r4.large        15.25          EBS                0.133
i3.large        15.25          SSD                0.156
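As a worked example of how these prices enter the cost model, the unit-time price P(x) of a configuration of 4 c5.large instances is 4 × 0.085 = 0.34 dollars per hour; if the predictor estimates a running time of 2 hours on the full data set, the predicted total cost is C(x) = P(x) × T(x, y) = 0.34 × 2 = 0.68 dollars (the instance count and running time here are hypothetical).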
First, the data scale for the modeling experiments is set to 1% to 8% of the input data, and the experiment cluster scale is limited to 1 to 8 instances. In the embodiment, the 10 experiments with the largest probabilities α_i are run. When selecting input data samples, a starting seed sample is randomly chosen from the input data set; then, in each sampling step, an output sample is obtained at random; this process is repeated until the number of selected samples meets the scale requirement in the experiment parameters. In the embodiment, m4.large is used as the basic instance type, so the randomly sampled data set is finally run on an m4.large cluster of the size given by the experiment parameters, and the running time is recorded.
Step 2: model construction stage, implemented as follows.
The model builder is composed of a model constructor and a model transformer. Using the collected training data of a particular instance type, the model constructor establishes a basic prediction model. Afterwards, the model transformer derives the prediction models of the remaining instance types by converting the basic prediction model.
Model constructor: when running an experiment on a subset of the input data set on a particular instance type, T_base(x, y) denotes the task running time when the instance number is x and the ratio of the data set is y. A large-scale analysis task usually runs in consecutive steps (i.e., iterations) until a termination condition is met. Each step mainly consists of two phases: parallel computation and data communication. The computation time of task execution is correlated with the data set size, and there are several representative communication patterns in large-scale analysis tasks. Therefore, the running time of a large-scale analysis task can be inferred by analyzing the computation time and the communication time. The main goal of this embodiment is to design fit terms involving x and y according to the computation and communication patterns of the task, so as to obtain the performance prediction function T_base(x, y) of the given task.
Computation time: the user-defined iterative algorithm incurs a time cost for operating on each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fit terms according to the characteristics of the data set (for example, dense or sparse) and of the algorithm. The computation time can thus be expressed as a function of the number of instances and the scale of the data set. Determining the exact fit terms of the function requires combining specific domain knowledge.
Communication time: the time cost incurred by transmitting data through the network to the destination node. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Although cluster applications differ in programming model and execution mechanism, the common communication patterns can represent most of their communication situations. The communication time is mainly a function of the number of instances, and the fit terms of the function can be inferred from the different communication patterns of the task. For example, when the data size per instance is constant, the communication time grows linearly with the instance number for the partition-aggregate communication pattern, but is quadratic in the instance number for the shuffle communication pattern.
Given all candidate fit terms of the function T_base(x, y), we use mutual information as the selection criterion for fit terms: redundant terms are excluded and only good predictors are selected as fit terms. Let {f_1, ..., f_K} denote the set of all candidate terms, where each term f_k is a function of x and y determined by the computation and communication patterns. For the m training data samples collected with different numbers of instances and different data scales, the K-dimensional feature vector of each experimental setup is first computed, F_i = (f_{1,i}, ..., f_{K,i}), for example f_{k,i} = y_i / x_i. Then, we compute the mutual information between each term and the running time, and select the terms whose mutual information with the running time is above a threshold. From the m training running-time samples, the basic prediction model T_base(x, y) = Σ_{k=1}^{K} β_k w_k f_k(x, y) is fitted, obtaining the values of w_k, where β_k indicates whether the fit term f_k has been selected (β_k = 1 indicates that the term is selected).
Model transformer: cloud providers usually offer various instance families with different combinations of CPU, memory, disk and network capacity to meet the needs of different workloads, for example general-purpose and compute/memory/storage-optimized families. Through extensive experiments, we find that for a given task and a fixed data set, the running time on one instance type can be converted to that on a different instance type according to a simple mapping. Therefore, there is no need to run experiments on every instance type to obtain training data and then construct a prediction model for it, which greatly reduces the training time and training cost.
The transformer Φ is a mapping φ from the basic prediction model to the target prediction model: T_base(x, y) → T_target(x, y). Comparing the running times of different instance types under the same task and data set scale shows that the fit-term categories in the prediction functions are similar. In other words, under the same task and data set scale, if f_k is included in T_base(x, y), then T_target(x, y) is very likely to also include f_k. This is mainly because, under the same application configuration and instance number, the computation and communication patterns of the task remain basically unchanged. However, the weight of each term will differ across instance types, so we need to focus on mapping the weights from the basic prediction model to the target prediction model. We use a simple and effective mapping method. Let the experimental setup with the lowest cost among those selected by the training data collector have running time t_base on the basic instance type. We run this experiment on the target instance type to obtain the running time t_target. The model transformer then outputs the prediction model of the target instance type as T_target(x, y) = μ · T_base(x, y), where μ = t_target / t_base.
The specific implementation of the embodiment is as follows:
In the embodiment, the fit terms added to the prediction function are: a constant term, the linear term y/x, and a term combining the square root of the data scale with the instance number, √y/x. The fixed constant represents the time spent on serial computation; for algorithms whose computation time is linear in the data set size, the fit term y/x of the data ratio over the instance number is added; for sparse data sets, the fit term √y/x combining the square root of the data scale with the instance number is added.
Table 2
Communication pattern    Structure        Fit term
Parallel read/write      Many one-to-one  x
Partition-aggregate      Many-to-one      log x
Broadcast                One-to-many      x
Collect                  Many-to-one      x
Shuffle                  Many-to-many     x²
Global communication     All-to-all       x²
In the embodiment, according to the communication patterns of the different tasks, the communication fit terms shown in Table 2 are used, namely x, log x and x². After all terms are selected, the basic prediction model is computed using a non-negative least squares (NNLS) solver. Afterwards, the lowest-cost experimental setup among the basic experiments is selected, and the task is run on the target instance type with the same experimental setup. Finally, the prediction model output for each instance type is T_target(x, y) = (t_target / t_base) · T_base(x, y).
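Putting the computation terms listed above together with one of the communication terms from Table 2, the fitted base model of the embodiment can be read as having the following illustrative form (the grouping into a single communication term g(x) is an assumption made for clarity, not a formula given in the original text):

T_base(x, y) = w_0 + w_1 · (y / x) + w_2 · (√y / x) + w_3 · g(x),   with g(x) ∈ {x, log x, x²} depending on the task's communication pattern.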
Step 3: selector construction stage, implemented as follows.
The running time prediction models of all instance types are integrated into a single running time predictor T(x, y), where x is the cloud configuration vector composed of the type and number of instances. For the given input data set of the task, the goal is to enable users to find the best cloud configuration that meets specific running time and cost constraints. Let P(x) be the unit-time price of cloud configuration x, i.e., the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can be expressed as x* = S(T(x, y), C(x), R(y)), where C(x) = P(x) × T(x, y), 0 ≤ y ≤ 1,
where C(x) is the cost of running the task under cloud configuration x, and R(y) is a constraint added by the user, such as a maximum tolerable running time or a maximum tolerable cost. The selector S(·) is determined by the user and selects the best cloud configuration x* that meets the expected performance or cost.
The specific implementation process of the embodiment is described as follows:
After the prediction models of all candidate instance types have been obtained for all tasks, the optimal cloud configuration scheme with the lowest running cost needs to be found. The cloud configuration scheme must satisfy the requirement that, under the given cost budget, the task is completed in the shortest time. For the embodiment, the algorithm is evaluated with four kinds of test results: effectiveness, prediction accuracy, training cost and application scalability.
Effectiveness: the performance of SILHOUETTE and Ernest is compared on 5 tasks. Fig. 3(a) shows that the prediction accuracy of SILHOUETTE is comparable to that of Ernest, and Fig. 3(b) shows that the training time and training cost of SILHOUETTE are far below those of Ernest. When we build prediction models for 2 instance types, SILHOUETTE saves 25% of the training time and 30% of the cost. From Fig. 3(c) it can be seen that when there are more candidate instance types, the training time and training cost of SILHOUETTE are much lower than those of Ernest; with 5 candidate instance types, the training times of SILHOUETTE and Ernest are 25 minutes and 83 minutes, respectively. With even more candidate instance types, SILHOUETTE can be expected to perform still better.
Prediction accuracy: Figs. 4 and 5 show that the m4.large basic prediction model and the transformed c5.large prediction model both achieve high prediction accuracy, which demonstrates the effectiveness of the model transformer in SILHOUETTE.
Training cost: SILHOUETTE aims to find the optimal cloud configuration with low overhead. Therefore, the completion time of the entire task is compared with the time spent collecting training data for building the basic prediction model. Fig. 6 shows that, except for TeraSort, the training time of SILHOUETTE is below 20% of the total completion time for all applications.
Application scalability: on data sets of different sizes, SILHOUETTE builds the basic and transformed prediction models with the same experimental setups, and their prediction accuracy is evaluated. Fig. 7 shows that when we use 1.5×, 2×, 2.5× and 3× the data set size, the prediction error is consistently below 15%, which shows that even when the size of the data set changes, the prediction models established by SILHOUETTE still maintain high accuracy.
In the present embodiment, SILHOUETTE is used to select the best cloud configuration for WordCount. Considering the four instance types in Table 1, assume that the selector's optimization objective is: given a maximum task completion time, minimize the total cost. Fig. 8 shows the total time and total cost of running the task on the entire data set with each instance type. We can observe that the total time of the compute-optimized instance type c5.large is comparable to that of the storage-optimized instance type i3.large, and SILHOUETTE will select the cheaper of the two.
SILHOUETTE can then be used to determine the preferred number of instances for a given instance type. Consider two tasks, TeraSort and WordCount. Fig. 9 gives the running times of the two tasks at different cluster sizes; the running time predicted by SILHOUETTE is very close to the actual running time, so the specific cluster scale can be selected accordingly.
The specific embodiments described herein are merely illustrative of the spirit of the present invention. Those skilled in the art to which the present invention belongs can make various modifications or additions to the described embodiments or replace them in a similar manner without departing from the spirit of the present invention or exceeding the scope defined by the appended claims.

Claims (6)

1. An efficient cloud configuration selection algorithm for big data analysis tasks, characterized in that it comprises the following steps:
Step 1: training data collection: choosing multiple proportions of the input data and the corresponding numbers of cloud server instances used in task execution, and determining the experiment parameters and task completion time of each group, wherein a proportion refers to the ratio of the data used by the experiment to the total input data;
Step 2: model construction: using the experiment parameters and task completion times from step 1, together with the input data ratio and the instance number, designing fitting polynomials involving the input data ratio and the instance number, and determining the values of w_k in the basic prediction model T_base(x, y) = Σ_{k=1}^{K} β_k w_k f_k(x, y), where β_k indicates whether the fit term f_k has been selected (β_k = 1 indicates that the term is selected);
Model conversion: running the least costly experiment setting from step 1 on the target instance type to obtain the running time t_target, and, by means of a mapping, outputting the prediction model of the target instance type as T_target(x, y) = μ · T_base(x, y), where μ = t_target / t_base;
Step 3: selector construction:
for the given input data set of the task, using the prediction models obtained in step 2 to compute the best cloud configuration that satisfies the specified running time and cost constraints.
2. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that:
the detailed process of choosing, in step 1, the numbers of cloud server instances used in task execution corresponding to the multiple proportions of input data is as follows:
first choosing input data proportions within a certain range and cloud server instance numbers within a certain range; according to D-optimality, in experiment parameter selection, selecting the experiment parameters that maximize the weighted covariance matrix (information matrix) Σ_{i=1}^{M} α_i F_i F_i^T, i.e., max Σ_{i=1}^{M} α_i F_i F_i^T, subject to the constraints 0 ≤ α_i ≤ 1, i ∈ [1, M], and Σ_{i=1}^{M} α_i · y_i / x_i ≤ B, where α_i denotes the probability of selecting the i-th experimental setup, x_i is the instance number, y_i is the input data ratio, and M denotes the total number of experiment parameter settings obtained by enumerating all possible ratios and instance numbers;
expressing the total cost of the experiments by adding the budget constraint term B, where y_i / x_i is the cost of running experiment E_i according to the pricing model of the cloud platform;
sorting the M experimental setups in non-increasing order of the probabilities α_i, and selecting the top-ranked experiment parameter groups in the sorted order as training data.
3. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that: the first 10 data groups in the non-increasing order are selected as training data.
4. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 2, characterized in that:
the range of input data proportions in step 1 is specifically 1% to 10% of the data, and the range of cloud server instance numbers is 1 to 10 instances.
5. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that:
the proportions of input data described in step 1 are chosen from the entire input data set by random sampling.
6. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that: the fit terms in the model construction relate to the computation time and the communication time.
CN201910294273.4A 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task Active CN110048886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910294273.4A CN110048886B (en) 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910294273.4A CN110048886B (en) 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task

Publications (2)

Publication Number Publication Date
CN110048886A true CN110048886A (en) 2019-07-23
CN110048886B CN110048886B (en) 2020-05-12

Family

ID=67277094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910294273.4A Active CN110048886B (en) 2019-04-12 2019-04-12 Efficient cloud configuration selection algorithm for big data analysis task

Country Status (1)

Country Link
CN (1) CN110048886B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301067A (en) * 2020-04-01 2021-08-24 阿里巴巴集团控股有限公司 Cloud configuration recommendation method and device for machine learning application
CN114996228A (en) * 2022-06-01 2022-09-02 南京大学 Server-unaware-oriented data transmission cost optimization method
CN115118592A (en) * 2022-06-15 2022-09-27 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103220337A (en) * 2013-03-22 2013-07-24 合肥工业大学 Cloud computing resource optimizing collocation method based on self-adaptation elastic control
CN108053026A (en) * 2017-12-08 2018-05-18 武汉大学 A kind of mobile application background request adaptive scheduling algorithm
US20180285903A1 (en) * 2014-04-04 2018-10-04 International Business Machines Corporation Network demand forecasting
CN109088747A (en) * 2018-07-10 2018-12-25 郑州云海信息技术有限公司 The management method and device of resource in cloud computing system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103220337A (en) * 2013-03-22 2013-07-24 合肥工业大学 Cloud computing resource optimizing collocation method based on self-adaptation elastic control
US20180285903A1 (en) * 2014-04-04 2018-10-04 International Business Machines Corporation Network demand forecasting
CN108053026A (en) * 2017-12-08 2018-05-18 武汉大学 A kind of mobile application background request adaptive scheduling algorithm
CN109088747A (en) * 2018-07-10 2018-12-25 郑州云海信息技术有限公司 The management method and device of resource in cloud computing system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN-CHUN CHEN 等: "Using Deep Learning to Predict and Optimize Hadoop Data Analytic Service in a Cloud Platform", 《2017 IEEE 15TH INTL CONF ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING, 15TH INTL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING, 3RD INTL CONF ON BIG DATA INTELLIGENCE AND COMPUTING AND CYBER SCIENCE AND TECHNOLOGY CONGRESS》 *
郑万波: "Research on Quality-of-Service Prediction and Optimized Scheduling for Cloud Computing Systems in Low-Reliability Environments", China Doctoral Dissertations Full-text Database, Information Science and Technology series *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113301067A (en) * 2020-04-01 2021-08-24 阿里巴巴集团控股有限公司 Cloud configuration recommendation method and device for machine learning application
CN114996228A (en) * 2022-06-01 2022-09-02 南京大学 Server-unaware-oriented data transmission cost optimization method
CN115118592A (en) * 2022-06-15 2022-09-27 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis
CN115118592B (en) * 2022-06-15 2023-08-08 中国科学院软件研究所 Deep learning application cloud configuration recommendation method and system based on operator feature analysis

Also Published As

Publication number Publication date
CN110048886B (en) 2020-05-12

Similar Documents

Publication Publication Date Title
Du et al. A novel data placement strategy for data-sharing scientific workflows in heterogeneous edge-cloud computing environments
Jin et al. MRPGA: an extension of MapReduce for parallelizing genetic algorithms
Mishra et al. Towards characterizing cloud backend workloads: insights from google compute clusters
Verma et al. Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance
CN107908536B (en) Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment
CN110740079B (en) Full link benchmark test system for distributed scheduling system
CN110048886A (en) A kind of efficient cloud configuration selection algorithm of big data analysis task
CN106803799B (en) Performance test method and device
CN118069380B (en) Computing power resource processing method
Hua et al. Hadoop configuration tuning with ensemble modeling and metaheuristic optimization
Dongarra et al. Parallel processing and applied mathematics
Balouek-Thomert et al. Parallel differential evolution approach for cloud workflow placements under simultaneous optimization of multiple objectives
Lyu et al. Fine-grained modeling and optimization for intelligent resource management in big data processing
Miao et al. Efficient flow-based scheduling for geo-distributed simulation tasks in collaborative edge and cloud environments
Mariani et al. DeSpErate++: An enhanced design space exploration framework using predictive simulation scheduling
CN113010296A (en) Task analysis and resource allocation method and system based on formalized model
Nematpour et al. Enhanced genetic algorithm with some heuristic principles for task graph scheduling
Tiwari et al. Identification of critical parameters for MapReduce energy efficiency using statistical Design of Experiments
Koch et al. SMiPE: estimating the progress of recurring iterative distributed dataflows
CN115270921A (en) Power load prediction method, system and storage medium based on combined prediction model
Li et al. Cluster resource adjustment based on an improved artificial fish swarm algorithm in Mesos
CN111522644B (en) Method for predicting running time of parallel program based on historical running data
Zhou et al. Automated HPC Workload Generation Combining Statistical Modeling and Autoregressive Analysis
Yan et al. The post-game analysis framework-developing resource management strategies for concurrent systems
Xue et al. A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant