CN110048886A - An efficient cloud configuration selection algorithm for big data analysis tasks - Google Patents
An efficient cloud configuration selection algorithm for big data analysis tasks
- Publication number
- CN110048886A (application number CN201910294273.4A)
- Authority
- CN
- China
- Prior art keywords
- cloud
- task
- data
- experiment
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Theoretical Computer Science (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention proposes an efficient cloud configuration selection algorithm for big data analysis tasks: small-scale cluster experiments are run on a selected portion of the input data to construct a performance prediction model, the prediction model is used to estimate the task's performance on a large-scale cluster, and the prediction results determine the optimal cloud configuration. With this algorithm, users can be helped to find the optimal cloud configuration effectively, at a lower model training time and cost. Selecting the optimal cloud configuration for a large-scale data analysis task to be deployed on a cloud computing platform can significantly improve its execution efficiency and reduce its operating cost.
Description
Technical field
The invention belongs to the field of cloud computing, and more particularly to an efficient cloud configuration selection algorithm for big data analysis tasks.
Background art
Large-scale data analysis tasks are growing rapidly, and the workloads involved are increasingly complex, frequently involving machine learning, natural language processing, and image processing. Compared with traditional computing tasks, these tasks are usually data-intensive and compute-intensive, requiring long computation times and incurring high computing costs. To complete large-scale data analysis tasks, the enormous computing power of cloud computing is therefore usually employed. Selecting the optimal cloud configuration for a large-scale analysis task improves the task's execution efficiency and reduces the user's computing cost.
To meet different computing requirements, existing cloud service providers offer users hundreds of instance types with different resource configurations (for example, Amazon EC2, Microsoft Azure, and Google Compute Engine). Although most cloud service providers only let users choose from a pool of available instance types, Google Compute Engine lets users custom-configure virtual machines (choosing vCPUs and memory), which makes selecting the right cloud configuration even more challenging. In addition, the major cloud service providers also offer serverless cloud frameworks (such as Amazon Lambda, Google Cloud Functions, and Microsoft Azure Functions), which allow users to run tasks as serverless functions without launching instances with pre-specified configurations. However, serverless frameworks may require applications to be restructured, and serverless cloud providers cannot help users minimize task completion time or reduce computing cost.
The choice of cloud configuration, that is, of the instance type and the number of instances, directly affects a task's completion time and economic cost. A correctly chosen cloud configuration can achieve the same performance target at lower cost. Because large-scale data analysis tasks run for a long time, uncovering this potential cost saving is all the more important. The diversity of tasks, combined with the many possible combinations of instance type and cluster scale, makes the search space of cloud configurations enormous.
In such a huge search space, exhaustive search for the best cloud configuration is neither practical nor scalable. To limit the search space, the CherryPick algorithm uses a limited amount of task information and selects the best cloud configuration on that basis. CherryPick optimizes for cost minimization, but it cannot be used to optimize other objectives, such as minimizing the completion time under a cost budget. In addition, Ernest and PARIS select cloud configurations by performance modeling. With such performance prediction models, users can choose different cloud configurations for different optimization objectives, for example the cheapest or the fastest configuration. However, Ernest needs to train a prediction model for every instance type, and PARIS only selects a preferred instance type across multiple public clouds and cannot suggest a cluster size.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes an efficient cloud configuration selection algorithm for big data analysis tasks.
The technical solution of the invention is an efficient cloud configuration selection algorithm for big data analysis tasks, comprising the following steps:
Step 1: training data collection phase, implemented as follows.
The training data collector runs experiments of a particular instance type on only a small fraction of the input data; these experiments are then used to predict the performance of the task on the entire input data. Training data collection consists of experiment selection and experiment execution.
Experiment selection: two important experiment parameters must be determined: (1) the ratio, i.e., the proportion of the total input data used by the experiment; (2) the number of cloud server instances used when executing the task. The present invention uses statistical techniques to select a subset of experiment parameters, preferring the parameters that yield as much information as possible for predicting the task's runtime performance, so as to guarantee higher prediction accuracy. Following D-optimality, the experiment parameters that maximize the weighted sum of the covariance (information) matrix are selected. Let E_i = (x_i, y_i) denote an experiment parameter setting, where x_i is the number of instances and y_i is the input data ratio, and let M denote the total number of experiment parameter settings obtained by enumerating all possible ratios and instance counts. From E_i a K-dimensional feature vector F_i can be computed, where each component corresponds to one term of the prediction model; in this way M feature vectors are obtained for all experiment setups. The selection then maximizes the weighted information matrix Σ_{i=1}^{M} α_i F_i F_i^T, subject to 0 ≤ α_i ≤ 1 for i ∈ [1, M] and the budget constraint Σ_{i=1}^{M} α_i (y_i / x_i) ≤ B, where α_i denotes the probability of selecting experiment setup i, B bounds the total cost of the experiments, and y_i / x_i is the cost of running experiment E_i under the cloud platform's pricing model. After solving this optimization problem, the M experiment setups are sorted by α_i in non-increasing order and the top-ranked data groups are selected as training data. In the present invention, the top 10 data groups are selected as training data.
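As an illustration of this experiment-selection step, the following minimal Python sketch can be considered. It is not the exact procedure of the invention: instead of solving for the selection probabilities α_i and sorting them, it uses a common greedy approximation of D-optimality that repeatedly adds the candidate setup giving the largest increase in the log-determinant of the information matrix; the budget value and the candidate fit terms are assumptions made for illustration only.

```python
import numpy as np
from itertools import product

def feature_vector(x, y):
    # Candidate fit terms (illustrative assumption): constant, y/x, sqrt(y)/x, log x, x
    return np.array([1.0, y / x, np.sqrt(y) / x, np.log(x), float(x)])

def select_experiments(ratios, instance_counts, budget, n_select=10, ridge=1e-6):
    """Greedy D-optimal selection: repeatedly add the candidate setup E_i = (x_i, y_i)
    that most increases log det(sum_i F_i F_i^T), while the accumulated cost proxy
    y_i / x_i of the chosen experiments stays within the budget B."""
    candidates = [(x, y) for x, y in product(instance_counts, ratios)]
    feats = [feature_vector(x, y) for x, y in candidates]
    info = ridge * np.eye(feats[0].size)          # regularized information matrix
    chosen, cost = [], 0.0
    while len(chosen) < n_select:
        best, best_gain = None, -np.inf
        for i, (x, y) in enumerate(candidates):
            if i in chosen or cost + y / x > budget:
                continue
            gain = np.linalg.slogdet(info + np.outer(feats[i], feats[i]))[1]
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:                          # budget exhausted or nothing left
            break
        chosen.append(best)
        info += np.outer(feats[best], feats[best])
        cost += candidates[best][1] / candidates[best][0]
    return [candidates[i] for i in chosen]

# Example: data ratios of 1%-8% and 1-8 instances, as in the embodiment below.
setups = select_experiments(np.arange(0.01, 0.09, 0.01), range(1, 9), budget=0.5)
```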
Experiment execution: after the experiment setups are selected, it must be determined which data samples from the entire input data set form the experimental data set, so that the specified ratio is met. The present invention selects data samples from the entire input data set by random sampling, because random sampling avoids falling into an isolated region of the data set. After the small data set is obtained, the specified number of instances is deployed according to the selected experiment setup and the task is run; the experiment parameters and the task completion time then serve as training data for constructing the prediction model.
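A sketch of this execution loop is given below; run_task(x, y) is a hypothetical helper (not part of the invention) that would deploy x instances, submit the job on the sampled y-fraction of the data, and block until completion.

```python
import time

def collect_training_data(setups, run_task):
    """For each selected setup (x instances, data ratio y), run the task on the
    sampled subset and record (x, y, completion time) as one training sample.
    run_task is a hypothetical helper supplied by the surrounding system."""
    training = []
    for x, y in setups:
        start = time.time()
        run_task(x, y)
        training.append((x, y, time.time() - start))
    return training
```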
Step 2: model construction phase, implemented as follows.
The model builder consists of a model constructor and a model transformer. Using the collected training data of the particular instance type, the model constructor establishes the basic prediction model. The model transformer then derives the prediction models of the remaining instance types from the basic prediction model.
Model constructor: for the experiments run on subsets of the input data set on the particular instance type, T_base(x, y) denotes the running time given x instances and a data set ratio y. Large-scale analysis tasks usually run in successive steps (i.e., iterations) until a termination condition is met, and each step mainly consists of two phases: parallel computation and data communication. The computation time of task execution is related to the data set size, and large-scale analysis tasks exhibit several representative communication patterns. The running time of a large-scale analysis task is therefore inferred by analyzing the computation time and the communication time. The main goal is to design fit terms involving x and y according to the task's computation and communication patterns, so as to obtain the performance prediction function T_base(x, y) of the given task.
Computation time: a user-defined iterative algorithm incurs a time cost for processing each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fit terms according to the characteristics of the data set (for example, dense or sparse) and of the algorithm. The computation time is thus a function of the number of instances and the data set scale.
Communication time: the time cost of transmitting data over the network to the destination node. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Although programming models and execution mechanisms differ, these common communication patterns cover most communication scenarios in cluster applications. The communication time is mainly a function of the number of instances, and the fit terms of this function can be inferred from the task's communication pattern. For example, when the data size per instance is constant, the communication time grows linearly with the instance count for the partition-aggregate communication pattern, but quadratically for the shuffle communication pattern.
Given all candidate fit terms of the function T_base(x, y), mutual information is used as the selection criterion for fit terms: redundant terms are excluded and only good predictors are kept. Let {f_1, ..., f_K} denote the set of all candidate terms, where each f_k is a function of x and y determined by the computation and communication patterns. For the m training data samples collected with different instance counts and data scales, the K-dimensional feature vector F_i = (f_{1,i}, ..., f_{K,i}) of each experiment setup is computed first, e.g., f_{k,i} = y_i / x_i. Then the mutual information between each term and the running time is computed, and the terms whose mutual information with the running time exceeds a threshold are selected. From the m training running time samples, the basic prediction model T_base(x, y) = Σ_{k=1}^{K} β_k w_k f_k(x, y) is obtained by fitting the values of w_k, where β_k indicates whether the fit term f_k has been selected (β_k = 1 means the term is selected).
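A minimal sketch of this fit-term screening, assuming scikit-learn is available for the mutual-information estimate; the concrete candidate terms and the threshold value are illustrative assumptions rather than values fixed by the invention (a constant term would normally be kept unconditionally and is therefore not screened here).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Candidate fit terms f_k(x, y) -- an illustrative set, not the definitive one.
CANDIDATE_TERMS = {
    "y_over_x":      lambda x, y: y / x,
    "sqrt_y_over_x": lambda x, y: np.sqrt(y) / x,
    "log_x":         lambda x, y: np.log(x),
    "x":             lambda x, y: float(x),
    "x_squared":     lambda x, y: float(x) ** 2,
}

def select_fit_terms(samples, runtimes, threshold=0.05):
    """samples: list of (instance count x_i, data ratio y_i) pairs;
    runtimes: the measured running times of the m experiments.
    Returns the names of the terms whose estimated mutual information
    with the running time exceeds the threshold (i.e. beta_k = 1)."""
    names = list(CANDIDATE_TERMS)
    F = np.array([[CANDIDATE_TERMS[n](x, y) for n in names] for x, y in samples])
    mi = mutual_info_regression(F, np.asarray(runtimes, dtype=float))
    return [n for n, score in zip(names, mi) if score > threshold]
```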
Model transformer: cloud providers usually offer various instance families with different combinations of CPU, memory, disk, and network capacity to meet the needs of different workloads, such as general purpose and compute/memory/storage optimized. Extensive experiments show that, for a given task and a fixed data set, the running time on one instance type can be converted to that on a different instance type by a simple mapping. There is therefore no need to run experiments on every instance type to collect training data and build a prediction model, which greatly reduces the training time and training cost.
The converter Φ is a mapping φ from the basic prediction model to the target prediction model: T_base(x, y) → T_target(x, y). Comparing the running times of different instance types under the same task and data set scale shows that the categories of fit terms in the prediction functions are similar. In other words, under the same task and data set scale, if f_k is included in T_base(x, y), then T_target(x, y) is very likely to include f_k as well. This is mainly because, for the same application configuration and instance count, the computation and communication patterns of the task remain essentially unchanged. However, the weight of each term differs across instance types, so attention must be paid to how the weights are mapped from the basic prediction model to the target prediction model. The present invention uses a simple and effective mapping method. Let the experiment setup with the minimum cost among those selected by the training data collector have running time t_base; the same experiment is run on the target instance type to obtain the running time t_target. The model transformer then outputs the prediction model of the target instance type by scaling the basic model with the ratio of the two measured running times, i.e., T_target(x, y) = (t_target / t_base) · T_base(x, y).
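Under the ratio-scaling mapping described above, the model transformer reduces to a few lines; this is a minimal sketch, and the function names are illustrative.

```python
def transform_model(t_base_predictor, t_base, t_target):
    """Derive the target instance type's predictor from the basic one by scaling
    it with the ratio of the two measured running times of the same experiment."""
    scale = t_target / t_base
    return lambda x, y: scale * t_base_predictor(x, y)

# Example: base model fitted on m4.large, one extra run of the cheapest
# experiment on c5.large took 95 s versus 120 s on the base type (assumed values).
# T_c5 = transform_model(T_m4, t_base=120.0, t_target=95.0)
```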
Step 3: selector construction phase, implemented as follows.
The running time prediction models of all instance types are integrated into a single running time predictor T(x, y), where x is the cloud configuration vector consisting of the instance type and the instance count. For a given input data set of the task, the goal is to let users find the best cloud configuration that satisfies specific running time and cost constraints. Let P(x) be the unit-time price of cloud configuration x, i.e., the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can then be expressed as x* = S(T(x, y), C(x), R(y)), where C(x) = P(x) × T(x, y) and 0 ≤ y ≤ 1. Here C(x) is the cost of running the task with cloud configuration x, and R(y) is a constraint added by the user, such as a maximum tolerable running time or a maximum tolerable cost. The selector S(·) is specified by the user and selects the best cloud configuration x* that meets the expected performance or cost.
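A minimal sketch of the selector stage: it enumerates candidate cloud configurations (instance type, instance count), predicts the running time with the per-type models, derives the cost C(x) = P(x) × T(x, y) from an hourly price table, and returns the cheapest configuration satisfying the user constraint. The candidate instance counts and the example prices in the usage comment are assumptions for illustration (the prices match Table 1 below).

```python
def select_configuration(predictors, hourly_price, counts, y=1.0, max_time=None):
    """predictors: {instance_type: T_type(x, y)} returning predicted seconds;
    hourly_price: {instance_type: price per instance-hour}; counts: candidate
    instance counts. Returns the cheapest configuration whose predicted running
    time satisfies the constraint R(y) -- one simple instance of the selector S."""
    best, best_cost = None, float("inf")
    for itype, predict in predictors.items():
        for x in counts:
            t_sec = predict(x, y)
            if max_time is not None and t_sec > max_time:
                continue                                      # violates R(y)
            cost = hourly_price[itype] * x * t_sec / 3600.0   # C(x) = P(x) * T(x, y)
            if cost < best_cost:
                best, best_cost = (itype, x), cost
    return best, best_cost

# Example with two (assumed) fitted predictors and the prices of Table 1:
# best, cost = select_configuration(
#     {"m4.large": T_m4, "c5.large": T_c5},
#     {"m4.large": 0.1, "c5.large": 0.085},
#     counts=range(1, 17), max_time=3600)
```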
Brief description of the drawings
Fig. 1 illustrates the communication patterns considered by the invention.
Fig. 2 is the overall design structure diagram of the invention.
Fig. 3 is the effectiveness comparison diagram of the invention.
Fig. 4 shows the prediction accuracy of the invention on Spark.
Fig. 5 shows the prediction accuracy of the invention on Hadoop.
Fig. 6 compares the total task time and the model training time of the invention.
Fig. 7 shows the prediction accuracy of the invention for TeraSort under different data set sizes.
Fig. 8 shows the cost of WordCount on different instance types.
Fig. 9 shows the completion times of TeraSort and WordCount at different cluster scales.
Specific embodiment
Based mainly on the computation patterns and communication patterns of big data analysis tasks, the present invention proposes an efficient cloud configuration selection framework for big data analysis tasks, which lets users find a cloud configuration suited to a given big data analysis task and thereby greatly reduces the computing cost of large-scale data analysis tasks. The framework builds its prediction models with only a small number of experiments, using little input data and small-scale clusters, and it can convert the prediction model of one instance type into the prediction model of another instance type with only a few additional experiments. With the cloud configuration selection framework of the invention, cloud computing users can determine the best cloud configuration at a lower cost.
Referring to Fig. 2, the embodiment elaborates the process of the invention by taking as an example a cloud configuration selection algorithm for big data analysis tasks (named Silhouette) implemented on Amazon Web Services (AWS), as follows:
Step 1: training data collection phase, implemented as follows.
The training data collector runs experiments of a particular instance type on only a small fraction of the input data; these experiments are then used to predict the performance of the task on the entire input data. Training data collection consists of experiment selection and experiment execution.
Experiment selection: two important experiment parameters must be determined: (1) the ratio, i.e., the proportion of the total input data used by the experiment; (2) the number of cloud server instances used when executing the task. In this embodiment, statistical techniques are used to select a subset of experiment parameters, preferring the parameters that yield as much information as possible for predicting the task's runtime performance, so as to guarantee higher prediction accuracy. Following D-optimality, the experiment parameters that maximize the weighted sum of the covariance (information) matrix are selected. Let E_i = (x_i, y_i) denote an experiment parameter setting, where x_i is the number of instances and y_i is the input data ratio, and let M denote the total number of experiment parameter settings obtained by enumerating all possible ratios and instance counts. From E_i a K-dimensional feature vector F_i can be computed, where each component corresponds to one term of the prediction model; in this way M feature vectors are obtained for all experiment setups. The selection then maximizes the weighted information matrix Σ_{i=1}^{M} α_i F_i F_i^T, subject to 0 ≤ α_i ≤ 1 for i ∈ [1, M] and the budget constraint Σ_{i=1}^{M} α_i (y_i / x_i) ≤ B, where α_i denotes the probability of selecting experiment setup i, B bounds the total cost of the experiments, and y_i / x_i is the cost of running experiment E_i under the cloud platform's pricing model. After solving this optimization problem, the M experiment setups are sorted by α_i in non-increasing order and the top-ranked setups are chosen as the experiments to run.
Experiment execution: after the experiment setups are selected, it must be determined which data samples from the entire input data set form the experimental data set, so that the specified ratio is met. The present invention selects data samples from the entire input data set by random sampling, because random sampling avoids falling into an isolated region of the data set. After the small data set is obtained, the specified number of instances is deployed according to the selected experiment setup and the task is run; the experiment parameters and the task completion time then serve as training data for constructing the prediction model.
The specific implementation process of the embodiment is described as follows:
The large-scale data analysis engines used in the embodiment are Spark and Hadoop. On Spark, three kinds of machine learning tasks based on Spark ML are run: classification, regression, and clustering. The classification algorithm uses the text classification benchmark data set rcv1 with 44,000 features, while the regression and clustering algorithms use a synthetic data set with 1,000,000 samples and 44,000 features. On Hadoop, the TeraSort and WordCount algorithms are run separately. TeraSort is a common benchmark application for large-scale data analysis whose main work is to sort randomly generated records; it uses a data set of 200,000,000 samples. WordCount is used to compute word frequencies over 55,000,000 entries of Wikipedia articles.
From the AWS EC2 instance type pool, m4.large (general purpose), c5.large (compute optimized), r4.large (memory optimized), and i3.large (storage optimized) are selected; every instance type has 2 vCPUs and a pre-installed Linux system. The data analysis engines used in the experiments are Apache Spark 2.2 and Hadoop 2.8. Table 1 lists the configuration and price of each instance type.
Table 1
Instance type | Memory (GiB) | Instance storage | Price (USD/hour) |
---|---|---|---|
m4.large | 8 | EBS | 0.1 |
c5.large | 4 | EBS | 0.085 |
r4.large | 15.25 | EBS | 0.133 |
i3.large | 15.25 | SSD | 0.156 |
First, the data scale for the modeling experiments is set to 1% to 8% of the input data, and the experiment cluster scale is limited to 1 to 8 instances. In the embodiment, the 10 experiments with the highest probabilities α_i are run. When selecting input data samples, a starting seed sample is randomly selected from the input data set; then, in each sampling step, an output sample is randomly obtained; this process is repeated until the number of selected samples meets the scale requirement of the experiment parameters. In the embodiment, m4.large is used as the basic instance type, so the randomly sampled data set is run on an m4.large cluster of the size given in the experiment parameters, and the running time is recorded.
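The sampling step can be sketched as follows; this is plain uniform sampling without replacement up to the experiment's data ratio, a simplification of the seed-then-step procedure described above.

```python
import numpy as np

def sample_experiment_dataset(dataset, ratio, seed=None):
    """Randomly draw a fraction `ratio` of the input data set for one
    experiment setup (uniform sampling without replacement)."""
    rng = np.random.default_rng(seed)
    n = max(1, int(len(dataset) * ratio))
    idx = rng.choice(len(dataset), size=n, replace=False)
    return [dataset[i] for i in idx]

# e.g. the 4% sample for an experiment setup with y = 0.04
subset = sample_experiment_dataset(list(range(1_000_000)), ratio=0.04)
```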
Step 2: model construction phase, implemented as follows.
The model builder consists of a model constructor and a model transformer. Using the collected training data of the particular instance type, the model constructor establishes the basic prediction model. The model transformer then derives the prediction models of the remaining instance types from the basic prediction model.
Model constructor: for the experiments run on subsets of the input data set on the particular instance type, T_base(x, y) denotes the running time given x instances and a data set ratio y. Large-scale analysis tasks usually run in successive steps (i.e., iterations) until a termination condition is met, and each step mainly consists of two phases: parallel computation and data communication. The computation time of task execution is related to the data set size, and large-scale analysis tasks exhibit several representative communication patterns. The running time of a large-scale analysis task can therefore be inferred by analyzing the computation time and the communication time. The main goal of this embodiment is to design fit terms involving x and y according to the task's computation and communication patterns, so as to obtain the performance prediction function T_base(x, y) of the given task.
Computation time: a user-defined iterative algorithm incurs a time cost for processing each sample of the input data. For large-scale data processing tasks in a cluster computing environment, the computation time can be approximated by several different fit terms according to the characteristics of the data set (for example, dense or sparse) and of the algorithm. The computation time is thus a function of the number of instances and the data set scale. Determining the exact fit terms of the function requires domain-specific knowledge.
Communication time: the time cost of transmitting data over the network to the destination node. Fig. 1 abstracts the representative communication patterns in large-scale data analysis tasks. Although programming models and execution mechanisms differ, these common communication patterns cover most communication scenarios in cluster applications. The communication time is mainly a function of the number of instances, and the fit terms of this function can be inferred from the task's communication pattern. For example, when the data size per instance is constant, the communication time grows linearly with the instance count for the partition-aggregate communication pattern, but quadratically for the shuffle communication pattern.
Given all candidate fit terms of the function T_base(x, y), mutual information is used as the selection criterion for fit terms: redundant terms are excluded and only good predictors are kept. Let {f_1, ..., f_K} denote the set of all candidate terms, where each f_k is a function of x and y determined by the computation and communication patterns. For the m training data samples collected with different instance counts and data scales, the K-dimensional feature vector F_i = (f_{1,i}, ..., f_{K,i}) of each experiment setup is computed first, e.g., f_{k,i} = y_i / x_i. Then the mutual information between each term and the running time is computed, and the terms whose mutual information with the running time exceeds a threshold are selected. From the m training running time samples, the basic prediction model T_base(x, y) = Σ_{k=1}^{K} β_k w_k f_k(x, y) is obtained by fitting the values of w_k, where β_k indicates whether the fit term f_k has been selected (β_k = 1 means the term is selected).
Model transformer: cloud providers usually offer various instance families with different combinations of CPU, memory, disk, and network capacity to meet the needs of different workloads, such as general purpose and compute/memory/storage optimized. Extensive experiments show that, for a given task and a fixed data set, the running time on one instance type can be converted to that on a different instance type by a simple mapping. There is therefore no need to run experiments on every instance type to collect training data and build a prediction model, which greatly reduces the training time and training cost.
The converter Φ is a mapping φ from the basic prediction model to the target prediction model: T_base(x, y) → T_target(x, y). Comparing the running times of different instance types under the same task and data set scale shows that the categories of fit terms in the prediction functions are similar. In other words, under the same task and data set scale, if f_k is included in T_base(x, y), then T_target(x, y) is very likely to include f_k as well. This is mainly because, for the same application configuration and instance count, the computation and communication patterns of the task remain essentially unchanged. However, the weight of each term differs across instance types, so attention must be paid to how the weights are mapped from the basic prediction model to the target prediction model. A simple and effective mapping method is used. Let the experiment setup with the minimum cost among those selected by the training data collector have running time t_base; the same experiment is run on the target instance type to obtain the running time t_target. The model transformer then outputs the prediction model of the target instance type by scaling the basic model with the ratio of the two measured running times, i.e., T_target(x, y) = (t_target / t_base) · T_base(x, y).
The specific implementation of the embodiment is as follows:
In the embodiment, the fit terms added to the prediction function are: a constant term, the linear term y/x, and a term combining the square root of the data scale with the instance count, √y / x. The constant term represents the time spent in serial computation; for algorithms whose computation time is linear in the data set size, the fit term y/x of the data ratio over the instance count is added; for sparse data sets, the √y / x fit term of the square root of the data scale over the instance count is added.
Table 2
Communication pattern | Structure | Fit term |
---|---|---|
Parallel read/write | Many one-to-one | x |
Partition-aggregate | Many-to-one | log x |
Broadcast | One-to-many | x |
Collect | Many-to-one | x |
Shuffle | Many-to-many | x² |
Global communication | All-to-all | x² |
In the embodiment, according to the communication patterns of the different tasks, the communication fit terms shown in Table 2 are used, namely x, log x, and x². After all terms are selected, the basic prediction model is computed with a non-negative least squares (NNLS) solver. Then the experiment setup with the lowest cost among the basic experiments is selected, and the task is run on the target instance type with the same experiment setup. Finally, the prediction model of each instance type is output as T_target(x, y) = (t_target / t_base) · T_base(x, y).
Step 3: selector construction phase, implemented as follows.
The running time prediction models of all instance types are integrated into a single running time predictor T(x, y), where x is the cloud configuration vector consisting of the instance type and the instance count. For a given input data set of the task, the goal is to let users find the best cloud configuration that satisfies specific running time and cost constraints. Let P(x) be the unit-time price of cloud configuration x, i.e., the unit price of the instance type multiplied by the number of instances. The optimal cloud configuration selection problem can then be expressed as x* = S(T(x, y), C(x), R(y)), where C(x) = P(x) × T(x, y) and 0 ≤ y ≤ 1. Here C(x) is the cost of running the task with cloud configuration x, and R(y) is a constraint added by the user, such as a maximum tolerable running time or a maximum tolerable cost. The selector S(·) is specified by the user and selects the best cloud configuration x* that meets the expected performance or cost.
The specific implementation process of the embodiment is described as follows:
After the prediction models of all candidate instance types have been obtained for the task, the optimal cloud configuration scheme with the lowest operating cost must be found; the chosen cloud configuration must allow the task to be completed in the shortest time under the given cost budget. The embodiment evaluates the algorithm with four kinds of test results: effectiveness, prediction accuracy, training cost, and application scalability.
Effectiveness: the performance of SILHOUETTE and Ernest is compared on 5 tasks. Fig. 3(a) shows that the prediction accuracy of SILHOUETTE is comparable to that of Ernest, and Fig. 3(b) shows that the training time and training cost of SILHOUETTE are far below those of Ernest. When prediction models are built for 2 instance types, SILHOUETTE saves 25% of the training time and 30% of the cost. Fig. 3(c) shows that with more candidate instance types the training time and training cost of SILHOUETTE are much lower than those of Ernest: with 5 candidate instance types, the training times of SILHOUETTE and Ernest are 25 minutes and 83 minutes, respectively. With even more candidate instance types, SILHOUETTE can be expected to perform outstandingly.
Prediction accuracy: Figs. 4 and 5 show that the m4.large basic prediction model and the transformed c5.large prediction model both achieve high prediction accuracy, which demonstrates the effectiveness of the model transformer in SILHOUETTE.
Training cost: SILHOUETTE aims to find the optimal cloud configuration with low overhead. The completion time of the entire task is therefore compared with the time spent collecting training data for the basic prediction model. Fig. 6 shows that, except for TeraSort, the training time of SILHOUETTE is below 20% of the total completion time for all applications.
Application scalability: on data sets of different sizes, SILHOUETTE builds the basic and transformed prediction models with the same experiment setups, and their prediction accuracy is assessed. Fig. 7 shows that at 1.5, 2, 2.5, and 3 times the data set size the prediction error stays below 15%, which shows that the prediction models built by SILHOUETTE remain accurate even when the size of the data set changes.
In the present embodiment, SILHOUETTE is used to select the best cloud configuration for WordCount. Consider the four instance types in Table 1, and assume the selector's optimization goal is: given a maximum task completion time, minimize the total cost. Fig. 8 shows, for each instance type, the total time and total cost of running the task on the entire data set. It can be observed that the total time of the compute-optimized instance type c5.large is comparable to that of the storage-optimized instance type i3.large; SILHOUETTE selects the former, which has the lower cost.
SILHOUETTE can then be used to determine the preferred instance count for a given instance type. Consider two tasks, TeraSort and WordCount. Fig. 9 gives the running times of the two tasks at different cluster sizes; the running time predicted by SILHOUETTE is very close to the actual running time, so a specific cluster scale can be selected accordingly.
The specific embodiments described herein are merely examples illustrating the spirit of the invention. Those skilled in the art to which the invention belongs can make various modifications or additions to the described embodiments or substitute them in similar ways without departing from the spirit of the invention or exceeding the scope defined by the appended claims.
Claims (6)
1. An efficient cloud configuration selection algorithm for big data analysis tasks, characterized by comprising the following steps:
Step 1: training data collection: choose multiple experiment groups, each consisting of a certain proportion of the input data and the corresponding number of cloud server instances used when executing the task on it, and determine the experiment parameters and task completion time of every group, wherein the certain proportion refers to the proportion of the total input data that the experiment data account for;
Step 2: model construction: using the experiment parameters and task completion times from step 1, together with the input data ratio and the instance count, design a fitting polynomial involving the input data ratio and the instance count and determine the basic prediction model T_base(x, y) = Σ_k β_k w_k f_k(x, y) by fitting the values of w_k, where β_k indicates whether the fit term f_k has been selected (β_k = 1 indicates the term is selected);
Model conversion: run the least time-consuming experiment setup from step 1 on the target instance type to obtain its running time t_target; by way of mapping, output the prediction model of the target instance type as T_target(x, y) = (t_target / t_base) · T_base(x, y), where t_base is the running time of the same experiment setup on the basic instance type;
Step 3: selector construction:
for the given input data set of the task, use the prediction models obtained in step 2 to compute the best cloud configuration that satisfies the specific running time and cost constraints.
2. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that:
the detailed process in step 1 of choosing the numbers of cloud server instances used when executing the task on the multiple proportions of the input data is as follows:
first choose input data within a certain proportion range and cloud server instance counts within a certain range; following D-optimality, select in the experiment parameter selection the experiment parameters that maximize the weighted sum of the covariance (information) matrix Σ_{i=1}^{M} α_i F_i F_i^T, subject to 0 ≤ α_i ≤ 1, i ∈ [1, M], and Σ_{i=1}^{M} α_i (y_i / x_i) ≤ B, where α_i denotes the probability of selecting experiment setup i, x_i is the instance count, y_i is the input data ratio, and M denotes the total number of experiment parameter settings obtained by enumerating all possible ratios and instance counts;
the total cost of the experiments is expressed by adding the budget constraint term B, where y_i / x_i is the cost of running experiment E_i under the cloud platform's pricing model;
sort the M experiment setups by α_i in non-increasing order, and select the top-ranked experiment parameter groups as training data.
3. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that: the data groups ranked in the first 10 positions of the non-increasing ordering are selected as training data.
4. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 2, characterized in that: the input data within the certain proportion range in step 1 is specifically 1% to 10% of the data, and the cloud server instance count within the certain range is 1 to 10 instances.
5. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that: the certain proportion of input data described in step 1 is chosen from the entire input data set by random sampling.
6. The efficient cloud configuration selection algorithm for big data analysis tasks according to claim 1, characterized in that: the fit terms in the model construction involve the computation time and the communication time.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910294273.4A CN110048886B (en) | 2019-04-12 | 2019-04-12 | Efficient cloud configuration selection algorithm for big data analysis task |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910294273.4A CN110048886B (en) | 2019-04-12 | 2019-04-12 | Efficient cloud configuration selection algorithm for big data analysis task |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110048886A true CN110048886A (en) | 2019-07-23 |
CN110048886B CN110048886B (en) | 2020-05-12 |
Family
ID=67277094
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910294273.4A Active CN110048886B (en) | 2019-04-12 | 2019-04-12 | Efficient cloud configuration selection algorithm for big data analysis task |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110048886B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113301067A (en) * | 2020-04-01 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Cloud configuration recommendation method and device for machine learning application |
CN114996228A (en) * | 2022-06-01 | 2022-09-02 | 南京大学 | Server-unaware-oriented data transmission cost optimization method |
CN115118592A (en) * | 2022-06-15 | 2022-09-27 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103220337A (en) * | 2013-03-22 | 2013-07-24 | 合肥工业大学 | Cloud computing resource optimizing collocation method based on self-adaptation elastic control |
US20180285903A1 (en) * | 2014-04-04 | 2018-10-04 | International Business Machines Corporation | Network demand forecasting |
CN108053026A (en) * | 2017-12-08 | 2018-05-18 | 武汉大学 | A kind of mobile application background request adaptive scheduling algorithm |
CN109088747A (en) * | 2018-07-10 | 2018-12-25 | 郑州云海信息技术有限公司 | The management method and device of resource in cloud computing system |
Non-Patent Citations (2)
Title |
---|
CHEN-CHUN CHEN et al.: "Using Deep Learning to Predict and Optimize Hadoop Data Analytic Service in a Cloud Platform", 2017 IEEE 15th Intl Conf on Dependable, Autonomic and Secure Computing, 15th Intl Conf on Pervasive Intelligence and Computing, 3rd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress *
ZHENG WANBO: "Research on Quality-of-Service Prediction and Optimized Scheduling of Cloud Computing Systems in Low-Reliability Environments", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113301067A (en) * | 2020-04-01 | 2021-08-24 | 阿里巴巴集团控股有限公司 | Cloud configuration recommendation method and device for machine learning application |
CN114996228A (en) * | 2022-06-01 | 2022-09-02 | 南京大学 | Server-unaware-oriented data transmission cost optimization method |
CN115118592A (en) * | 2022-06-15 | 2022-09-27 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator characteristic analysis |
CN115118592B (en) * | 2022-06-15 | 2023-08-08 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on operator feature analysis |
Also Published As
Publication number | Publication date |
---|---|
CN110048886B (en) | 2020-05-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||