CN113627664A - Run-time prediction system and method for graph-oriented iterative operation in Gaia system - Google Patents

Run-time prediction system and method for graph-oriented iterative operation in Gaia system

Info

Publication number
CN113627664A
Authority
CN
China
Prior art keywords
iteration
graph
similarity
job
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110890134.5A
Other languages
Chinese (zh)
Inventor
岳晓飞
王国仁
赵宇海
郑军
李博扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT
Priority to CN202110890134.5A
Publication of CN113627664A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Computational Linguistics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Development Economics (AREA)
  • Evolutionary Computation (AREA)
  • Game Theory and Decision Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Security & Cryptography (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)

Abstract

The invention discloses a runtime prediction system and method for graph iteration jobs in the Gaia system. Before a job is executed, the method quickly captures the offline features of the current graph iteration algorithm through sampling execution; these offline features include the convergence features and the key input features of each iteration. During job execution, it continuously captures runtime features, including job parameters, resource utilization, and detailed statistics. The similarity between jobs serves as the basis for job matching and for calculating the final predicted value; it consists mainly of the static similarity captured by sampling execution and the dynamic similarity captured by real execution. Using established similarity evaluation criteria, the matching algorithm trains its own parameters so that matching adapts automatically to the various similarities of iteration jobs. The invention is an end-to-end runtime prediction method that combines the offline features and the runtime features of graph iteration jobs and can accurately predict the runtime of distributed graph iteration jobs at low training cost.

Description

Run-time prediction system and method for graph-oriented iterative operation in Gaia system
Technical Field
The invention relates to the technical field of distributed big data computing, and in particular to a runtime prediction system for graph iteration jobs in the Gaia system.
Background
The Gaia system is an efficient, scalable, new-generation big data computing system in which multiple computing models coexist. It addresses a series of key technical problems at several core layers of big data analysis systems, such as adaptive and scalable big data storage, big data computing that fuses batch and stream processing, high-dimensional large-scale machine learning, and highly time-efficient intelligent interactive guidance for big data. It thereby builds a new generation of independently controllable, time-efficient, and scalable big data analysis systems and masters internationally leading core technology for such systems.
Iterative computation, in which one or more steps are repeated until a convergence criterion is met, is typically the core of machine learning and graph processing algorithms and is common in applications such as recommendation systems and data mining. Moreover, with the rapid development of Internet technology, the data sets in all kinds of scenarios keep growing, so using a distributed computing framework for iterative computation has become very common in industry and science. Distributed computing systems such as Gaia allow users to run iterative computations on computer clusters to analyze large-scale graph data sets. The Gaia system supports iterative computation by defining a step function and embedding it into a special iteration operator, and it further supports incremental iteration. To preserve the DAG-based runtime and scheduler, Gaia constructs an iteration channel from iteration "head" and "tail" tasks connected by feedback edges; FIG. 1 shows the structure of a Gaia iteration operator.
Most graph iteration jobs running on a distributed computing framework must comply with service level objectives, because external applications generally need analysis results in time. For example, Twitter explicitly requires that searches and queries over TB-scale log data indexes complete within ten minutes; if the latency is too high, the service level agreement with the user is violated. In most cases, users over-provision resources for graph iteration jobs, which leads to low resource utilization across the cluster and high cluster operation and maintenance costs. In the worst case, a job cannot be guaranteed to finish within the required time limit even if all remaining resources are used, and graph iteration jobs are often executed repeatedly.
Although the Gaia system achieves very high iterative computation efficiency compared with other distributed frameworks, it does not use the time limit of a graph iteration job as a basis for resource scheduling. Therefore, to meet the time limits of graph iteration jobs and to optimize their resource provisioning in time, predicting the runtime of graph iteration jobs on a distributed computing framework is a problem that needs to be solved.
However, it is difficult to estimate the runtime or resource requirements of a distributed graph iteration job before execution, because its runtime depends on many factors, including the logic of user-defined functions, program parameters, the input data set, and the system configuration. This information is even harder to obtain for the incremental iterative computation supported by the Gaia system, where each iteration depends on the result of the previous one. Traditional methods simply use historical job execution data for runtime prediction. Some train a performance model that predicts the runtime of a job under a given resource configuration by sampling the data set and running the job on a small number of servers to collect data; others perform sampling execution directly and extract the input features of each iteration for runtime prediction. Although these methods are fairly effective, they are limited to prediction on a cluster with a fixed configuration, and the data they obtain are offline data or coarse statistics, so their prediction accuracy is low.
Disclosure of Invention
In view of this, the present invention provides a runtime prediction system for graph iteration jobs in the Gaia system, which can predict the runtime of a graph iteration job without being limited to a cluster with a fixed configuration and achieves higher prediction accuracy.
To achieve this purpose, the technical solution of the invention is as follows: a runtime prediction system for graph iteration jobs in the Gaia system includes a similarity management module, a job matching module, and a runtime prediction module.
The similarity management module comprises a sampling execution unit, a static similarity calculation unit, and a dynamic similarity calculation unit.
The sampling execution unit samples the graph data set using a biased random jump method to obtain sample graph data, and executes the iterative algorithm on the sample graph data set to capture the offline features of the sample graph iteration job.
The static similarity calculation unit calculates the static similarity between the current graph iteration job and the historical graph iteration jobs from the offline features captured by the sampling execution unit.
The dynamic similarity calculation unit calculates the dynamic similarity between the current graph iteration job and the historical graph iteration jobs from the statistics captured during the real execution of the graph iteration job.
The similarity management module sends the calculated static similarity and dynamic similarity to the job matching module.
The job matching module comprises a similarity evaluation unit, a training point collection unit, and a parameter training unit.
The similarity evaluation unit constructs a similarity evaluation function, evaluates the effectiveness of the static similarity and the dynamic similarity sent by the similarity management module, selects the matching historical graph iteration jobs according to the evaluation result, and sends them to the runtime prediction module.
The training point collection unit collects the training data points used to train the parameters of the similarity evaluation function.
The parameter training unit uses the training data points to train the parameters of the similarity evaluation function and sends the parameters to the similarity evaluation unit.
The runtime prediction module predicts the remaining runtime of the current graph iteration job from the historical graph iteration jobs matched by the job matching module.
Further, the static similarity calculation unit calculates the static similarity between the current graph iteration job $Job_{run}$ and the historical graph iteration job $Job_{comp}$, which includes the static computation similarity and the static message-passing similarity.
The static computation similarity is:
Figure BDA0003195634930000041
where $S_1$ is the static computation similarity between the current graph iteration job $Job_{run}$ and the historical graph iteration job $Job_{comp}$. The proportion of active nodes of a graph iteration job $Job$ in the $i$-th iteration is
Figure BDA0003195634930000042
where $ActVert_i(Job)$ is the number of active nodes of $Job$ in the $i$-th iteration, $TotVert_i(Job)$ is the total number of nodes of $Job$ in the $i$-th iteration, and $N$ is the total number of iterations in the sampling execution. Here $Job$ stands for either the current graph iteration job $Job_{run}$ or the historical graph iteration job $Job_{comp}$.
The static message-passing similarity is
Figure BDA0003195634930000043
where $S_2$ is the static message-passing similarity between the current graph iteration job $Job_{run}$ and the historical graph iteration job $Job_{comp}$. The total volume of messages passed by $Job$ in the $i$-th iteration is
Figure BDA0003195634930000044
where $RemMsg_i(Job)$ is the number of messages passed by $Job$ in the $i$-th iteration,
Figure BDA0003195634930000045
is the average message size of the graph iteration job $Job$ in the $i$-th iteration, and $N$ is the total number of iterations in the sampling execution.
Further, the dynamic similarity calculation unit calculates the dynamic similarity between the current graph iteration job $Job_{run}$ and the historical graph iteration job $Job_{comp}$, which includes the runtime similarity $S_3$, the resource allocation similarity $S_4$, the convergence similarity $S_5$, and the resource utilization similarity $S_6$.
The runtime similarity $S_3$ is
Figure BDA0003195634930000046
where $Step$ is the current iteration number of the job and $runtime(Job, i)$ is the runtime of the graph iteration job $Job$ in the $i$-th iteration. Here $Job$ stands for either the current graph iteration job $Job_{run}$ or the historical graph iteration job $Job_{comp}$.
The resource allocation similarity $S_4$ is:
Figure BDA0003195634930000051
where $Step$ is the current iteration number of the job and $cons_{avg}(Job, Step)$ is the average number of containers used by the graph iteration job $Job$ over all iterations from the start of execution up to the current iteration. Here $Job$ stands for either the current graph iteration job $Job_{run}$ or the historical graph iteration job $Job_{comp}$.
The convergence similarity $S_5$ is:
Figure BDA0003195634930000052
where $workSet(Job, i)$ is the number of data items actively processed by $Job$ in the $i$-th iteration.
The resource utilization similarity $S_6$ is
Figure BDA0003195634930000053
where $cons$ is the number of containers used by the graph iteration job, and $RsUtil_{Job}[hw, Step, i]$ is the average utilization of hardware device $hw$ in container $i$ when the graph iteration job has executed up to the $Step$-th iteration.
Further, the similarity evaluation function constructed by the similarity evaluation unit consists of the average optimal accuracy function $h(t)$ and the optimal similarity ratio function $n(t)$:
$h(t) := \mathrm{avg}\{acc \mid (DoS, acc) \in P,\ DoS \ge t\}$;
$n(t) := |\{acc \mid (DoS, acc) \in P,\ DoS \ge t\}| \,/\, |P|$;
where $acc$ is the prediction accuracy of the runtime prediction system; $DoS$ is the overall similarity of two graph iteration jobs, computed as the weighted average of the individual similarities, whose weights form a weight vector; $t$ is the similarity threshold; and $P$ is the set of simulation points $(DoS, acc)$. The weight vector and the similarity threshold are the parameters of the similarity evaluation function.
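The two evaluation functions translate almost directly into code. The following is a minimal sketch under stated assumptions: the function names are hypothetical, $P$ is represented as a list of `(DoS, acc)` pairs, the weights are normalized by their sum, and `h` returns 0.0 when no point reaches the threshold (the text does not define that case).

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (DoS, acc)


def overall_similarity(sims: Sequence[float], weights: Sequence[float]) -> float:
    """DoS: weighted average of the individual similarities (e.g. S1..S6)."""
    if len(sims) != len(weights):
        raise ValueError("sims and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(sims, weights)) / total


def h(points: List[Point], t: float) -> float:
    """Average accuracy over the points whose DoS is at least t."""
    accs = [acc for dos, acc in points if dos >= t]
    # Assumption: return 0.0 when no point qualifies.
    return sum(accs) / len(accs) if accs else 0.0


def n(points: List[Point], t: float) -> float:
    """Fraction of the points whose DoS is at least t."""
    if not points:
        return 0.0
    return sum(1 for dos, _ in points if dos >= t) / len(points)
```

Raising $t$ typically increases `h` (only close matches remain) while decreasing `n` (fewer matches remain), which is why both values are checked against thresholds during parameter training.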
Further, the training point collection unit first performs $n$ ordered draws without repetition from the set of historical graph iteration jobs, drawing two graph iteration jobs each time: a first graph iteration job $J_i$ and a second graph iteration job $K_i$, $i = 1, 2, 3, \ldots, n$.
The extracted first graph iteration job $J_i$ is treated as a running job that has just finished its $m$-th iteration, $m \in [1, m_i]$, where $m_i$ is the maximum number of iterations of $J_i$.
The extracted second graph iteration job $K_i$ is the historical graph iteration job matched with $J_i$. Based on the similarity between the two extracted graph iteration jobs, the runtime prediction module predicts the remaining runtime of the running graph iteration job $J_i$, whose true remaining runtime is known. By using different current graph iteration jobs and matched historical graph iteration jobs, and by assuming that the current graph iteration job is at different iteration numbers, a set $P(w)$ of elements of the form $(predict(w), actual)$ is calculated as the training data points, where $predict(w)$ is the remaining execution time predicted with $w$ as the weight vector and $actual$ is the true remaining execution time of the current graph iteration job.
The parameter training unit finds, for the currently running graph iteration job, a suitable similarity threshold $t$ and similarity weight vector $w = (w_1, \ldots, w_6)$ such that the values of the similarity evaluation functions $h(t)$ and $n(t)$ in the similarity evaluation unit exceed set thresholds, which ensures that the number of matching results for the current graph iteration job exceeds the set threshold.
Further, the runtime prediction module includes a single prediction unit and a final prediction unit. The single prediction unit predicts the runtime of the current graph iteration job from each matched historical job individually, obtains a series of single prediction estimates, collects them into a prediction estimate set, and outputs the set to the final prediction unit.
Based on the prediction estimate set, the final prediction unit combines all estimates in the set by weighted averaging to form the final predicted value; the weight of each estimate is the overall similarity value obtained for the corresponding historical graph iteration job during matching.
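The weighted combination performed by the final prediction unit can be illustrated as follows. This is a minimal sketch: the function name is hypothetical, and the weights are assumed to be normalized by their sum.

```python
from typing import Sequence


def final_prediction(estimates: Sequence[float], similarities: Sequence[float]) -> float:
    """Combine single-job estimates, weighting each by its overall similarity (DoS)."""
    if len(estimates) != len(similarities):
        raise ValueError("one similarity value is required per estimate")
    total = sum(similarities)
    if total <= 0:
        raise ValueError("similarities must sum to a positive value")
    return sum(e * s for e, s in zip(estimates, similarities)) / total
```

For example, two matched historical jobs with estimates of 100 s and 200 s and overall similarities of 0.75 and 0.25 yield a final prediction of 125 s, so the closer match dominates.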
The invention also provides a runtime prediction method for graph iteration jobs in the Gaia system. The prediction process is as follows:
Step 1: Acquire the set of historical graph iteration jobs, the initial input data set, the graph iteration job to be predicted, and the job parameters and iteration termination condition of that job.
Step 2: The sampling execution unit samples the input data set at a sampling ratio of 10%, then scales the convergence parameters of the graph iteration job to be predicted according to the sampling ratio, and initializes the parameters after scaling.
Step 3: Set the data source of the graph iteration job to be predicted to the sampled sample set and the initial iteration number to 0; then execute the graph iteration job to be predicted and store the statistics captured in each iteration in an in-memory database.
Step 4: Calculate the static similarity between the graph iteration job to be predicted and each job in the set of historical graph iteration jobs from the statistics captured in step 3.
Step 5: Re-acquire the job parameters and iteration termination condition of the graph iteration job to be predicted, and initialize the parameters, the static similarity, and the dynamic similarity; set the data source of the graph iteration job to be predicted to the initial input data set and the iteration number to 0.
Step 6: When the iteration number is 0, output the historical graph iteration jobs that match the graph iteration job to be predicted according to the static similarity calculated in step 4; the runtime prediction module then calculates a static predicted value of the runtime of the graph iteration job to be predicted from the matched historical graph iteration jobs.
Step 7: Run the graph iteration job to be predicted, performing iterative computation on the input data, to obtain the dynamic similarity between the graph iteration job to be predicted and each graph iteration job in the set of historical graph iteration jobs.
Step 8: The job matching module outputs the historical graph iteration jobs that match the graph iteration job to be predicted according to the static similarity calculated in step 4 and the dynamic similarity calculated in step 7, and a dynamic predicted value of the runtime of the graph iteration job to be predicted is then calculated from the matched historical graph iteration jobs.
Step 9: Determine whether the iteration number or the iteration output satisfies the iteration termination condition; if not, return to step 7; if so, go to step 10.
Step 10: Calculate the prediction deviation of the runtime of the graph iteration job to be predicted at the different iterations; the dynamic predicted value with the smallest prediction deviation is the final predicted value of the runtime of the graph iteration job to be predicted.
Further, step 7 is carried out through steps 7.1 to 7.4 as follows:
Step 7.1: Increase the iteration number by 1.
Step 7.2: Execute the iteration step function in parallel on each node of the Gaia cluster; after every node has finished, synchronize the data on the nodes by broadcast to keep the data on different nodes consistent.
Step 7.3: After data synchronization completes, add the statistics captured during this iteration to the in-memory database.
Step 7.4: Calculate the dynamic similarity between the graph iteration job to be predicted and each graph iteration job in the set of historical graph iteration jobs from the statistics updated in step 7.3.
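The control flow of steps 6 to 10 can be sketched as a loop. This is an illustration, not the patented implementation: `predict_during_execution`, `best_prediction`, and the callable parameters are hypothetical names, the Gaia execution and the similarity-based matching are injected as callables, and a maximum iteration count is assumed as a safety bound.

```python
from typing import Callable, List, Sequence


def predict_during_execution(
    run_iteration: Callable[[int], bool],
    dynamic_prediction: Callable[[int], float],
    static_prediction: float,
    max_iterations: int,
) -> List[float]:
    """Return the static prediction followed by one dynamic prediction per iteration.

    run_iteration(step) executes one iteration and returns True once the
    termination condition holds; dynamic_prediction(step) stands in for
    similarity calculation, job matching, and runtime prediction.
    """
    predictions = [static_prediction]                  # step 6 (iteration number 0)
    step = 0
    while step < max_iterations:
        step += 1                                      # step 7.1
        terminated = run_iteration(step)               # steps 7.2 and 7.3
        predictions.append(dynamic_prediction(step))   # steps 7.4 and 8
        if terminated:                                 # step 9
            break
    return predictions


def best_prediction(dynamic_predictions: Sequence[float], actual_runtime: float) -> float:
    """Step 10: pick the dynamic prediction with the smallest deviation."""
    return min(dynamic_predictions, key=lambda p: abs(p - actual_runtime))
```

Because step 10 compares predictions against the real runtime, `best_prediction` can only be evaluated after the job has finished.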
Advantageous effects:
The invention provides a runtime prediction method for graph iteration jobs in the Gaia system. Before a job is executed, sampling execution quickly captures the offline features of the current graph iteration algorithm, including the convergence features and the key input features of each iteration. During job execution, runtime features are captured continuously, including job parameters, resource utilization, and detailed statistics. The similarity between jobs is defined as the basis for job matching and for calculating the final predicted value; it consists mainly of the static similarity captured by sampling execution and the dynamic similarity captured by real execution. Using established similarity evaluation criteria, the core matching algorithm trains its own parameters so that matching adapts automatically to the various similarities of iteration jobs. The method is an end-to-end runtime prediction method that combines the offline features and the runtime features of graph iteration jobs and can accurately predict the runtime of distributed graph iteration jobs at low training overhead.
Drawings
FIG. 1 is a diagram of an iterative model architecture supported in a Gaia system as provided by the background of the invention;
FIG. 2 is a flow chart of prediction in a runtime prediction method according to an embodiment of the present invention;
FIG. 3 is a diagram of the system architecture of a runtime prediction system on a Gaia system according to an embodiment of the present invention;
FIG. 4 is an organizational chart of a runtime prediction method according to an embodiment of the present invention;
FIG. 5 is a graph of predicted relative error distribution for various iterative operations by the runtime prediction system provided by an embodiment of the present invention;
FIG. 6 is a graph of the average relative prediction error of the runtime prediction system for various iterative jobs according to an embodiment of the present invention.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The invention provides a runtime prediction system for graph iteration jobs in the Gaia system, which comprises a similarity management module, a job matching module, and a runtime prediction module.
The similarity management module comprises a sampling execution unit, a static similarity calculation unit and a dynamic similarity calculation unit.
The sampling execution unit samples the graph data set using a biased random jump method to obtain sample graph data. For an iterative algorithm such as PageRank, whose convergence threshold changes with the size of the data set, a scaling function is then required to adjust the parameters of the iterative algorithm. The scaling function $T$ can be described as follows:
Figure BDA0003195634930000091
where $Conf_S \Rightarrow Conf_G$ denotes the mapping of the configuration parameters, $Conf$ denotes the configuration parameters, $S$ denotes the original graph, $G$ denotes the sampled sample graph, $Conv_S \Rightarrow Conv_G$ denotes the mapping of the convergence parameter, and $sr$ is the sampling ratio. The iterative algorithm with scaled parameters is then executed quickly on the sampled data set to capture the offline features of the sample graph iteration job.
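The text does not spell out the biased random jump method. The sketch below assumes one common form of it: a random walk that starts from the vertices with the highest out-degree and jumps back to such a seed with a fixed probability. The function name, the jump probability of 0.15, the single seed, and the adjacency-list input are illustrative assumptions; only the 10% sampling ratio comes from the method steps.

```python
import random
from typing import Dict, List, Set


def biased_random_jump_sample(
    adj: Dict[int, List[int]],
    ratio: float = 0.10,
    jump_prob: float = 0.15,
    num_seeds: int = 1,
    rng_seed: int = 0,
) -> Set[int]:
    """Sample about ratio * |V| vertices by a random walk with jumps to seeds.

    Assumes every vertex appears as a key of adj (possibly with an empty list).
    """
    if not adj:
        return set()
    rng = random.Random(rng_seed)
    vertices = list(adj)
    target = max(1, int(len(vertices) * ratio))
    # Bias (assumption): seeds are the vertices with the highest out-degree.
    seeds = sorted(vertices, key=lambda v: len(adj[v]), reverse=True)[:num_seeds]
    sampled: Set[int] = set()
    current = rng.choice(seeds)
    for _ in range(100 * len(vertices)):  # bound the walk in case it gets stuck
        sampled.add(current)
        if len(sampled) >= target:
            break
        neighbors = adj[current]
        if not neighbors or rng.random() < jump_prob:
            current = rng.choice(seeds)      # jump back to a seed vertex
        else:
            current = rng.choice(neighbors)  # follow a random out-edge
    # Top up with random vertices if the walk could not reach enough of them.
    remaining = [v for v in vertices if v not in sampled]
    rng.shuffle(remaining)
    while len(sampled) < target:
        sampled.add(remaining.pop())
    return sampled
```

The sample graph would then be the subgraph induced by the returned vertices; walking along edges, rather than picking vertices uniformly, tends to preserve connectivity, which matters for iterative algorithms that propagate values along edges.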
The static similarity calculation unit calculates the static similarity, which is defined with the full execution process of the iterative algorithm in mind, from the feature information acquired by the sampling execution unit. The static similarity is the similarity between the current graph iteration job $Job_{run}$ and the historical graph iteration job $Job_{comp}$ and includes the following two kinds:
(1) Computation similarity. In the computation phase, the user-defined iterative algorithm logic is executed for each vertex of the input graph data. For a large-scale iterative algorithm, the cost of the computation for each vertex is fixed, so the time spent in the computation phase is proportional to the number of messages to be processed, that is, the number of active neighbor nodes. The computation similarity is therefore defined as
Figure BDA0003195634930000101
where
Figure BDA0003195634930000102
$ActVert_i(Job)$ is the number of active nodes of the job in the $i$-th iteration, $TotVert_i(Job)$ is the total number of nodes of the job in the $i$-th iteration, and $N$ is the total number of iterations in the sampling execution.
(2) Message-passing similarity. In the message-passing phase, messages from different processors must be sent over the network and added to the memory of the destination node, so the runtime of this phase is proportional to the number and size of remote messages; in particular, memory sharing in the Gaia system makes the communication cost of local messages negligible. The message-passing similarity is therefore defined as
Figure BDA0003195634930000103
where
Figure BDA0003195634930000104
$RemMsg_i(Job)$ is the number of messages passed by the job in the $i$-th iteration,
Figure BDA0003195634930000105
is the average message size of the job in the $i$-th iteration, and $N$ is the total number of iterations in the sampling execution.
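The per-iteration quantities that feed the two static similarities can be computed directly from the statistics of the sampling run. The sketch below assumes that the active-node proportion is $ActVert_i / TotVert_i$ and that the message volume is $RemMsg_i$ multiplied by the average message size, as the surrounding definitions suggest; the formulas for $S_1$ and $S_2$ themselves are given only in the figures and are not reproduced here. The function names are hypothetical.

```python
from typing import List, Sequence


def active_ratios(act_vert: Sequence[int], tot_vert: Sequence[int]) -> List[float]:
    """Proportion of active nodes in each iteration of the sampling run."""
    if len(act_vert) != len(tot_vert):
        raise ValueError("one total is required per iteration")
    return [a / t for a, t in zip(act_vert, tot_vert)]


def message_volumes(rem_msg: Sequence[int], avg_msg_size: Sequence[float]) -> List[float]:
    """Total message volume in each iteration: message count times average size."""
    if len(rem_msg) != len(avg_msg_size):
        raise ValueError("one average size is required per iteration")
    return [m * s for m, s in zip(rem_msg, avg_msg_size)]
```

Each list has one entry per iteration $i = 1, \ldots, N$; comparing the lists of $Job_{run}$ and $Job_{comp}$ entry by entry gives the raw material for the two static similarities.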
The dynamic similarity calculation unit calculates the dynamic similarity defined by the dynamic similarity calculation unit on the basis of comprehensively considering the domain knowledge and the experiment through the statistical information captured in the real execution of the graph iteration operation. Similarly, the dynamic similarity is the similarity between the current graph iteration operation and the historical graph iteration operation, and the dynamic similarities cover the whole process of the graph iteration operation in the running time layer. The method comprises the following 4 types:
(1) run-time similarity. The runtime similarity compares the runtime of each iteration of the graph iteration job from the beginning until the current execution. For the graph iteration job currently being executed, it is reasonable to assume that the run times of the remaining iterations are also likely to be similar, assuming that the iterations that have already been executed have a large similarity to the historical graph iteration job at run time. The run-time similarity is obtained by calculating the average of the run-time deviations for each iteration corresponding to the two graph iteration jobs. Run-time similarity is defined as
Figure BDA0003195634930000111
where Step represents the current iteration number of the job;
Figure BDA0003195634930000112
runtime(Job, i) represents the run time of the graph iteration job at the i-th iteration.
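As a rough illustration of the description above, the following sketch computes a run-time similarity as one minus the mean relative per-iteration deviation. The exact formula is given only by the equation figure and is not reproduced here; the normalisation by the larger of the two run times and the function name `runtime_similarity` are assumptions made purely for illustration.

```python
from typing import Sequence


def runtime_similarity(run_cur: Sequence[float],
                       run_hist: Sequence[float],
                       step: int) -> float:
    """Illustrative run-time similarity over iterations 1..step.

    Assumption (not from the source): the deviation at iteration i is
    |a - b| / max(a, b), so the result lies in [0, 1] and equals 1 when
    the per-iteration run times are identical.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    # Compare only iterations that both jobs have actually executed.
    n = min(step, len(run_cur), len(run_hist))
    if n == 0:
        raise ValueError("no common iterations to compare")
    total_dev = 0.0
    for a, b in zip(run_cur[:n], run_hist[:n]):
        denom = max(a, b)
        total_dev += abs(a - b) / denom if denom > 0 else 0.0
    return 1.0 - total_dev / n
```

For example, two jobs with identical per-iteration run times yield a similarity of 1 under this assumed normalisation.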
(2) Resource allocation similarity. The resource allocation similarity compares the number of containers used in all iterations of the graph iteration job from the start up to the current iteration. As described in subsection 3.2, a container is an abstract representation of the underlying computing resources of the resource management system, such as CPU cores, memory, disk and network bandwidth. The resource allocation similarity makes full use of the resource allocation strategy of the resource management system to reflect the performance consumption of the graph iteration job during execution. The resource allocation similarity is defined as:
Figure BDA0003195634930000113
where Step denotes the current iteration number of the job, cons(Job, i) denotes the number of containers used in the i-th iteration of the graph iteration job, and cons_avg(Job, Step) denotes the average number of containers used over all iterations of the graph iteration job from the start up to the current iteration.
(3) Convergence similarity. Leaving aside the parallel capability of the distributed cluster and the influence of data partitioning, the convergence behavior of a graph iteration job depends mainly on the input data set and the system parameters. We use the number of data items actively processed by the graph iteration job as an intuitive measure of its iteration behavior, and obtain the convergence similarity by averaging the deviation in the number of actively processed data items at each iteration of the two graph iteration jobs from the start of execution up to the current state. The convergence similarity is defined as:
Figure BDA0003195634930000121
Figure BDA0003195634930000122
where Step represents the current iteration number of the job, and workSet(Job, i) represents the number of data items to be actively processed in the i-th iteration of the graph iteration job.
(4) Resource utilization similarity. The resource management system monitors the iterative application, supervises the lifecycle of the containers, and returns the resource utilization in each container via a heartbeat. From the returned information, the arithmetic mean of the resource utilization is calculated from two perspectives, container and iteration number, during the execution of the graph iteration job, and is maintained as the state of the job in a three-dimensional array RsUtil_job[hw, step, containerId]. The three dimensions of the array represent, from low to high, the system hardware device, the iteration number and the container number; that is, RsUtil_job[i, j, k] is the average utilization of hardware device i in container k up to the j-th iteration of the graph iteration job. The resource utilization similarity is defined as:
Figure BDA0003195634930000123
Figure BDA0003195634930000124
where Step represents the current iteration number of the job, and cons represents the number of containers used by the graph iteration job. Equation 8 applies only to iterations that use the same number of containers, because it computes the relative average deviation of resource utilization over pairs of containers at each iteration. For iterations that use different numbers of containers, the average deviation over all containers is calculated directly.
The job matching module comprises a similarity evaluation unit, a training point collection unit and a parameter training unit.
The similarity evaluation unit ensures the effectiveness of the similarity management module by establishing an evaluation function. Specifically, for the historical graph iteration job set H, we assume that a historical job J (J ∈ H) is currently running and has just executed its i-th iteration, and that a historical job K (K ∈ H and K ≠ J) was run before job J started. We can then calculate the job similarities S = {S_1, ..., S_6} of the two jobs at the i-th iteration, where S_1 to S_6 correspond to the six individual similarities of the job (two static and four dynamic). Finally, we define the overall similarity DoS of job J and job K as the weighted average of the individual similarities:
Figure BDA0003195634930000131
where the parameters w_i (i = 1, ..., 6) are the weights of the individual job similarities S_i. It is then assumed that the run-time prediction system described here predicts, from the similarity between the two jobs, the remaining run time of job J
Figure BDA0003195634930000132
We can then calculate the prediction accuracy acc. By repeating the above simulation many times, we finally obtain a set P of points (DoS, acc).
The points in P thus express the relationship between the similarity of graph iteration jobs and the final prediction accuracy. From the set P obtained by the above simulation we define the following similarity evaluation criteria:
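The weighted average that defines the overall similarity can be sketched as follows. The source gives the formula only as an equation figure, so normalising by the sum of the weights is an assumption, and the function name `overall_similarity` is hypothetical.

```python
from typing import Sequence


def overall_similarity(sims: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of the individual similarities S_i with weights w_i.

    Assumption: the weights are normalised by their sum, so equal weights
    reduce to a plain arithmetic mean.
    """
    if len(sims) != len(weights):
        raise ValueError("sims and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, sims)) / total_weight
```

With equal weights the result is simply the mean of the individual similarities; raising one weight pulls DoS toward that similarity.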
h(t):=avg{acc|(DoS,acc)∈P,DoS≥t} (11)
n(t):=|{acc|(DoS,acc)∈P,DoS≥t}|/|P| (12)
Here t is a similarity threshold: the overall similarity between the current graph iteration job and a matched historical job should be at least t. Specifically, the average optimal precision function h(t) computes the average accuracy over all points in P whose similarity is at least the threshold t, and the optimal similarity ratio function n(t) gives the share of points in P whose similarity is at least t.
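Equations (11) and (12) translate almost directly into code. The sketch below follows them literally; returning 0.0 when no point reaches the threshold is an assumption, since the average is undefined in that case.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (DoS, acc)


def h(points: List[Point], t: float) -> float:
    """Average accuracy over all points whose DoS is at least t (equation 11)."""
    accs = [acc for dos, acc in points if dos >= t]
    # Assumption: an empty selection yields 0.0 instead of raising.
    return sum(accs) / len(accs) if accs else 0.0


def n(points: List[Point], t: float) -> float:
    """Share of points in P whose DoS is at least t (equation 12)."""
    if not points:
        return 0.0
    return sum(1 for dos, _ in points if dos >= t) / len(points)
```

Raising t typically increases h(t) while decreasing n(t), which is the trade-off the threshold has to balance.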
The training point collection unit collects the data points used to train the similarity weight vector. First, n ordered draws without repetition are made from the historical graph iteration job set, each draw extracting two graph iteration jobs J_i and K_i (i = 1, 2, 3, ..., n). Job J_i is assumed to be running and to have just finished its m-th iteration (m ∈ [1, m_i]), where m_i is the maximum number of iterations of graph iteration job J_i. Job K_i is then assumed to be the historical job matched with J_i, so that the run-time prediction module can predict the remaining run time of the running graph iteration job J_i from the similarity between the two jobs, while the true remaining run time of job J_i is known. By using different current jobs and matched historical jobs, and by assuming that the current job is at different iterations, a set P(w) of elements of the form (predict(w), actual) can be computed as training points. Here predict(w) is the predicted remaining execution time when w is used as the weight vector, and actual is the true remaining execution time of the current graph iteration job.
The parameter training unit finds a suitable similarity threshold t and similarity weight vector w = (w_1, ..., w_6) for the currently running graph iteration job, such that the values of the functions h(t) and n(t) in the similarity evaluation unit are kept at a high level and the current graph iteration job has enough matching results for run-time prediction.
Training the similarity threshold. The similarity threshold decides which jobs in the historical graph iteration job set are matched. In addition to a global threshold t, which eliminates historical graph iteration jobs with low overall similarity, each individual similarity S_i (i = 1, ..., 6) has its own threshold t_i, which eliminates historical jobs whose similarity in that dimension is low. The parameter training unit trains the global threshold t and each individual threshold t_i using the similarity evaluation criteria. Specifically, it sets a minimum similarity ratio n_min and a minimum average precision h_min so that the parameter search does not choose a threshold so high that too few historical jobs are selected. The conditions are as follows.
t = max{t ∈ [0,1] | h(t) ≥ h_min ∧ n(t) ≥ n_min} (13)
t_i = max{t_i ∈ [0,1] | h(t_i) ≥ h_min ∧ n(t_i) ≥ n_min} (14)
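One simple way to evaluate condition (13) is a scan over a uniform grid of candidate thresholds, from high to low. The grid resolution, the function name `train_threshold` and the choice to return `None` when no threshold qualifies are assumptions; the source does not say how the maximum is searched.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (DoS, acc)


def train_threshold(points: List[Point], h_min: float, n_min: float,
                    grid: int = 100) -> Optional[float]:
    """Largest grid value t in [0, 1] with h(t) >= h_min and n(t) >= n_min."""
    if not points:
        return None
    # Scan from t = 1 downwards so the first hit is the maximum.
    for k in range(grid, -1, -1):
        t = k / grid
        accs = [acc for dos, acc in points if dos >= t]
        if not accs:
            continue  # h(t) is undefined when no point reaches t
        h_t = sum(accs) / len(accs)
        n_t = len(accs) / len(points)
        if h_t >= h_min and n_t >= n_min:
            return t
    return None
```

The same routine could be applied per dimension for condition (14) by passing points of the form (S_i, acc) instead of (DoS, acc).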
Training the similarity weight vector. To obtain effective hyper-parameters quickly, we perform distributed asynchronous optimization of the weight vector w with the Hyperopt tool (a Python library for hyper-parameter optimization). The data set is the set P(w) collected by the training point collection unit, and the average relative prediction error over the training set is used as the loss function o:
Figure BDA0003195634930000141
where predict(w) is the predicted remaining execution time when w is used as the weight vector, and actual is the true remaining execution time of the current graph iteration job.
The run-time prediction module mainly comprises a single prediction unit and a final prediction unit. It predicts the remaining run time of the current graph iteration job from the historical graph iteration jobs matched by the job matching module; the specific prediction process is shown in fig. 2.
The single prediction unit predicts the run time of the current graph iteration job from a single matched historical job. First, the current job is assumed to have the same number of iterations as the matched historical job. Because the cluster configuration and program logic are not always identical, the current job may be highly similar to the historical job and still differ slightly in run time. The single prediction unit therefore calculates the relative average deviation coefficient δ between the current graph iteration job and each matched historical job. For the current job Job_run and the i-th matched historical job
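The loss described above, an average relative prediction error over (predict(w), actual) pairs, can be sketched in plain Python. The exact formula appears only as an equation figure, so the absolute-value form below is the conventional definition and is an assumption here; the function name `mean_relative_error` is hypothetical.

```python
from typing import Iterable, Tuple


def mean_relative_error(training_points: Iterable[Tuple[float, float]]) -> float:
    """Average of |predicted - actual| / actual over all training points.

    Each point is (predicted_remaining_time, actual_remaining_time);
    actual values are assumed to be strictly positive.
    """
    points = list(training_points)
    if not points:
        raise ValueError("at least one training point is required")
    total = 0.0
    for predicted, actual in points:
        if actual <= 0:
            raise ValueError("actual remaining time must be positive")
        total += abs(predicted - actual) / actual
    return total / len(points)
```

A function of this shape, wrapped so that it first recomputes the predictions for a candidate weight vector w, is the kind of objective a hyper-parameter optimizer would minimise.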
Figure BDA0003195634930000151
the relative average deviation coefficient up to the S-th iteration is calculated as follows:
Figure BDA0003195634930000152
where
Figure BDA0003195634930000153
and
Figure BDA0003195634930000154
represent the true run times of the current job and the historical job, respectively, at the k-th iteration.
Assuming that the run-time deviation coefficient observed up to the current iteration persists until the job finishes, the run-time prediction based on a single matched historical job is calculated as follows:
Figure BDA0003195634930000155
The output of the single prediction unit is an estimate set runtime_run, whose elements are the estimates of the current job's run time obtained from the individual single predictions; end is the maximum number of iterations.
The final prediction unit takes the estimate set calculated in the previous step and merges the individual estimates, each based on one matched historical job, into a single final prediction value by weighted averaging. The weight of each estimate is the overall similarity of the corresponding historical job at matching time, so historical jobs that are more similar to the current job contribute more to the final estimate. The run-time prediction value runtime_S of the current graph iteration job after executing up to the S-th iteration is calculated as follows:
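A minimal sketch of the single prediction step is given below. The formulas themselves survive only as equation figures, so two things are assumptions: the deviation coefficient is taken as the mean of (current - historical) / historical over the first S iterations, and the prediction is the elapsed time plus the historical remaining time scaled by (1 + δ). The function name is hypothetical.

```python
from typing import Sequence


def single_prediction(run_cur: Sequence[float],
                      run_hist: Sequence[float],
                      s: int) -> float:
    """Predicted total run time of the current job after its s-th iteration.

    run_cur  : per-iteration run times of the current job (at least s values)
    run_hist : per-iteration run times of the matched historical job,
               whose length plays the role of `end`
    """
    if not 1 <= s <= min(len(run_cur), len(run_hist)):
        raise ValueError("s must lie within the iterations executed by both jobs")
    # Assumed form of the relative average deviation coefficient.
    delta = sum((c - h) / h for c, h in zip(run_cur[:s], run_hist[:s])) / s
    elapsed = sum(run_cur[:s])
    # Assume the observed deviation persists for the remaining iterations.
    remaining = (1.0 + delta) * sum(run_hist[s:])
    return elapsed + remaining
```

Calling this once per matched historical job would produce the estimate set that the final prediction unit consumes.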
Figure BDA0003195634930000161
where DoS_i represents the overall similarity between the i-th historical job and the current job, end is the total number of iterations, and N is the number of matched historical jobs.
The specific prediction process of the run-time prediction method for graph iteration jobs is as follows:
Step 1: Acquire the historical graph iteration job set, the initial input data set, the graph iteration job to be predicted, and the parameters and iteration termination condition of the job.
Step 2: The sampling execution unit samples the input data set with a sampling ratio of 10%, then scales the convergence parameters of the graph iteration job to be predicted according to the sampling ratio, and initializes the scaled parameters.
Step 3: Set the data source of the graph iteration job to be predicted to the sampled data set and the initial iteration count to 0, then execute the job and store the statistical information captured in each iteration in an in-memory database.
Step 4: From the statistical information acquired in step 3, calculate the computation similarity and the message passing similarity between the graph iteration job to be predicted and each job in the historical graph iteration job set.
Step 5: Acquire the parameters and the iteration termination condition again, and initialize the parameters and the six job similarities. Set the data source of the graph iteration job to be predicted to the initial input data set, and set the iteration count to 0.
Step 6: If the iteration count is 0, the job matching module outputs the historical graph iteration jobs matched with the graph iteration job to be predicted according to the static similarity calculated in step 4, and the run-time prediction module then calculates a static prediction value of the run time of the graph iteration job to be predicted from the matched historical jobs, so that an estimate of the run time is available early.
Step 7: Run the graph iteration job to be predicted and perform the iterative computation on the input data.
Step 7.1: Increase the iteration count by 1.
Step 7.2: Execute the iteration step function on each node of the cluster in parallel; after all nodes have finished, synchronize the data on each node by broadcasting, so that the data on different nodes remain consistent.
Step 7.3: After data synchronization has finished, add the statistical information captured during the iteration to the in-memory database.
Step 7.4: The dynamic similarity calculation unit calculates the 4 dynamic similarities between the graph iteration job to be predicted and each job in the historical graph iteration job set from the statistical information updated in step 7.3.
Step 8: The job matching module outputs the historical graph iteration jobs matched with the graph iteration job to be predicted according to the static similarity calculated in step 4 and the dynamic similarity calculated in step 7.4, and the run-time prediction module calculates a dynamic prediction value of the run time of the graph iteration job to be predicted from the matched historical jobs.
Step 9: Check whether the iteration count or the iteration output meets the iteration termination condition; if not, repeat step 7; if so, go to step 10.
Step 10: Calculate the prediction deviation of the run time of the graph iteration job at the different iterations; the dynamic prediction value with the smallest deviation is the final prediction value.
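The control flow of steps 7 to 10 can be sketched as a loop with injected callables. This is a skeleton under stated assumptions only: `step_fn` and `predict_fn` are hypothetical stand-ins for one iteration of the job (steps 7.1 to 7.3) and for similarity calculation, matching and prediction (steps 7.4 and 8), and the deviation in step 10 is taken as the absolute difference from the measured total run time.

```python
from typing import Callable, List, Tuple


def run_with_prediction(
    step_fn: Callable[[int], Tuple[float, bool]],
    predict_fn: Callable[[int, float], float],
    max_iters: int,
) -> Tuple[float, List[float], float]:
    """Run the job, predicting after every iteration.

    step_fn(i)             -> (run time of iteration i, converged?)
    predict_fn(i, elapsed) -> predicted total run time after iteration i
    Returns (actual total run time, all predictions, best prediction).
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    elapsed = 0.0
    predictions: List[float] = []
    for i in range(1, max_iters + 1):               # step 7.1
        iter_time, converged = step_fn(i)           # steps 7.2 and 7.3
        elapsed += iter_time
        predictions.append(predict_fn(i, elapsed))  # steps 7.4 and 8
        if converged:                               # step 9
            break
    # Step 10: keep the prediction with the smallest deviation.
    best = min(predictions, key=lambda p: abs(p - elapsed))
    return elapsed, predictions, best
```

Passing the two phases in as functions keeps the loop independent of how similarities are computed, which makes the skeleton easy to exercise with stubbed timings.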
Example 1:
This embodiment uses the run-time prediction system integrated into the Gaia system, as shown in fig. 3, to process four iterative algorithms (PageRank, Connected Components, SSSP and Adsorption) as practical application scenarios; the data sets they use are listed in table 1.
The single-source shortest path algorithm (SSSP) calculates the shortest distance from a given source node to all other nodes in the graph. PageRank is a well-known web page ranking algorithm that updates the importance score of each page node through iterative computation. The Adsorption algorithm diffuses labels through the graph according to a random walk model until the label distribution at each node stabilizes. The Connected Components algorithm finds the connected parts of a large graph by iterative search. The data sets in table 1 all come from the real world and were downloaded from the Stanford Large Network Dataset Collection website.
TABLE 1 data set
Figure BDA0003195634930000181
In this embodiment, the run-time prediction system for graph-oriented iterative jobs in the Gaia system, as shown in fig. 4, comprises a similarity management module, a job matching module and a run-time prediction module.
The similarity management module comprises a sampling execution unit, a static similarity calculation unit and a dynamic similarity calculation unit.
The sampling execution unit samples the data set using a biased random jump method. For an iterative algorithm such as PageRank, whose convergence threshold changes with the size of the data set, a scaling function is then required to adjust the parameters of the algorithm. The scaling function T has the following form:
Figure BDA0003195634930000182
where Conf_S => Conf_G represents the mapping of the configuration parameters, Conv_S => Conv_G represents the mapping of the convergence parameters, and sr is the sampling ratio. The iterative algorithm with scaled parameters is then executed quickly on the sampled data set to capture the offline characteristics of the graph iteration job.
The static similarity calculation unit calculates the static similarities, which were defined with the execution process of the iterative algorithm in mind, from the characteristic information acquired by the sampling execution unit. Static similarity is the similarity between the current graph iteration job and a historical graph iteration job, and comprises the following 2 types:
1. Computation similarity. In the computation phase, each vertex of the input graph data executes the user-defined iterative algorithm logic. For a large-scale iterative algorithm the cost of computation per vertex is fixed, so the time spent in the computation phase is proportional to the number of messages to be processed, that is, to the number of active neighbor nodes. The computation similarity is therefore defined as
Figure BDA0003195634930000191
where
Figure BDA0003195634930000192
ActVert_i(Job) represents the number of active vertices of the job at the i-th iteration, TotVert_i(Job) represents the total number of vertices of the job at the i-th iteration, and N represents the total number of iterations in the sampling execution.
2. Message passing similarity. During the message passing phase, messages from different processors must be sent over the network and added to the memory of the destination node, so the run time of this phase is proportional to the number and size of remote messages. In particular, memory sharing in the Gaia system makes the communication cost of local messages negligible. The message passing similarity is therefore defined as
Figure BDA0003195634930000193
where
Figure BDA0003195634930000194
RemMsg_i(Job) represents the number of messages passed by the job at the i-th iteration,
Figure BDA0003195634930000195
represents the average message size at the i-th iteration of the job, and N represents the total number of iterations in the sampling execution.
The dynamic similarity calculation unit calculates the dynamic similarities, which were defined on the basis of domain knowledge and experiments, from the statistical information captured during the real execution of the graph iteration job. As with static similarity, dynamic similarity is the similarity between the current graph iteration job and a historical graph iteration job; the dynamic similarities cover the whole course of the graph iteration job at the run-time level. There are the following 4 types:
(1) Run-time similarity. The run-time similarity compares the run time of each iteration of the graph iteration job from the start up to the current iteration. If the iterations already executed by the current graph iteration job are very similar in run time to those of a historical graph iteration job, it is reasonable to assume that the run times of the remaining iterations are also likely to be similar. The run-time similarity is obtained by averaging the run-time deviations of corresponding iterations of the two graph iteration jobs. The run-time similarity is defined as
Figure BDA0003195634930000201
where Step represents the current iteration number of the job, and runtime(Job, i) represents the run time of the graph iteration job at the i-th iteration.
(2) Resource allocation similarity. The resource allocation similarity compares the number of containers used in all iterations of the graph iteration job from the start up to the current iteration. As described in subsection 3.2, a container is an abstract representation of the underlying computing resources of the resource management system, such as CPU cores, memory, disk and network bandwidth. The resource allocation similarity makes full use of the resource allocation strategy of the resource management system to reflect the performance consumption of the graph iteration job during execution. The resource allocation similarity is defined as:
Figure BDA0003195634930000202
where Step denotes the current iteration number of the job, cons(Job, i) denotes the number of containers used in the i-th iteration of the graph iteration job, and cons_avg(Job, Step) denotes the average number of containers used over all iterations of the graph iteration job from the start up to the current iteration.
(3) Convergence similarity. Leaving aside the parallel capability of the distributed cluster and the influence of data partitioning, the convergence behavior of a graph iteration job depends mainly on the input data set and the system parameters. We use the number of data items actively processed by the graph iteration job as an intuitive measure of its iteration behavior, and obtain the convergence similarity by averaging the deviation in the number of actively processed data items at each iteration of the two graph iteration jobs from the start of execution up to the current state. The convergence similarity is defined as:
Figure BDA0003195634930000211
Figure BDA0003195634930000212
where Step represents the current iteration number of the job, and workSet(Job, i) represents the number of data items to be actively processed in the i-th iteration of the graph iteration job.
(4) Resource utilization similarity. The resource management system monitors the iterative application, supervises the lifecycle of the containers, and returns the resource utilization in each container via a heartbeat. From the returned information, the arithmetic mean of the resource utilization is calculated from two perspectives, container and iteration number, during the execution of the graph iteration job, and is maintained as the state of the job in a three-dimensional array RsUtil_job[hw, step, containerId]. The three dimensions of the array represent, from low to high, the system hardware device, the iteration number and the container number; that is, RsUtil_job[i, j, k] is the average utilization of hardware device i in container k up to the j-th iteration of the graph iteration job. The resource utilization similarity is defined as:
Figure BDA0003195634930000213
Figure BDA0003195634930000214
where Step represents the current iteration number of the job, and cons represents the number of containers used by the graph iteration job. Equation 26 applies only to iterations that use the same number of containers, because it computes the relative average deviation of resource utilization over pairs of containers at each iteration. For iterations that use different numbers of containers, the average deviation over all containers is calculated directly.
The job matching module comprises a similarity evaluation unit, a training point collection unit and a parameter training unit.
The similarity evaluation unit ensures the effectiveness of the similarity management module by establishing an evaluation function. Specifically, for the historical graph iteration job set H, we assume that a historical job J (J ∈ H) is currently running and has just executed its i-th iteration, and that a historical job K (K ∈ H and K ≠ J) was run before job J started. We can then calculate the job similarities S = {S_1, ..., S_6} of the two jobs at the i-th iteration, where S_1 to S_6 correspond to the six individual similarities of the job (two static and four dynamic). Finally, the overall similarity DoS of job J and job K is defined as the weighted average of the individual similarities:
Figure BDA0003195634930000221
where the parameters w_i (i = 1, ..., 6) are the weights of the individual job similarities S_i. It is then assumed that the run-time prediction system described here predicts, from the similarity between the two jobs, the remaining run time of job J
Figure BDA0003195634930000222
We can then calculate the prediction accuracy acc. By repeating the above simulation many times, we finally obtain a set P of points (DoS, acc).
The points in P thus express the relationship between the similarity of graph iteration jobs and the final prediction accuracy. From the set P obtained by the above simulation we define the following similarity evaluation criteria:
h(t):=avg{acc|(DoS,acc)∈P,DoS≥t} (29)
n(t):=|{acc|(DoS,acc)∈P,DoS≥t}|/|P| (30)
Here t is a similarity threshold: the overall similarity between the current graph iteration job and a matched historical job should be at least t. Specifically, the average optimal precision function h(t) computes the average accuracy over all points in P whose similarity is at least the threshold t, and the optimal similarity ratio function n(t) gives the share of points in P whose similarity is at least t.
The training point collection unit collects the data points used to train the similarity weight vector. First, n ordered draws without repetition are made from the historical graph iteration job set, each draw extracting two graph iteration jobs J_i and K_i (i = 1, 2, 3, ..., n). Job J_i is assumed to be running and to have just finished its m-th iteration (m ∈ [1, m_i]), where m_i is the maximum number of iterations of graph iteration job J_i. Job K_i is then assumed to be the historical job matched with J_i, so that the run-time prediction module can predict the remaining run time of the running graph iteration job J_i from the similarity between the two jobs, while the true remaining run time of job J_i is known. By using different current jobs and matched historical jobs, and by assuming that the current job is at different iterations, a set P(w) of elements of the form (predict(w), actual) can be computed as training points.
The parameter training unit finds a suitable similarity threshold t and similarity weight vector w = (w_1, ..., w_6) for the currently running graph iteration job, such that the values of the functions h(t) and n(t) in the similarity evaluation unit are kept at a high level and the current graph iteration job has enough matching results for run-time prediction.
Training the similarity threshold. The similarity threshold decides which jobs in the historical graph iteration job set are matched. In addition to a global threshold t, which eliminates historical graph iteration jobs with low overall similarity, each individual similarity S_i (i = 1, ..., 6) has its own threshold t_i, which eliminates historical jobs whose similarity in that dimension is low. The parameter training unit trains the global threshold t and each individual threshold t_i using the similarity evaluation criteria. Specifically, it sets a minimum similarity ratio n_min and a minimum average precision h_min so that the parameter search does not choose a threshold so high that too few historical jobs are selected. The conditions are as follows.
t = max{t ∈ [0,1] | h(t) ≥ h_min ∧ n(t) ≥ n_min} (31)
t_i = max{t_i ∈ [0,1] | h(t_i) ≥ h_min ∧ n(t_i) ≥ n_min} (32)
Training the similarity weight vector. To obtain effective hyper-parameters quickly, we perform distributed asynchronous optimization of the weight vector w with the Hyperopt tool. The data set is the set P(w) collected by the training point collection unit, and the average relative prediction error over the training set is used as the loss function o:
Figure BDA0003195634930000231
where predict(w) is the predicted remaining execution time when w is used as the weight vector, and actual is the true remaining execution time of the current graph iteration job.
The run-time prediction module mainly comprises a single prediction unit and a final prediction unit. It predicts the remaining run time of the current graph iteration job from the historical graph iteration jobs matched by the job matching module; the specific prediction process is shown in fig. 2.
The single prediction unit predicts the run time of the current graph iteration job from a single matched historical job. First, the current job is assumed to have the same number of iterations as the matched historical job. Because the cluster configuration and program logic are not always identical, the current job may be highly similar to the historical job and still differ slightly in run time. The single prediction unit therefore calculates the relative average deviation coefficient δ between the current graph iteration job and each matched historical job. For the current job Job_run and the i-th matched historical job
Figure BDA0003195634930000241
the relative average deviation coefficient up to the S-th iteration is calculated as follows:
Figure BDA0003195634930000242
where
Figure BDA0003195634930000243
and
Figure BDA0003195634930000244
represent the true run times of the current job and the historical job, respectively, at the k-th iteration.
Assuming that the run-time deviation coefficient observed up to the current iteration persists until the job finishes, the run-time prediction based on a single matched historical job is calculated as follows:
Figure BDA0003195634930000245
The output of the single prediction unit is an estimate set runtime_run, whose elements are the estimates of the current job's run time obtained from the individual single predictions.
The final prediction unit takes the estimate set calculated in the previous step and merges the individual estimates, each based on one matched historical job, into a single final prediction value by weighted averaging. The weight of each estimate is the overall similarity of the corresponding historical job at matching time, so historical jobs that are more similar to the current job contribute more to the final estimate. The run-time prediction value runtime_S of the current graph iteration job after executing up to the S-th iteration is calculated as follows:
Figure BDA0003195634930000251
where DoS_i denotes the overall similarity between the i-th historical job and the current job, end is the total number of iterations, and N is the number of matched historical jobs.
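The similarity-weighted merge can be sketched in a few lines of Python. This is a minimal illustration, assuming the weighted average is normalized by the sum of the similarity values; the function name `final_prediction` and the sample numbers are hypothetical and not taken from the patent.

```python
from typing import Sequence


def final_prediction(estimates: Sequence[float],
                     similarities: Sequence[float]) -> float:
    """Merge per-job estimates into one value, weighting each estimate by the
    overall similarity (DoS) of the historical job it came from."""
    if not estimates or len(estimates) != len(similarities):
        raise ValueError("need one similarity value per estimate")
    total_weight = sum(similarities)
    if total_weight <= 0:
        raise ValueError("similarity weights must sum to a positive value")
    return sum(d * e for d, e in zip(similarities, estimates)) / total_weight


# Two matched historical jobs: the more similar one pulls the result toward 100.
merged = final_prediction(estimates=[100.0, 200.0], similarities=[0.9, 0.6])
```

With these inputs the merged value is about 140, closer to the estimate of the job with the higher similarity.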
In this embodiment, the runtime prediction method of the present invention is used to predict the running time of multiple graph iteration jobs, formed by running the above four iterative algorithms on different data sets, and the relative prediction error as it varies over the course of iteration is shown in Fig. 5. The prediction error of each graph iteration job's running time settles at about 10% once it has stabilized. At the start of a job, however, the relative error briefly rises from low to high. At that stage the static similarity carries most of the weight; once iteration begins, the weight of the dynamic similarity gradually increases, but only a little statistical information has been captured, so the dynamic similarity is inaccurate and the relative error rises. As iteration proceeds, more and more of the statistical information used to construct the dynamic similarity is captured, and the relative prediction error falls, reaching its minimum at about 25%-30% of the total number of iterations. Near the end of a graph iteration job, even a small absolute error yields a high relative error, because the remaining run time to be predicted is small.
For each graph iteration job in Fig. 5, the present invention calculates the job's average relative error from representative sample points chosen by the following rule: starting from 5% of the total number of iterations, 5 sample points are taken in steps of 20%. Fig. 6 shows the average relative error of each graph iteration job, i.e. each of the 4 graph iteration algorithms described above on each data set; the average relative error over all predicted jobs is around 8%.
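The sampling rule can be made concrete with a small Python sketch. It assumes that the sample points fall at 5%, 25%, 45%, 65% and 85% of the total iteration count, rounded up to a whole iteration, and that the relative error is |predicted - actual| / actual; the rounding choice and the function names are assumptions made for illustration.

```python
from typing import List, Sequence


def sample_points(total_iterations: int,
                  start_pct: int = 5,
                  step_pct: int = 20,
                  count: int = 5) -> List[int]:
    """Iteration indices at 5%, 25%, 45%, 65% and 85% of the job, rounded up."""
    if total_iterations < 1:
        raise ValueError("total_iterations must be positive")
    return [((start_pct + j * step_pct) * total_iterations + 99) // 100
            for j in range(count)]


def average_relative_error(predicted: Sequence[float],
                           actual: Sequence[float]) -> float:
    """Mean of |predicted - actual| / actual over the sampled iterations."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need matching, non-empty prediction and truth lists")
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


points = sample_points(100)
error = average_relative_error([110.0, 90.0], [100.0, 100.0])
```

For a job with 100 iterations the sampled iterations are 5, 25, 45, 65 and 85, and the two sample predictions above are each off by 10%.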
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. The running time prediction system for the graph-oriented iterative operation in the Gaia system is characterized by comprising a similarity management module, an operation matching module and a running time prediction module;
the similarity management module comprises a sampling execution unit, a static similarity calculation unit and a dynamic similarity calculation unit;
the sampling execution unit samples the graph data set by a biased random jump method to obtain sample graph data, and an iterative algorithm is executed on the sample graph data set to capture the offline characteristics of the sample graph iterative operation;
the static similarity calculation unit calculates the static similarity between the current graph iteration operation and the historical graph iteration operation through the offline features captured by the sampling execution unit;
the dynamic similarity calculation unit calculates the dynamic similarity between the current graph iteration operation and the historical graph iteration operation through the statistical information captured in the real execution of the graph iteration operation;
the similarity management module sends the calculated static similarity and dynamic similarity to the job matching module;
the operation matching module comprises a similarity evaluation unit, a training point collection unit and a parameter training unit;
the similarity evaluation unit constructs a similarity evaluation function, evaluates the effectiveness of the static similarity and the dynamic similarity sent by the similarity management module, selects the matched historical graph iteration operation according to the effectiveness evaluation result, and sends the selected historical graph iteration operation to the running time prediction module;
the training point collecting unit is used for collecting training data points for training the parameters of the similarity evaluation function;
the parameter training unit utilizes the training data points to train to obtain parameters of a similarity evaluation function and sends the parameters of the similarity evaluation function to the similarity evaluation unit;
and the running time prediction module predicts the residual running time of the current graph iteration operation through the historical graph iteration operation matched by the operation matching module.
2. The system according to claim 1, wherein the static similarity calculation unit calculates the static similarity between the current graph iteration job Job_run and the historical graph iteration job Job_comp, the static similarity comprising a static computation similarity and a static message passing similarity.
3. The system of claim 2, wherein the static computation similarity is:
Figure FDA0003195634920000021
wherein S_1 is the static computation similarity between the current graph iteration job Job_run and the historical graph iteration job Job_comp; the active node proportion of the graph iteration job Job in the i-th iteration is
Figure FDA0003195634920000022
wherein ActVert_i(Job) represents the number of active nodes of the graph iteration job Job in the i-th iteration, TotVert_i(Job) represents the total number of nodes of the graph iteration job Job in the i-th iteration, and N represents the total number of iterations in the sampling execution process; the graph iteration job Job comprises the current graph iteration job Job_run and the historical graph iteration job Job_comp;
the static message passing similarity is:
Figure FDA0003195634920000023
wherein S_2 represents the static message passing similarity between the current graph iteration job Job_run and the historical graph iteration job Job_comp; the total amount of messages passed by the graph iteration job Job in the i-th iteration is
Figure FDA0003195634920000024
wherein RemMsg_i(Job) represents the number of messages passed by the graph iteration job Job in the i-th iteration,
Figure FDA0003195634920000025
represents the average message size of the graph iteration job Job in the i-th iteration, and N represents the total number of iterations in the sampling execution process.
4. The system of claim 3, wherein the dynamic similarity calculation unit calculates the dynamic similarity between the current graph iteration job Job_run and the historical graph iteration job Job_comp, the dynamic similarity comprising a runtime similarity S_3, a resource allocation similarity S_4, a convergence similarity S_5 and a resource utilization similarity S_6;
the runtime similarity S_3 is:
Figure FDA0003195634920000026
wherein Step represents the current iteration number of the job, and runtime(Job, i) represents the running time of the graph iteration job Job in the i-th iteration; the graph iteration job Job comprises the current graph iteration job Job_run and the historical graph iteration job Job_comp;
the resource allocation similarity S_4 is:
Figure FDA0003195634920000031
wherein Step represents the current iteration number of the job, and cons_avg(Job, Step) represents the average number of containers used by the graph iteration job Job over all iterations from the start of execution to the current iteration; the graph iteration job Job comprises the current graph iteration job Job_run and the historical graph iteration job Job_comp;
the convergence similarity S_5 is:
Figure FDA0003195634920000032
wherein workSet(Job, i) represents the number of data items actively processed by the graph iteration job Job in the i-th iteration;
the resource utilization similarity S_6 is:
Figure FDA0003195634920000033
wherein cons is the number of containers used by the graph iteration job; RsUtil_job[hw, Step, i] represents the average utilization of hardware device hw in container i when the graph iteration job has executed up to the Step-th iteration.
5. The system according to any one of claims 1 to 4, wherein the similarity evaluation function constructed by the similarity evaluation unit is specifically:
average optimal precision function h (t) and optimal similarity ratio function n (t);
h(t):=avg{acc|(DoS,acc)∈P,DoS≥t};
n(t):=|{acc|(DoS,acc)∈P,DoS≥t}|/|P|;
wherein acc is the prediction accuracy of the run-time prediction system; DoS is the overall similarity of the two graph iteration operations, computed as the weighted average of the individual similarities, whose weights form a weight vector; t is a similarity threshold; P is a simulation point set consisting of simulation points (DoS, acc);
wherein the weight vector and the similarity threshold are parameters of the similarity evaluation function.
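The two evaluation functions translate almost directly into code. The Python sketch below follows the definitions of h(t) and n(t) given above; the helper `overall_similarity`, the normalization of the weighted average by the sum of the weights, the `None` return for an empty selection and the sample point set are assumptions added for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (DoS, acc)


def overall_similarity(sims: Sequence[float], weights: Sequence[float]) -> float:
    """DoS as the weighted average of the individual similarities."""
    if not sims or len(sims) != len(weights) or sum(weights) <= 0:
        raise ValueError("need matching similarities and positive total weight")
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


def h(points: List[Point], t: float) -> Optional[float]:
    """Average accuracy over points whose DoS reaches the threshold t."""
    accs = [acc for dos, acc in points if dos >= t]
    return sum(accs) / len(accs) if accs else None


def n(points: List[Point], t: float) -> float:
    """Fraction of points in P whose DoS reaches the threshold t."""
    if not points:
        raise ValueError("point set P must not be empty")
    return sum(1 for dos, _ in points if dos >= t) / len(points)


P = [(0.9, 0.95), (0.8, 0.85), (0.5, 0.60), (0.3, 0.40)]
avg_acc, ratio = h(P, 0.8), n(P, 0.8)
```

Raising t tends to increase h(t) while shrinking n(t), which is why both values are checked when the threshold and the weight vector are trained.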
6. The system of claim 5, wherein the training point collection unit first performs n successive draws without repetition from the set of historical graph iteration jobs, drawing 2 graph iteration jobs each time, namely a first graph iteration job J_i and a second graph iteration job K_i, i = 1, 2, 3, ..., n;
the extracted first graph iteration job J_i is treated as running and as having just completed its m-th iteration, m ∈ [1, m_i], where m_i is the maximum number of iterations of the graph iteration job J_i;
the extracted second graph iteration job K_i serves as the historical graph iteration job matched with J_i; based on the similarity between the 2 extracted graph iteration jobs, the running time prediction module predicts the remaining run time of the running graph iteration job J_i, the real remaining run time of the first graph iteration job J_i being known; by using different current graph iteration jobs and matching historical graph iteration jobs, and by assuming that the current graph iteration job is at different iteration counts, a set P(w) of elements of the form (predict(w), actual) is calculated as training data points; wherein predict(w) is the predicted remaining execution time when w is taken as the weight vector, and actual is the real remaining execution time of the current graph iteration job;
the parameter training unit finds, for the currently running graph iteration job, a suitable similarity threshold t and similarity weight vector w = (w_1, ..., w_6) such that the values of the similarity evaluation functions h(t) and n(t) in the similarity evaluation unit are larger than a set threshold, so as to ensure that the number of matching results of the current graph iteration job is larger than the set threshold.
7. The system of claim 6, wherein the runtime prediction module comprises a single prediction unit and a final prediction unit;
the single prediction unit predicts the running time of the current graph iteration operation based on the matched single historical operation to obtain a series of single prediction estimated values, and a prediction time estimation set is formed and output to the final prediction unit;
the final prediction unit is used for merging all estimated values in the prediction estimation set into a final prediction value by weighted averaging; the weights are the overall similarity values obtained for each historical graph iteration operation when matching is performed.
8. A method for predicting the running time of a graph-oriented iterative operation in a Gaia system, characterized in that the prediction process is as follows:
step 1: acquiring a historical graph iteration operation set, an initially input data set, graph iteration operation to be predicted, operation parameters and an iteration termination condition of the graph iteration operation;
step 2: the sampling execution unit samples the input data set at a sampling ratio of 10%, then scales the convergence parameters of the graph iteration operation to be predicted according to the sampling ratio, and initializes the parameters after scaling;
step 3: setting the data source of the graph iteration operation to be predicted to the sampled sample set and the initial iteration count to 0, then executing the graph iteration operation to be predicted, and storing the statistical information captured in each iteration into an in-memory database;
step 4: calculating the static similarity between the graph iteration operation to be predicted and each operation in the historical graph iteration operation set according to the statistical information captured in step 3;
step 5: re-acquiring the operation parameters and iteration termination condition of the graph iteration operation to be predicted, and initializing the parameters, the static similarity and the dynamic similarity; setting the data source of the graph iteration operation to be predicted to the initial input data set, and setting the iteration count to 0;
step 6: when the iteration count is 0, outputting the historical graph iteration operations matched with the graph iteration operation to be predicted according to the static similarity calculated in step 4, and then calculating, by the running time prediction module, a static prediction value of the running time of the graph iteration operation to be predicted according to the matched historical graph iteration operations;
step 7: running the graph iteration operation to be predicted and iteratively computing on the input data to obtain the dynamic similarity between the graph iteration operation to be predicted and each graph iteration operation in the historical graph iteration operation set;
step 8: the operation matching module outputs the historical graph iteration operations matched with the graph iteration operation to be predicted according to the static similarity calculated in step 4 and the dynamic similarity calculated in step 7, and a dynamic prediction value of the running time of the graph iteration operation to be predicted is then calculated according to the matched historical graph iteration operations;
step 9: judging whether the iteration count or the iteration output meets the iteration termination condition; if not, repeating step 7, and if so, executing step 10;
step 10: calculating the prediction deviation of the running time of the graph iteration operation to be predicted at different iterations, the dynamic prediction value with the minimum prediction deviation being the final prediction value of the running time of the graph iteration operation to be predicted.
9. The method of claim 8, wherein step 7 is performed using steps 7.1-7.4 as follows:
step 7.1: adding 1 to the iteration times;
step 7.2: executing iterative step functions on each node of the Gaia cluster in parallel, and synchronizing data on each node in a broadcasting mode after the execution of each node is finished, so that the consistency of data on different nodes is ensured;
step 7.3: after the data synchronization is finished, adding the statistical information captured in the iteration process to a memory database for updating;
step 7.4: calculating the dynamic similarity between the graph iteration operation to be predicted and each graph iteration operation in the historical graph iteration operation set according to the statistical information updated in step 7.3.
CN202110890134.5A 2021-08-04 2021-08-04 Run-time prediction system and method for graph-oriented iterative operation in Gaia system Pending CN113627664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110890134.5A CN113627664A (en) 2021-08-04 2021-08-04 Run-time prediction system and method for graph-oriented iterative operation in Gaia system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110890134.5A CN113627664A (en) 2021-08-04 2021-08-04 Run-time prediction system and method for graph-oriented iterative operation in Gaia system

Publications (1)

Publication Number Publication Date
CN113627664A true CN113627664A (en) 2021-11-09

Family

ID=78382536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110890134.5A Pending CN113627664A (en) 2021-08-04 2021-08-04 Run-time prediction system and method for graph-oriented iterative operation in Gaia system

Country Status (1)

Country Link
CN (1) CN113627664A (en)

Similar Documents

Publication Publication Date Title
Alipourfard et al. CherryPick: Adaptively unearthing the best cloud configurations for big data analytics
CN104317658B (en) A kind of loaded self-adaptive method for scheduling task based on MapReduce
CN103605662B (en) Distributed computation frame parameter optimizing method, device and system
CN114008594A (en) Scheduling operations on a computational graph
Alam et al. Hierarchical PSO clustering based recommender system
Sokolinsky et al. Methods of resource management in problem-oriented computing environment
US20220261696A1 (en) Recommmender system for adaptive computation pipelines in cyber-manufacturing computational services
Ulanov et al. Modeling scalability of distributed machine learning
Kaedi et al. Biasing Bayesian optimization algorithm using case based reasoning
Li Parallel nonconvex generalized Benders decomposition for natural gas production network planning under uncertainty
Raju et al. Hybrid ant colony optimization and cuckoo search algorithm for job scheduling
Jeon et al. Intelligent resource scaling for container based digital twin simulation of consumer electronics
Nasr et al. Task scheduling algorithm for high performance heterogeneous distributed computing systems
CN116974249A (en) Flexible job shop scheduling method and flexible job shop scheduling device
CN113627664A (en) Run-time prediction system and method for graph-oriented iterative operation in Gaia system
CN112149826B (en) Profile graph-based optimization method in deep neural network inference calculation
Wang et al. Multi-Installment Scheduling for Large-Scale Workload Computation with Result Retrieval
Sharifi et al. Modeling real-time application processor scheduling for fog computing
CN114091686A (en) Data processing method and device, electronic equipment and storage medium
Banalagay et al. Resource estimation in high performance medical image computing
Du et al. OctopusKing: A TCT-aware task scheduling on spark platform
Rumyantsev Stabilization of a high performance cluster model
Liu et al. Cloud Configuration Optimization for Recurring Batch-Processing Applications
Cao et al. A fault-tolerant workflow mapping algorithm under end-to-end delay constraint
Banerjee et al. Offloading work to mobile devices: An availability-aware data partitioning approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination