CN106021495B - Task parameter optimization method for a distributed iterative computing system

Info

Publication number: CN106021495B
Application number: CN201610341201.7A
Authority: CN
Inventors: 王建民; 龙明盛; 陈侨安; 黄向东
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2016-05-20
Filing date: 2016-05-20
Publication date: 2017-10-31
Anticipated expiration: 2036-05-20
Also published as: CN106021495A

Abstract

The invention relates to a task parameter optimization method for a distributed iterative computing system and belongs to the technical field of distributed data processing. The method first collects the running data of historical tasks in the distributed iterative computing system and builds a historical database. When task parameters are to be optimized, running data in the historical database that is clearly irrelevant is removed in a first filtering pass based on constraint conditions. Directed acyclic graph similarity is then computed between the running data in the historical database that corresponds to the task to be optimized and the running data that survived the first pass, and running data whose similarity falls below a set threshold is removed in a second filtering pass. Finally, the results of the two filtering passes are scored and sorted, and the task parameters corresponding to the sorted running data are returned as the task parameter optimization result. The invention optimizes the task parameters of a distributed iterative computing system automatically, is a plug-and-play, self-adaptive tuning method, and can significantly lower the barrier to using such a system.

Description

Task Parameter Optimization Method for a Distributed Iterative Computing System

Technical Field

The invention belongs to the technical field of distributed data processing, and in particular relates to a task parameter optimization method for a distributed iterative computing system.

Background

Processing large-scale data sets with distributed iterative computing systems has become the mainstream approach to data processing. Compared with traditional single-machine solutions, popular and widely used distributed iterative computing systems such as Apache Spark partition the data across multiple machines, which greatly increases the scale of data that can be processed. In addition, because multiple machines take part in the processing, the degree of parallelism increases and large-scale data is processed faster.

Despite these advantages, a task in a distributed iterative computing system needs reasonable task parameters to run properly. Unreasonable task parameters slow the task down. Reasonable task parameters increase the parallelism of data processing, reduce network transfer overhead and reduce scheduling overhead, and therefore speed the task up. A distributed iterative computing system involves dozens of task parameters, and the relationships among them are intricate. Configuring task parameters imposes extra work on developers, and manually chosen task parameters do not necessarily yield good runtime performance.

Because the task parameters are numerous and hard to configure well, the question arises whether the task parameters of a distributed iterative computing system can be optimized. At present, task parameter optimization in such systems relies mainly on the experience of engineers. This approach is too subjective: experienced engineers can often arrive at good task parameters, whereas inexperienced engineers cannot.

Summary of the Invention

The purpose of the invention is to propose a task parameter optimization method for a distributed iterative computing system, addressing the problem that existing distributed iterative computing systems have many task parameters that are hard to configure well. The invention optimizes the task parameters of a distributed iterative computing system automatically, is a plug-and-play, self-adaptive tuning method, and can significantly lower the barrier to using such a system.

The task parameter optimization method proposed by the invention first collects the running data of historical tasks in the distributed iterative computing system and builds a historical database. When task parameters are optimized, running data in the historical database that is clearly irrelevant is removed in a first filtering pass based on constraint conditions. Directed acyclic graph similarity is then computed between the running data in the historical database that corresponds to the task to be optimized and the running data that survived the first pass, and running data whose similarity is below a set threshold is removed in a second filtering pass. Finally, the results of the two filtering passes are scored and sorted, and the task parameters corresponding to the sorted running data are used as the task parameter optimization result. The method specifically includes the following steps:

(1) Obtain the running data of each historical task from the distributed iterative computing system and save it to the historical database; each entry in the historical database represents the running data of one historical task;

(2) At the user's request, optimize the task parameters of a task in the distributed iterative computing system; let J_src denote the running data of the historical task found in the historical database that is identical to this task;

(3) From the historical database, select the historical task running data that satisfies all hardware resource constraints to form the data set S_hardware;

(4) From all running data in S_hardware obtained in step (3), select the running data whose total input data size differs, in relative terms, from the total input data size of J_src obtained in step (2) by less than the set input data size difference threshold, forming the data set S_datasize;

(5) From all running data in S_datasize obtained in step (4), select the running data whose directed acyclic graph is close in scale to the directed acyclic graph of J_src, forming the data set S_dag;

(6) Compute the similarity between the directed acyclic graph of each item of running data in S_dag obtained in step (5) and the directed acyclic graph of J_src, and set a similarity threshold;

(7) Traverse the results of step (6) and discard from S_dag the running data whose directed acyclic graph similarity to the directed acyclic graph of J_src is below the set similarity threshold; let S_sim denote the data set formed by the remaining running data;

(8) Sort the running data in S_sim obtained in step (7) from high to low by the value that the ranking formula yields for each item, and keep only the first n items of the sorted result, where n is a positive integer; in the formula, time_dst denotes the running time of J_dst; let S_rank denote the data set formed by the sorted results;

(9) Display the task parameters of each item of running data in S_rank obtained in step (8) to the user on a graphical display interface; the task parameter optimization flow then ends;

(10) When the user again requests optimization of a task of the distributed iterative computing system, return to step (2).
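The constraint-based filtering of steps (3) to (5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `RunRecord` fields, the function names, and the definition of relative difference (absolute difference divided by the value for J_src) are all hypothetical, since the text only speaks of a "relative difference in value". A single threshold is used for every field for brevity, with the default of 0.3 mirroring the 30% thresholds of the embodiment; the method itself allows a separate threshold per constraint.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One historical task run; all field names are illustrative assumptions."""
    task_params: dict
    total_memory_mb: float
    cpu_cores: int
    machine_nodes: int
    input_size_mb: float
    dag_nodes: int
    dag_edges: int


def relative_difference(value: float, reference: float) -> float:
    """Assumed definition: |value - reference| / |reference|."""
    if reference == 0:
        return 0.0 if value == 0 else float("inf")
    return abs(value - reference) / abs(reference)


def constraint_filter(history, src, threshold=0.3):
    """Keep records that are close to src on every constrained field.

    Applying steps (3), (4) and (5) one after another is equivalent to
    requiring all of their conditions at once, which is what this does.
    """
    fields = (
        "total_memory_mb", "cpu_cores", "machine_nodes",  # step (3)
        "input_size_mb",                                  # step (4)
        "dag_nodes", "dag_edges",                         # step (5)
    )
    return [
        rec for rec in history
        if all(
            relative_difference(getattr(rec, f), getattr(src, f)) < threshold
            for f in fields
        )
    ]
```

The list returned by `constraint_filter` plays the role of S_dag, the input to the similarity computation of step (6).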

The features and beneficial effects of the task parameter optimization method proposed by the invention are:

1. The method lets the computer take over task parameter optimization in the distributed iterative computing system, reducing the user's workload. Even when the user is unfamiliar with the system, the method provides reasonably effective task parameters, which eases the burden of using the system.

2. The method combines empirical rules of system optimization with an optimization approach based on similarity search, which improves the reliability and usability of task parameter optimization.

3. The method adapts to changes in the system and is therefore a self-adaptive tuning method. While the distributed iterative computing system runs, the method continuously collects the running data the system produces, so the historical database grows and covers more and more task categories. As the amount of data increases, the task parameter optimization results improve over the lifetime of the system.

4. The method requires no modification of the existing distributed iterative computing system and is therefore plug-and-play.

Brief Description of the Drawings

Fig. 1 is the overall flowchart of the task parameter optimization method for a distributed iterative computing system proposed by the invention.

Detailed Description

The invention proposes a task parameter optimization method for a distributed iterative computing system, which is described in further detail below with reference to the drawing and a specific embodiment.

The overall flow of the method is shown in Fig. 1. The method first collects the running data of historical tasks in the distributed iterative computing system and builds a historical database. When task parameters are optimized, running data in the historical database that is clearly irrelevant is removed in a first filtering pass based on constraint conditions. Directed acyclic graph similarity is then computed between the running data in the historical database that corresponds to the task to be optimized and the running data that survived the first pass, and running data whose similarity is below a set threshold is removed in a second filtering pass. Finally, the results of the two filtering passes are scored and sorted, and the task parameters corresponding to the sorted running data are used as the task parameter optimization result. The method specifically includes the following steps:

(1) Obtain the running data of each historical task from the distributed iterative computing system. The running data of one historical task comprises the task parameters, hardware resource information (total memory, number of available CPU cores and number of machine nodes), the total input data size, and the corresponding directed acyclic graph. (During execution a task is divided into multiple subtasks, and the directed acyclic graph reflects the dependencies among them: the nodes of the graph represent subtasks, the edges represent the execution order between subtasks, and the label on each node is the specific name of the subtask.) Then save the running data of each historical task to the historical database; each entry in the historical database represents the running data of one historical task;

(2) At the user's request, optimize the task parameters of a task in the distributed iterative computing system; let J_src denote the running data of the historical task found in the historical database that is identical to this task;

(3) From the historical database, select the historical task running data that satisfies all hardware resource constraints to form the data set S_hardware. The constraints are: the total memory of the running data differs, in relative terms, from the total memory of J_src by less than the set memory difference threshold (30% in this embodiment); the number of available CPU cores of the running data differs, in relative terms, from that of J_src by less than the set core count difference threshold (30% in this embodiment); and the number of machine nodes of the running data differs, in relative terms, from that of J_src by less than the set machine node count difference threshold (30% in this embodiment);

(4) From all running data in S_hardware obtained in step (3), select the running data whose total input data size (in megabytes) differs, in relative terms, from the total input data size of J_src obtained in step (2) by less than the set input data size difference threshold (30% in this embodiment), forming the data set S_datasize;

(5) From all running data in S_datasize obtained in step (4), select the running data whose directed acyclic graph is close in scale to the directed acyclic graph of J_src, forming the data set S_dag. Two directed acyclic graphs are close in scale when both of the following conditions hold: first, their node counts differ, in relative terms, by less than the set node count difference threshold (30% in this embodiment); second, their edge counts differ, in relative terms, by less than the set edge count difference threshold (30% in this embodiment);

(6) Compute the similarity between the directed acyclic graph of each item of running data in S_dag obtained in step (5) and the directed acyclic graph of J_src, and set a similarity threshold. The computation proceeds as follows:

(6-1) Let J_dst be any item of running data in S_dag, and compute the similarity between the directed acyclic graph of J_dst and that of J_src. Define the directed acyclic graph of J_src as G_src = (N_src, E_src, L_src), where N_src is the set of nodes in G_src, E_src is the set of edges in G_src, and L_src is the set of labels on the nodes of G_src. Define the directed acyclic graph of J_dst as G_dst = (N_dst, E_dst, L_dst), where N_dst is the set of nodes in G_dst, E_dst is the set of edges in G_dst, and L_dst is the set of labels on the nodes of G_dst;

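(6-2) The similarity between the directed acyclic graphs of J_dst and J_src is defined by the following formula, given here as it appears in claim 3:

$$\mathrm{sim}(G_{src}, G_{dst}) = \mathrm{skipN}(G_{src}, G_{dst}) + \mathrm{skipE}(G_{src}, G_{dst}) + 2 \sum_{\substack{n_{src} \in N_{src},\\ n_{dst} \in N_{dst}}} \frac{\mathrm{edit}(l_{src}, l_{dst})}{\max(|l_{src}|, |l_{dst}|)}$$

where |l_src| and |l_dst| denote the string lengths of the labels l_src and l_dst.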

In the formula, sim(G_src, G_dst) denotes the similarity between the directed acyclic graph of J_dst and that of J_src, with values in the range [0, 1];

skipN(G_src, G_dst) denotes the total number of nodes added to or deleted from G_src and G_dst in the process of making G_src and G_dst equal;

skipE(G_src, G_dst) denotes the total number of edges added to or deleted from G_src and G_dst in the process of making G_src and G_dst equal;

n_src and n_dst denote any node in the directed acyclic graph G_src and any node in the directed acyclic graph G_dst, respectively;

l_src and l_dst denote the labels of nodes n_src and n_dst, respectively;

edit(l_src, l_dst) denotes the edit distance between the labels l_src and l_dst, that is, the minimum number of edit operations needed to transform l_src into l_dst (the permitted edit operations are: replacing one character of l_src with another character, deleting one character from l_src, and inserting one character into l_src);

(6-3) Repeat steps (6-1) to (6-2) to obtain the similarity between the directed acyclic graph of each item of running data in S_dag and the directed acyclic graph of J_src;

(7) Traverse the results of step (6) and discard from S_dag the running data whose directed acyclic graph similarity to the directed acyclic graph of J_src is below the set similarity threshold (0.3 in this embodiment); let S_sim denote the data set formed by the remaining running data;

(8) Sort the running data in S_sim obtained in step (7) from high to low by the value that the ranking formula yields for each item, and keep only the first n items of the sorted result, where n is a positive integer (the number of results to keep is chosen according to the actual situation; this embodiment keeps the first 10). In the formula, time_dst denotes the running time of J_dst. The formula takes two factors into account: the similarity between J_dst and J_src on their directed acyclic graphs, and the running time of the historical task corresponding to J_dst. Under this metric, the higher the directed acyclic graph similarity, or the shorter the running time of the historical task corresponding to J_dst, the earlier the running data appears in the sorted result. Let S_rank denote the data set formed by the sorted results;
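The label comparison used inside the similarity computation, and the threshold-and-rank logic of steps (7) and (8), can be sketched in Python as follows. This is a hedged illustration, not the patented implementation. `edit_distance` is a standard Levenshtein distance with the three edit operations listed in step (6-2), and `label_distance` is the per-label term edit(l_src, l_dst) / max(|l_src|, |l_dst|). The ranking score `similarity / run_time` is an assumption made only for this sketch: the ranking formula is not reproduced in the text above, which states only that a higher similarity or a shorter running time moves an item up. All function and parameter names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, deletions and insertions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def label_distance(l_src: str, l_dst: str) -> float:
    """edit(l_src, l_dst) / max(|l_src|, |l_dst|); 0.0 when both labels are empty."""
    longest = max(len(l_src), len(l_dst))
    return edit_distance(l_src, l_dst) / longest if longest else 0.0


def rank_candidates(candidates, top_n=10, sim_threshold=0.3):
    """Steps (7) and (8): drop low-similarity runs, return the top_n parameter sets.

    candidates: iterable of (task_params, similarity, run_time) tuples,
    with run_time > 0.
    ASSUMPTION: the score is similarity / run_time, chosen for illustration.
    """
    kept = [c for c in candidates if c[1] >= sim_threshold]
    kept.sort(key=lambda c: c[1] / c[2], reverse=True)
    return [params for params, _, _ in kept[:top_n]]
```

The defaults `top_n=10` and `sim_threshold=0.3` mirror the values used in this embodiment; both are meant to be tuned per deployment.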

Claims

1. A method for optimizing task parameters in a distributed iterative computing system, characterized in that the method first collects the running data of historical tasks in the distributed iterative computing system and builds a historical database; when task parameters are optimized, running data in the historical database that is clearly irrelevant is filtered out in a first pass according to constraint conditions; directed acyclic graph similarity is then computed between the running data in the historical database that corresponds to the task to be optimized and the running data remaining after the first pass, and running data whose similarity is below a set threshold is filtered out in a second pass; finally, the results of the two filtering passes are scored and sorted, and the task parameters corresponding to the sorted running data are used as the task parameter optimization result.

2. The method according to claim 1, characterized in that the method specifically comprises the following steps:

(1) obtaining the running data of each historical task from the distributed iterative computing system and saving it to the historical database, each entry in the historical database representing the running data of one historical task;

(2) at the user's request, optimizing the task parameters of a task in the distributed iterative computing system, and letting J_src denote the running data of the historical task found in the historical database that is identical to this task;

(3) selecting from the historical database the historical task running data that satisfies all hardware resource constraints to form the data set S_hardware;

(4) selecting, from all running data in S_hardware obtained in step (3), the running data whose total input data size differs, in relative terms, from the total input data size of J_src obtained in step (2) by less than the set input data size difference threshold, to form the data set S_datasize;

(5) selecting, from all running data in S_datasize obtained in step (4), the running data whose directed acyclic graph is close in scale to the directed acyclic graph of J_src, to form the data set S_dag;

(6) computing the similarity between the directed acyclic graph of each item of running data in S_dag obtained in step (5) and the directed acyclic graph of J_src, and setting a similarity threshold;

(7) traversing the results of step (6) and discarding from S_dag the running data whose directed acyclic graph similarity to the directed acyclic graph of J_src is below the set similarity threshold, and letting S_sim denote the data set formed by the remaining running data;

(8) sorting the running data in S_sim obtained in step (7) from high to low by the value that the ranking formula yields for each item, and keeping only the first n items of the sorted result, n being a positive integer; in the formula, the directed acyclic graph of J_src is defined as G_src; any item of running data in S_dag is denoted J_dst, and the directed acyclic graph of J_dst is defined as G_dst; time_dst denotes the running time of J_dst; sim(G_src, G_dst) denotes the similarity between the directed acyclic graph of J_dst and that of J_src; and S_rank denotes the data set formed by the sorted results;

(9) displaying the task parameters of each item of running data in S_rank obtained in step (8) to the user on a graphical display interface, whereupon the task parameter optimization flow ends;

(10) when the user again requests optimization of a task of the distributed iterative computing system, returning to step (2).

3. The method according to claim 2, characterized in that, in step (6), the similarity between the directed acyclic graph of each item of running data in S_dag and the directed acyclic graph of J_src is computed as follows:

(6-1) letting J_dst be any item of running data in S_dag, and computing the similarity between the directed acyclic graph of J_dst and that of J_src; defining the directed acyclic graph of J_src as G_src = (N_src, E_src, L_src), where N_src is the set of nodes in G_src, E_src is the set of edges in G_src, and L_src is the set of labels on the nodes of G_src; defining the directed acyclic graph of J_dst as G_dst = (N_dst, E_dst, L_dst), where N_dst is the set of nodes in G_dst, E_dst is the set of edges in G_dst, and L_dst is the set of labels on the nodes of G_dst;

(6-2) the similarity between the directed acyclic graphs of J_dst and J_src being defined by the following formula:

$$\mathrm{sim}(G_{src}, G_{dst}) = \mathrm{skipN}(G_{src}, G_{dst}) + \mathrm{skipE}(G_{src}, G_{dst}) + 2 \sum_{\substack{n_{src} \in N_{src},\\ n_{dst} \in N_{dst}}} \frac{\mathrm{edit}(l_{src}, l_{dst})}{\max(|l_{src}|, |l_{dst}|)}$$

In the formula, sim(G_src, G_dst) denotes the similarity between the directed acyclic graph of J_dst and that of J_src, with values in the range [0, 1];

skipN(G_src, G_dst) denotes the total number of nodes added to or deleted from G_src and G_dst in the process of making G_src and G_dst equal;

skipE(G_src, G_dst) denotes the total number of edges added to or deleted from G_src and G_dst in the process of making G_src and G_dst equal;

n_src and n_dst denote any node in the directed acyclic graph G_src and any node in the directed acyclic graph G_dst, respectively;

l_src and l_dst denote the labels of nodes n_src and n_dst, respectively;

edit(l_src, l_dst) denotes the edit distance between the labels l_src and l_dst, that is, the minimum number of edit operations needed to transform label l_src into label l_dst;

|l_src| and |l_dst| denote the string lengths of label l_src and label l_dst, respectively;

(6-3) repeating steps (6-1) to (6-2) to obtain the similarity between the directed acyclic graph of each item of running data in S_dag and the directed acyclic graph of J_src.

4. The method according to claim 2, characterized in that the hardware resource constraints of step (3) comprise: the total memory of the running data in the historical database differs, in relative terms, from the total memory of J_src by less than the set memory difference threshold; the number of available CPU cores of the running data differs, in relative terms, from the number of available CPU cores of J_src by less than the set core count difference threshold; and the number of machine nodes of the running data differs, in relative terms, from the number of machine nodes of J_src by less than the set machine node count difference threshold.

5. The method according to claim 2, characterized in that two directed acyclic graphs being close in scale in step (5) comprises the following two conditions: first, the node counts of the two directed acyclic graphs differ, in relative terms, by less than the set directed acyclic graph node count difference threshold; second, the edge counts of the two directed acyclic graphs differ, in relative terms, by less than the set directed acyclic graph edge count difference threshold.