CN112559287B - Optimization method and device for task flow of data center station

Info

Publication number: CN112559287B
Application number: CN202011448500.3A
Authority: CN
Inventors: 姜水琴; 路平; 张敬谊; 胡杉文; 王维任; 袁峰; 张鑫金; 方幸
Original assignee: WONDERS INFORMATION CO Ltd
Current assignee: WONDERS INFORMATION CO Ltd
Priority date: 2020-12-11
Filing date: 2020-12-11
Publication date: 2024-08-06
Anticipated expiration: 2040-12-11
Also published as: CN112559287A

Abstract

The invention aims to optimize the task flow of a data center station. To this end, it provides a method for optimizing a task flow of a data center station, together with a device that runs the method. The invention monitors the execution of the task flow and raises alarms, so the task flow is effectively supervised. When the task flow is abnormal, the method not only raises an alarm but also accurately locates the key abnormal nodes of the task flow, predicts the execution time of the optimized task flow, and judges whether the optimized task flow can be completed by a preset time, thereby optimizing the task flow. The invention can therefore optimize the task flow of the data center station and improve its execution efficiency, stability, and reliability.

Description

Optimization method and device for task flow of data center station

Technical Field

The invention relates to a method and a device for optimizing the task flow of a data center station, and belongs to the technical field of data center stations.

Background

The data center station creates multi-source heterogeneous data, manages and governs enterprise data in a unified way, supports enterprise business, provides efficient services to clients, and accumulates the enterprise's business and data. It reduces the cost of repeated construction and helps an enterprise keep its differentiated competitive advantage. A stable and efficient data center station has become infrastructure for enterprise strategy.

A series of task flows run on the data center station. A task flow usually consists of many tasks of different types, such as SQL (Structured Query Language), script, and ETL (extract, transform, load) tasks. The task types are varied, the code for different task types is developed collaboratively by many people, and the tasks depend on different operating environments and resources. Manually configured resources rarely match the tasks' actual resource demands, so operating efficiency varies widely and resource utilization is low. Task dependencies in a task flow are complex: downstream tasks often depend on the successful execution of upstream tasks; a delayed upstream task slows the whole task flow, making it hard to finish by the preset time; and dependencies also exist across scheduling periods, for example when the current execution of a task depends on the result of its previous scheduled execution. The differences in operating efficiency and the complex dependencies often cause abnormalities such as congestion in the task flow, so the task flows of the data center station execute inefficiently.

The data center station executes tasks serially and in parallel according to a directed acyclic graph. When a task is abnormal, and especially when an upstream task fails, its alarm is raised together with those of the downstream tasks, so the key abnormal node is hard to locate quickly. While tasks run, the task flow lacks effective supervision, its execution time cannot be predicted, and it is hard to finish by the preset time, which lowers the stability and reliability of the data center station. A method for optimizing the task flow of the data center station is therefore needed.

Disclosure of Invention

The purpose of the invention is to optimize the task flow of a data center station.

In order to achieve the above object, a technical solution of the present invention is to provide a method for optimizing a task flow of a data center station, which is characterized by comprising the following steps:

Step S1: reading at least the task names of all single tasks on which the target task flow depends and the dependency relationship table among those single tasks;

Step S2: monitoring the execution result of the target task flow and, during execution of the target task flow, judging its execution result, wherein the possible results are task completion, task error, and task timeout; when the execution result is task error or task timeout, judging the execution result of the target task flow to be abnormal and entering step S3; when the execution result is task completion, recording the execution duration of the single tasks on which the target task flow depends and returning to step S2;

Step S3: for a target task flow whose execution result is abnormal, calculating the critical path of the target task flow and the longest execution duration of the whole target task flow according to the execution durations of the single tasks on which the target task flow depends, comprising the following steps:

Step S301: establishing a directed acyclic graph according to the dependency relationships among the single tasks on which the task flow depends and the execution duration of each single task;

Step S302: calculating the critical path of the directed acyclic graph as the critical path of the target task flow, the execution duration of the critical path of the directed acyclic graph being the longest execution duration of the whole target task flow;

Step S4: determining the key abnormal node of the target task flow according to the critical path and the longest execution duration of the whole target task flow, a single task being defined as a node, wherein in step S4:

when the execution result is task error, finding the key fault node, that is, the single task in error, issuing an error alarm, and sending the critical path of the target task flow and the key fault node to the user;

when the execution result is task timeout, finding the key timeout node, that is, the single task with the longest execution duration in the key node set CPL, issuing a timeout alarm, and sending the critical path of the task flow and the key timeout node to the user;

Step S5: forming an optimized task flow according to the key abnormal node and predicting the longest execution duration of the optimized task flow, wherein:

the key abnormal node is the key fault node or the key timeout node obtained in step S4;

predicting the longest execution duration of the optimized task flow comprises: constructing a mathematical model from the historical execution records in the log, predicting with the model the execution duration of each single task on which the optimized task flow depends, and calculating the predicted execution duration of the task flow from the predicted execution durations of the single tasks;

Step S6: judging, according to the predicted longest execution duration of the optimized task flow, whether the optimized task flow can be completed by the preset time; if not, issuing an alarm; if so, continuing to monitor the execution state of the task flow and recording its execution durations.

Preferably, in step S2 and step S3, the execution duration of the single task is calculated and obtained according to the start time and the end time of the single task recorded in the log.
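
The duration calculation described above can be sketched as follows. This is a minimal illustration only: the patent does not specify the log format, so the record layout, the ISO-8601 timestamps, and the function name execution_duration_seconds are all assumptions.

```python
from datetime import datetime


def execution_duration_seconds(start: str, end: str) -> float:
    """Return end - start in seconds for ISO-8601 timestamps (assumed log format)."""
    started = datetime.fromisoformat(start)
    ended = datetime.fromisoformat(end)
    if ended < started:
        raise ValueError("end time precedes start time")
    return (ended - started).total_seconds()


# Hypothetical log record for one single task.
record = {"task": "SQL_3", "start": "2020-12-11T01:00:00", "end": "2020-12-11T01:11:00"}
duration = execution_duration_seconds(record["start"], record["end"])
```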

Preferably, the directed acyclic graph is denoted G = (V, E), and establishing the directed acyclic graph G = (V, E) in step S301 further includes the steps of:

Defining each event in the target task flow as a vertex of the directed acyclic graph G, wherein the i-th vertex of G is denoted v_i, and the vertices corresponding to all n events in the target task flow form the vertex set V, V = {v₁, v₂, ..., v_n};

Defining each single task on which the target task flow depends as a directed edge of the directed acyclic graph G, wherein e_ij denotes the directed edge pointing from vertex v_i to vertex v_j, and the directed edges corresponding to all single tasks in the target task flow form the edge set E, E = {e_ij | (v_i, v_j)}; the weight of each directed edge is the execution duration of the corresponding single task, and the weight of directed edge e_ij is c_ij.

Preferably, in step S302, calculating the critical path of the directed acyclic graph and the execution duration of the critical path includes the steps of:

The earliest start time and the latest start time of each vertex are calculated on the directed acyclic graph G established in step S301. If the earliest start time and the latest start time of a vertex are equal, the vertex is added to the bottleneck event set and the directed edge corresponding to the vertex is added to the key node set CPL. That is, letting the earliest start time of vertex v_i be ES_i and its latest start time be LS_i, if ES_i equals LS_i, vertex v_i is added to the bottleneck event set and the directed edge e_ij corresponding to vertex v_i is added to the key node set CPL;

After all vertices of the directed acyclic graph G have been traversed, the resulting key node set CPL is the critical path of the directed acyclic graph G; the sum of the execution durations of all single tasks in the key node set CPL is the execution duration of the critical path, that is, the longest execution duration of the whole target task flow.
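
The earliest/latest start time calculation can be sketched as a forward and a backward pass over a topological order of G. The sketch below is one possible reading, not the patented implementation: the function name critical_path, the edge dictionary layout, and the sample graph are hypothetical, and an edge is placed in CPL only when both endpoints have zero slack and the edge itself is tight (ES_i + c_ij = ES_j), which is the usual critical path method condition and slightly stricter than the wording above.

```python
from collections import defaultdict, deque


def critical_path(n, edges):
    """edges maps (i, j) -> duration c_ij; vertices are the events 1..n.

    Returns (ES, LS, bottleneck vertices, critical edges CPL, total duration).
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in range(1, n + 1)}
    for (i, j), c in edges.items():
        succ[i].append((j, c))
        indeg[j] += 1

    # Topological order via Kahn's algorithm.
    order, queue = [], deque(v for v in indeg if indeg[v] == 0)
    while queue:
        v = queue.popleft()
        order.append(v)
        for w, _ in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    if len(order) != n:
        raise ValueError("graph contains a cycle")

    # Forward pass: earliest start time of each event.
    es = {v: 0 for v in order}
    for v in order:
        for w, c in succ[v]:
            es[w] = max(es[w], es[v] + c)
    total = max(es.values())

    # Backward pass: latest start time of each event.
    ls = {v: total for v in order}
    for v in reversed(order):
        for w, c in succ[v]:
            ls[v] = min(ls[v], ls[w] - c)

    bottleneck = [v for v in order if es[v] == ls[v]]
    cpl = [(i, j) for (i, j), c in edges.items()
           if es[i] == ls[i] and es[j] == ls[j] and es[i] + c == es[j]]
    return es, ls, bottleneck, cpl, total


# Hypothetical four-event task flow (not taken from the patent figures).
edges = {(1, 2): 3, (1, 3): 2, (2, 4): 4, (3, 4): 1}
es, ls, bottleneck, cpl, total = critical_path(4, edges)
```

Note that summing the durations in CPL equals the critical path duration only when the graph has a single critical path; with several tied paths, the total returned by the forward pass is the safer value.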

Preferably, in step S5, when forming the optimized task flow:

For the key fault node, checking error reasons, and modifying corresponding errors so as to form an optimized task flow;

and for the key timeout node, reconfiguring resources according to the data volume, the processor utilization rate and the memory occupation condition, so as to form an optimized task flow.

Preferably, in step S5, predicting the longest execution duration of the optimized task flow specifically includes the following steps:

Step S501: according to the number of processors requested by the user for the task, the memory usage, and the data volume of the task, finding, with the K-nearest-neighbor algorithm, the N historical tasks most similar to the current single task among the historically executed task records, calculating the average execution duration of these N historical tasks, and using this average as the predicted execution duration of the corresponding single task;

Step S502: establishing a directed acyclic graph according to the dependency relationships among the single tasks on which the optimized task flow depends and their predicted execution durations calculated in step S501, and calculating the critical path of the directed acyclic graph, the execution duration of the critical path being the predicted longest execution duration of the optimized task flow.

Preferably, in step S501, the N tasks similar to the current single task are found by similarity: letting the current single task be J₁ and any historical task be J₂, the similarity is sim(J₁, J₂), given by:

where x_a1 and x_a2 are the a-th feature vectors of the current single task J₁ and the historical task J₂, respectively, and m is the total number of feature vectors.
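
The nearest-neighbor prediction of step S501 can be sketched as below. The similarity formula is not reproduced in this text, so the sketch substitutes an assumed similarity, 1 / (1 + Euclidean distance); it is a stand-in, not the patent's formula. The function names, the feature layout (processors, memory in GB, data volume in GB), and the history records are likewise hypothetical.

```python
import math


def similarity(x1, x2):
    """Assumed similarity: 1 / (1 + Euclidean distance). Not the patent's formula."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))
    return 1.0 / (1.0 + dist)


def predict_duration(current, history, n_neighbors=3):
    """Average the durations of the N historical tasks most similar to `current`.

    history is a list of (features, duration) pairs.
    """
    if not history:
        raise ValueError("no historical records")
    ranked = sorted(history, key=lambda rec: similarity(current, rec[0]), reverse=True)
    nearest = ranked[:n_neighbors]
    return sum(duration for _, duration in nearest) / len(nearest)


# Hypothetical features: (processors, memory in GB, data volume in GB).
history = [
    ((4, 16, 100), 20.0),
    ((4, 16, 110), 22.0),
    ((8, 32, 100), 12.0),
    ((2, 8, 500), 90.0),
]
predicted = predict_duration((4, 16, 105), history, n_neighbors=2)
```

In practice the features would likely need to be scaled to comparable ranges first, since raw data volume would otherwise dominate the distance.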

Preferably, in step S6, whether the optimized task flow can be completed by the preset time is judged according to a rule: if the rule is satisfied, the task flow is judged able to be completed; otherwise, it is judged unable to be completed. The rule is the following formula:

T_s+t_max+t_cut≤T_f

where T_s is the preset start time of the optimized task flow, t_max is the longest execution duration of the optimized task flow obtained in step S5, t_cut is a preset threshold, and T_f is the preset completion time of the optimized task flow.
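
The judgment rule is a single inequality and can be sketched directly. The function name and the sample values are hypothetical; all four quantities are assumed to be expressed in the same time unit, here minutes after midnight.

```python
def can_finish_on_time(t_start, t_max, t_cut, t_finish):
    """Return True when T_s + t_max + t_cut <= T_f."""
    return t_start + t_max + t_cut <= t_finish


# Hypothetical schedule: start at 01:00, predicted duration 53 minutes,
# safety threshold 10 minutes.
on_time_by_0210 = can_finish_on_time(60, 53, 10, 130)  # deadline 02:10
on_time_by_0200 = can_finish_on_time(60, 53, 10, 120)  # deadline 02:00
if not on_time_by_0200:
    print("alarm: optimized task flow is predicted to miss the 02:00 deadline")
```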

Another technical solution of the present invention provides a device for optimizing a task flow of a data center station, which runs the above optimization method and includes:

a data reading module, which reads data such as the task names of the single tasks on which the task flow depends and the dependency relationship table among the single tasks;

a calculation module, which calculates the execution durations of the single tasks and the critical path of the task flow;

a prediction module, which predicts the execution durations of the single tasks and the overall execution duration of the task flow;

a monitoring module, which judges whether the task flow can be completed by the preset time and issues an alarm if it cannot.

The invention can monitor the execution condition of the task flow and give an alarm, thereby effectively supervising the task flow. When the task flow is abnormal, the method can not only give an alarm, but also accurately locate key abnormal nodes on the task flow, and simultaneously predict the execution time of the optimized task flow, and judge whether the optimized task flow can be completed in a preset time, thereby realizing the optimization of the task flow. Therefore, the invention can optimize the task flow of the data center station, improve the execution efficiency of the data center station, and improve the stability and reliability of the data center station.

Drawings

FIG. 1 is a flowchart of a method for optimizing a task flow of a data center station according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of a device for optimizing a task flow of a data center station according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a data warehouse task flow of a data center station according to an embodiment of the present invention;

FIG. 4 is the directed acyclic graph corresponding to the data warehouse task flow of the data center station according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a machine learning task flow of a data center station according to an embodiment of the present invention;

FIG. 6 is the directed acyclic graph corresponding to the machine learning task flow of the data center station according to an embodiment of the present invention.

Detailed Description

The application will be further illustrated with reference to specific examples. It is to be understood that these examples are illustrative of the present application and are not intended to limit the scope of the present application. Furthermore, it should be understood that various changes and modifications can be made by one skilled in the art after reading the teachings of the present application, and such equivalents are intended to fall within the scope of the application as defined in the appended claims.

Referring to fig. 1, an optimization method for a task flow of a data center station according to an embodiment of the present invention includes the following steps:

Step S1: reading data such as task names of all single tasks on which the target task flow depends, dependency relationship tables among all single tasks on which the task flow depends and the like;

Step S2: monitoring the execution result of the target task flow and calculating the execution durations of the single tasks on which the target task flow depends.

In this embodiment, in the execution process of the target task flow, the present invention determines an execution result of the target task flow, where the determined execution result includes: task completion, task error, task timeout.

The execution time length of the single task on which the target task flow depends is calculated and obtained according to the starting time and the ending time of the single task recorded in the log.

If the execution result of the target task flow is abnormal, step S3 is entered. In this embodiment, the execution result is judged to be abnormal if it is task error or task timeout. Otherwise, that is, when the execution result is task completion, the actual execution duration of each single task on which the target task flow depends is recorded, calculated from the start time and end time of the single task recorded in the log.

Step S3: for a target task flow whose execution result is abnormal, calculating the critical path of the target task flow and the longest execution duration of the whole target task flow according to the execution durations of the single tasks on which the target task flow depends.

This step further includes the following steps:

Step S301: establishing a directed acyclic graph according to the dependency relationships among the single tasks on which the task flow depends and the execution duration of each single task;

Step S302: calculating the critical path of the directed acyclic graph as the critical path of the target task flow; the execution duration of the critical path of the directed acyclic graph is the longest execution duration of the whole target task flow.

The directed acyclic graph is denoted G = (V, E), and establishing the directed acyclic graph G = (V, E) in step S301 further includes the steps of:

Defining each event in the target task flow as a vertex of the directed acyclic graph G, wherein the i-th vertex of G is denoted v_i, and the vertices corresponding to all n events in the target task flow form the vertex set V, V = {v₁, v₂, ..., v_n}.

In step S302, the critical path of the directed acyclic graph and the execution duration of the critical path are calculated, and further including the following steps:

The earliest start time and the latest start time of each vertex are calculated on the directed acyclic graph G established in step S301. If the earliest start time and the latest start time of a vertex are equal, the vertex is added to the bottleneck event set and the directed edge corresponding to the vertex is added to the key node set CPL. That is, letting the earliest start time of vertex v_i be ES_i and its latest start time be LS_i, if ES_i equals LS_i, vertex v_i is added to the bottleneck event set and the directed edge e_ij corresponding to vertex v_i is added to the key node set CPL. After all vertices of the directed acyclic graph G have been traversed, the resulting key node set CPL is the critical path of the directed acyclic graph G. The sum of the execution durations of all single tasks in the key node set CPL is the execution duration of the critical path, that is, the longest execution duration of the whole target task flow.

Step S4: determining the key abnormal node of the target task flow according to the critical path and the longest execution duration of the whole target task flow. In the present invention, a single task is defined as a node, and in step S4:

When the execution result is task timeout, the key timeout node is found, that is, the single task with the longest execution duration in the key node set CPL; a timeout alarm is issued, and the critical path of the task flow and the key timeout node are sent to the user.

Step S5: forming an optimized task flow according to the key abnormal node and predicting the longest execution duration of the optimized task flow. The key abnormal node is the key fault node or the key timeout node obtained in step S4.

Forming the optimized task flow:

In step S5, predicting the longest execution duration of the optimized task flow further includes the following steps:

Constructing a mathematical model by using a history execution record in a log, predicting the execution time length of a single task on which the optimized task flow depends by using the mathematical model, and calculating the predicted execution time length of the task flow according to the predicted execution time length of the single task, wherein the method further comprises the following steps:

Step S501: according to the number of processors requested by the user for the task, the memory usage, and the data volume of the task, the K-nearest-neighbor algorithm is used to find, among the historically executed task records, the N historical tasks most similar to the current single task; the average execution duration of these N historical tasks is calculated and used as the predicted execution duration of the corresponding single task.

In this embodiment, the N tasks similar to the current single task are found by similarity: letting the current single task be J₁ and any historical task be J₂, the similarity is sim(J₁, J₂), given by:

In this embodiment, whether the optimized task flow can be completed by the preset time is judged according to a rule: if the rule is satisfied, the task flow is judged able to be completed; otherwise, it is judged unable to be completed. In this embodiment, the rule is the following formula:

T_s+t_max+t_cut≤T_f

To achieve the above object, an embodiment of the present application further provides a device for optimizing a task flow of a data center station. FIG. 2 is a block diagram of the device according to an embodiment of the present application. Referring to FIG. 2, the device includes:

The data reading module 201 reads data such as the task names of the single tasks on which the task flow depends and the dependency relationship table among the single tasks.

The calculation module 202 calculates the execution durations of the single tasks and the critical path of the task flow.

The prediction module 203 predicts the execution durations of the single tasks and the overall execution duration of the task flow.

The monitoring module 204 judges whether the task flow can be completed by the preset time and issues an alarm if it cannot.

In order to further understand the optimization method of the task flow of the data center station in the embodiment of the invention, the data warehouse task flow and the machine learning task flow of the data center station are taken as examples to further explain the invention.

The data warehouse task flow optimization method of the data center station in the embodiment of the invention specifically comprises the following steps:

Reading task names of all single tasks relied by the task flow, and data such as a dependency relationship table among the single tasks of the task flow;

Monitoring the execution result of the target task flow and calculating the execution durations of the single tasks on which it depends. In the data warehouse task flow of this embodiment, the dependency relationships among the single tasks and their execution durations are shown in FIG. 3, where event T_ODS is the original-layer data table, T_DWD₁ to T_DWD₃ are detail-layer data tables, T_DWS₁ to T_DWS₅ are wide tables of the aggregation layer, and T_ADS is the application-layer data table; SQL₁ to SQL₁₃ are the running nodes of the workflow; and the number preceding SQL_i is the execution duration required by that SQL script.

A weighted directed acyclic graph is established from the task flow, as shown in FIG. 4. The data tables of each layer form the vertex set V of the directed acyclic graph, namely {v₁, v₂, ..., v₁₀}; the single task nodes SQL_i of the task flow and their execution order form the set E of directed edges e_ij; and the execution durations of SQL_i form the set C of the edge weights c_ij.

The earliest start time ES_i calculated for each vertex v_i is:

{ES₁: 0, ES₄: 11, ES₃: 6, ES₆: 21, ES₉: 26, ES₂: 4, ES₅: 15, ES₈: 42, ES₇: 18, ES₁₀: 53}

The latest start time LS_i calculated for each vertex v_i is:

{LS₁: 0, LS₄: 11, LS₃: 16, LS₆: 21, LS₉: 48, LS₂: 27, LS₅: 38, LS₈: 42, LS₇: 45, LS₁₀: 53}

From the earliest start times ES_i and latest start times LS_i obtained above, ES₁ = LS₁, ES₄ = LS₄, ES₆ = LS₆, ES₈ = LS₈, and ES₁₀ = LS₁₀. The vertices {v₁, v₄, v₆, v₈, v₁₀} therefore form the bottleneck event set, and the corresponding directed edges {e₁₄, e₄₆, e₆₈, e_{8,10}} form the key node set CPL; that is, the critical path of the target task flow in this embodiment is the task flow nodes {SQL₃, SQL₆, SQL₉, SQL₁₂}, and the longest execution duration of the entire task flow is 53.

According to FIG. 3, the single task with the longest execution duration on the critical path {SQL₃, SQL₆, SQL₉, SQL₁₂} is SQL₉. The system issues an abnormality alarm and sends the user the critical path {SQL₃, SQL₆, SQL₉, SQL₁₂} of the task flow and the key abnormal node SQL₉.
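
The figures for this embodiment can be cross-checked from the quoted ES and LS values alone. The sketch below computes each vertex's slack (LS minus ES) and recovers the bottleneck event set; it then assumes, as the text states, that consecutive bottleneck vertices are joined by the critical edges, so each critical edge weight is a difference of ES values. The variable names are illustrative.

```python
# Earliest (ES) and latest (LS) start times quoted in the text for v1..v10.
es = {1: 0, 2: 4, 3: 6, 4: 11, 5: 15, 6: 21, 7: 18, 8: 42, 9: 26, 10: 53}
ls = {1: 0, 2: 27, 3: 16, 4: 11, 5: 38, 6: 21, 7: 45, 8: 42, 9: 48, 10: 53}

# Slack of a vertex is LS - ES; zero slack marks a bottleneck event.
slack = {v: ls[v] - es[v] for v in sorted(es)}
bottleneck = [v for v, s in slack.items() if s == 0]

# Assumption: consecutive bottleneck vertices are joined by the critical
# edges named in the text, so each weight is an ES difference.
weights = {(i, j): es[j] - es[i] for i, j in zip(bottleneck, bottleneck[1:])}
longest = sum(weights.values())
slowest_edge = max(weights, key=weights.get)
```

Under that assumption the four critical edges weigh 11, 10, 21, and 11, which sum to 53, and the heaviest is the third edge (v₆ to v₈), consistent with SQL₉, the third task on the path, being reported as the key abnormal node.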

The user optimizes the task flow as the situation requires, based on the key abnormal node, to form the optimized task flow. According to the number of processors requested by the user for the optimized tasks, the memory usage, and the data volume of the tasks, the K-nearest-neighbor algorithm is applied to the data of historically executed task flows to predict the execution duration of each single task of the optimized task flow, from which the predicted longest execution duration t_max of the task flow is obtained.

Judging whether the optimized task flow can be completed on time according to the following formula:

T_s+t_max+t_cut≤T_f

where T_s is the preset start time of the task flow, t_cut is a threshold, and T_f is the preset completion time of the task flow.

According to the judging method, if the optimized task flow can not be completed on time, an alarm is sent out. If the task flow can be completed, continuously monitoring the execution state of the task flow and recording the execution time of the task flow.

The machine learning task flow optimization method of the data center station of the other embodiment of the invention specifically comprises the following steps:

Reading data such as the task names in the task flow and the dependency relationship table among the single tasks of the task flow;

Monitoring the execution result of the target task flow and calculating the execution durations of the single tasks on which it depends. The dependency relationships among the single tasks of the machine learning task flow of this embodiment and their execution durations are shown in FIG. 5.

A directed acyclic graph is constructed from the machine learning workflow above, as shown in FIG. 6. The data tables of each layer form the vertex set V of the directed acyclic graph, namely {v₁, v₂, ..., v₁₅}; the execution order of the single task nodes {load dataset i, data cleansing operator i, data merging operator i, feature encoding operator i, machine learning operator i, calculate final result} of the task flow forms the set E of directed edges e_ij; and the execution duration of each single task node forms the set C of the edge weights c_ij.

The earliest start time ES_i calculated for each vertex v_i is:

{ES₂: 0, ES₄: 4, ES₆: 11, ES₁: 0, ES₃: 3, ES₅: 9, ES₇: 13, ES₁₁: 19, ES₁₀: 22, ES₉: 20, ES₈: 21, ES₁₂: 25, ES₁₄: 45, ES₁₃: 36, ES₁₅: 49}

The latest start time LS_i calculated for each vertex v_i is:

{LS₂: 0, LS₄: 4, LS₆: 11, LS₁: 2, LS₃: 5, LS₅: 11, LS₇: 13, LS₁₁: 22, LS₁₀: 22, LS₉: 22, LS₈: 22, LS₁₂: 25, LS₁₄: 45, LS₁₃: 45, LS₁₅: 49}

From the earliest start times ES_i and latest start times LS_i obtained above, ES₂ = LS₂, ES₄ = LS₄, ES₆ = LS₆, ES₇ = LS₇, ES₁₀ = LS₁₀, ES₁₂ = LS₁₂, ES₁₄ = LS₁₄, and ES₁₅ = LS₁₅. The vertices {v₂, v₄, v₆, v₇, v₁₀, v₁₂, v₁₄, v₁₅} therefore form the bottleneck event set, and the corresponding directed edges {e₂₄, e₄₆, e₆₇, e_{7,10}, e_{10,12}, e_{12,14}, e_{14,15}} form the key node set CPL; that is, the critical path of this embodiment is the task flow nodes {load dataset 1, data cleansing operator 2, data merge 1, feature encoding operator 3, data merge 2, machine learning operator 2, calculate final result}. The longest execution duration of the entire task flow is 49.

According to FIG. 5, the single task with the longest execution duration on the critical path {load dataset 1, data cleansing operator 2, data merge 1, feature encoding operator 3, data merge 2, machine learning operator 2, calculate final result} of this embodiment is machine learning operator 2. The system issues an abnormality alarm and sends the user the critical path {load dataset 1, data cleansing operator 2, data merge 1, feature encoding operator 3, data merge 2, machine learning operator 2, calculate final result} of the task flow and the key abnormal node, machine learning operator 2.
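
The selection of the key timeout node in this embodiment can be sketched from the quoted ES values of the bottleneck vertices. Two assumptions are made here: the task names are listed in the same order as the critical edges, and each critical edge weight equals the ES difference of its endpoints. The variable names are illustrative.

```python
# ES values quoted in the text for the bottleneck vertices of FIG. 6.
es = {2: 0, 4: 4, 6: 11, 7: 13, 10: 22, 12: 25, 14: 45, 15: 49}

# Tasks on the critical path, assumed to be listed in edge order.
critical_tasks = [
    "load dataset 1",
    "data cleansing operator 2",
    "data merge 1",
    "feature encoding operator 3",
    "data merge 2",
    "machine learning operator 2",
    "calculate final result",
]

# Pair each task with the edge between consecutive bottleneck vertices.
path = sorted(es)
durations = {
    task: es[j] - es[i]
    for task, (i, j) in zip(critical_tasks, zip(path, path[1:]))
}

key_timeout_node = max(durations, key=durations.get)
total = sum(durations.values())
```

With these assumptions the durations are 4, 7, 2, 9, 3, 20, and 4, which sum to 49 and single out machine learning operator 2, matching the values reported for this embodiment.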

T_s+t_max+t_cut≤T_f

Compared with the prior art, the method and the device for optimizing the task flow of the data center platform have the following beneficial effects:

Claims

1. A method for optimizing a task flow of a data center station, characterized by comprising the following steps:

Step S2: monitoring the execution result of the target task flow, and judging the execution result of the target task flow in the execution process of the target task flow, wherein the judgment result comprises the following steps: task completion, task error and task overtime, judging the execution result of the target task flow as abnormal when the execution result is the task error or the task overtime, entering step S3, and returning to step S2 after recording the execution duration of the single task on which the target task flow depends when the execution result is the task completion;

2. The method for optimizing a task flow of a data center station according to claim 1, wherein in step S2 and step S3, the execution duration of a single task is calculated from the start time and end time of the single task recorded in the log.

3. The method for optimizing a task flow of a data center station according to claim 1, wherein the directed acyclic graph is denoted G = (V, E), and establishing the directed acyclic graph G = (V, E) in step S301 further comprises the steps of:

Defining each single task on which the target task flow depends as a directed edge of the directed acyclic graph G, wherein e_ij denotes the directed edge pointing from vertex v_i to vertex v_j, and the directed edges corresponding to all single tasks in the target task flow form the edge set E, E = {e_ij | (v_i, v_j)}; the weight of each directed edge is the execution duration of the corresponding single task, and the weight of directed edge e_ij is c_ij.

4. The method for optimizing a task flow of a data center station according to claim 3, wherein in step S302, calculating the critical path of the directed acyclic graph and the execution duration of the critical path comprises the steps of:

Calculating the earliest start time and the latest start time of each vertex on the directed acyclic graph G established in step S301; if the earliest start time and the latest start time of a vertex are equal, adding the vertex to the bottleneck event set and adding the directed edge corresponding to the vertex to the key node set CPL; that is, letting the earliest start time of vertex v_i be ES_i and its latest start time be LS_i, if ES_i equals LS_i, adding vertex v_i to the bottleneck event set and adding the directed edge e_ij corresponding to vertex v_i to the key node set CPL;

5. The method for optimizing a task flow of a data center station according to claim 1, wherein in step S5, when the optimized task flow is formed:

6. The method for optimizing a task flow of a data center station according to claim 5, wherein in step S5, predicting the longest execution duration of the optimized task flow specifically comprises the steps of:

7. The method for optimizing a task flow of a data center station according to claim 6, wherein in step S501, the N tasks similar to the current task are found by similarity: letting the current task be J₁ and any historical task be J₂, the similarity is sim(J₁, J₂), given by:

8. The method for optimizing a task flow of a data center station according to claim 1, wherein in step S6, whether the optimized task flow can be completed by the preset time is judged according to a rule: if the rule is satisfied, the task flow is judged able to be completed; otherwise, it is judged unable to be completed; the rule being the following formula:

T_s+t_max+t_cut≤T_f

9. A device for optimizing a task flow of a data center station, wherein the device runs the optimization method according to claim 1 and comprises:

a data reading module for reading the task names of all single tasks on which the task flow depends and the dependency relationship table data among the single tasks;