CN110673939B - Task scheduling method, device and medium based on airflow and yarn - Google Patents

Info

Publication number
CN110673939B
CN110673939B CN201910900859.0A CN201910900859A CN110673939B CN 110673939 B CN110673939 B CN 110673939B CN 201910900859 A CN201910900859 A CN 201910900859A CN 110673939 B CN110673939 B CN 110673939B
Authority
CN
China
Prior art keywords
task
tasks
time
resource
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910900859.0A
Other languages
Chinese (zh)
Other versions
CN110673939A (en)
Inventor
洪嘉凯
巫朝星
陈旺明
林智辉
郑旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honorsun Xiamen Data Co ltd
Original Assignee
Honorsun Xiamen Data Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honorsun Xiamen Data Co ltd
Priority to CN201910900859.0A
Publication of CN110673939A
Application granted
Publication of CN110673939B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a task scheduling method, device and medium based on Airflow and YARN. The method comprises: determining the dependencies among the tasks in a group of tasks, the earliest executable time of each task and the deadline by which each task must finish; dynamically setting the time range in which each task is allowed to run; generating a python file for each task; crawling the running conditions of all tasks in the group with a crawler; generating a resource overlay graph and a network Gantt chart; and adjusting the execution order of the tasks based on the resource overlay graph and the network Gantt chart. By collecting statistics on the execution time and memory usage of the tasks, and by combining critical path analysis with the actual usage of the system's computing resources, the invention uses the resource overlay graph and the network Gantt chart to optimize the execution order of the tasks automatically. This optimizes the total execution time of the tasks, reduces resource occupation at busy times and improves resource utilization at idle times. Specific conditions for moving a task are also provided.

Description

Task scheduling method, device and medium based on airflow and yarn
Technical Field
The invention relates to the technical field of computer task scheduling, and in particular to a task scheduling method, device and storage medium based on Airflow and YARN.
Background
At present, when a group of interdependent big data jobs runs, the jobs are planned and executed as a network diagram, so some jobs execute in parallel. Because the utilization of computing resources fluctuates widely, parallel jobs preempt each other's resources at peak times, which lengthens the overall running time of the whole group of tasks.
In the prior art, task scheduling limits the amount of resources a YARN task may use at run time without considering the peaks and valleys of resource usage, and simply limiting resource usage does not help tasks execute faster. Resource preemption between tasks still exists, which lengthens the execution time of the whole group of tasks.
The prior art therefore has the following defects: task preemption cannot be resolved based on the dependencies between tasks, or, once preemption is resolved, the execution duration is not shortened or is even lengthened. In other words, execution efficiency is low and the computing resources are not fully utilized.
Disclosure of Invention
The present invention provides the following technical solutions to overcome the above-mentioned drawbacks in the prior art.
A task scheduling method based on airflow and yarn, the method comprising:
a determining step of determining, for each task $\mathrm{task}_i$ in a group of tasks, its dependencies $D_i$ on other tasks, the earliest time $TZ_i$ at which the task may be executed, and the deadline $TD_i$ by which the task must end; dynamically setting the time range $TP_i$ in which the task is allowed to run; and storing these values in a database as a data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$, wherein the group of tasks comprises N tasks and $1 \le i \le N$;
an Airflow configuration generation step of generating a python file for each task from its data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ stored in the database, and submitting the generated python files to an Airflow server, which executes the tasks;
a task running recording step of using a crawler to access the YARN management interface, obtaining the running conditions of all tasks in the group and storing them in the database, wherein the running conditions comprise the resource occupation of each task and its actual start and end times, and generating a resource overlay graph based on the running conditions;
and a task adjusting step of using a crawler to access the Airflow server based on the group identifier dag_id of the group of tasks, obtaining the actual running time, consumed memory and dependencies of each task in the group, generating a network Gantt chart from them, and adjusting the execution order of the tasks based on the resource overlay graph and the network Gantt chart.
Further, the resource occupation is memory occupation or processor occupation.
Further, the abscissa of the resource overlay graph is time and the ordinate is the sum of the resources occupied by all tasks at each time point. Each rectangle in the network Gantt chart represents a task: the length of the rectangle is the running time of the task, its start and end correspond to the start and end times on the time axis, and its height represents the amount of resources the task consumes.
Further, the execution order of the tasks is adjusted based on the resource overlay graph and the network Gantt chart as follows: the time point with the maximum resource consumption is calculated from the resource overlay graph; the identifiers of all tasks corresponding to that time point are determined; the corresponding rectangles in the network Gantt chart are located from those identifiers; and the order of all tasks corresponding to that time point is adjusted based on the data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ of each task.
Further, the order of all tasks corresponding to the time point is adjusted based on the data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ of each task as follows: a critical path is calculated from the data records of the tasks in the group, the critical path being the logical path with the longest execution time from the start to the end of the group of tasks; no task is moved forward; when a task is moved, the tasks that depend on it are determined from the data records and moved correspondingly; a task to be moved that lies on the critical path is not moved; if, after the move, the end time of the task to be moved or of a task that depends on it would exceed its allowed latest end time, the move is not made; and if, after the move, the end time of the task to be moved or of a task that depends on it would be later than the end time of the latest-ending task before the move, the move is not made.
The invention also provides a task scheduling device based on airflow and yarn, which comprises:
a determining unit for determining, for each task $\mathrm{task}_i$ in a group of tasks, its dependencies $D_i$ on other tasks, the earliest time $TZ_i$ at which the task may be executed, and the deadline $TD_i$ by which the task must end; dynamically setting the time range $TP_i$ in which the task is allowed to run; and storing these values in a database as a data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$, wherein the group of tasks comprises N tasks and $1 \le i \le N$;
an Airflow configuration generation unit for generating a python file for each task from its data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ stored in the database, and submitting the generated python files to an Airflow server, which executes the tasks;
a task running recording unit for using a crawler to access the YARN management interface, obtaining the running conditions of all tasks in the group and storing them in the database, wherein the running conditions comprise the resource occupation of each task and its actual start and end times, and generating a resource overlay graph based on the running conditions;
and a task adjusting unit for using a crawler to access the Airflow server based on the group identifier dag_id of the group of tasks, obtaining the actual running time, consumed memory and dependencies of each task in the group, generating a network Gantt chart from them, and adjusting the execution order of the tasks based on the resource overlay graph and the network Gantt chart.
Further, the resource occupation is memory occupation or processor occupation.
Further, the abscissa of the resource overlay graph is time and the ordinate is the sum of the resources occupied by all tasks at each time point. Each rectangle in the network Gantt chart represents a task: the length of the rectangle is the running time of the task, its start and end correspond to the start and end times on the time axis, and its height represents the amount of resources the task consumes.
Further, the execution order of the tasks is adjusted based on the resource overlay graph and the network Gantt chart as follows: the time point with the maximum resource consumption is calculated from the resource overlay graph; the identifiers of all tasks corresponding to that time point are determined; the corresponding rectangles in the network Gantt chart are located from those identifiers; and the order of all tasks corresponding to that time point is adjusted based on the data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ of each task.
Further, the order of all tasks corresponding to the time point is adjusted based on the data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ of each task as follows: a critical path is calculated from the data records of the tasks in the group, the critical path being the logical path with the longest execution time from the start to the end of the group of tasks; no task is moved forward; when a task is moved, the tasks that depend on it are determined from the data records and moved correspondingly; a task to be moved that lies on the critical path is not moved; if, after the move, the end time of the task to be moved or of a task that depends on it would exceed its allowed latest end time, the move is not made; and if, after the move, the end time of the task to be moved or of a task that depends on it would be later than the end time of the latest-ending task before the move, the move is not made.
The invention also proposes a computer-readable storage medium having stored thereon computer program code which, when executed by a computer, performs any of the methods described above.
The technical effects of the invention are as follows. The invention first determines, for each task $\mathrm{task}_i$ in a group of tasks, its dependencies $D_i$ on other tasks, the earliest time $TZ_i$ at which the task may be executed, and the deadline $TD_i$ by which the task must end, dynamically sets the time range $TP_i$ in which the task is allowed to run, and stores these values in a database as a data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$, wherein the group of tasks comprises N tasks and $1 \le i \le N$. It then generates a python file for each task from its data record stored in the database and submits the generated python files to an Airflow server, which executes the tasks. Next, a crawler accesses the YARN management interface to obtain the running conditions of all tasks in the group and stores them in the database, the running conditions comprising the resource occupation of each task and its actual start and end times, and a resource overlay graph is generated from the running conditions. Finally, a crawler accesses the Airflow server based on the group identifier dag_id of the group of tasks to obtain the actual running time, consumed memory and dependencies of each task in the group, a network Gantt chart is generated from them, and the execution order of the tasks is adjusted based on the resource overlay graph and the network Gantt chart.
By collecting statistics on the execution time and memory usage of the tasks and combining critical path analysis with the actual usage of the system's computing resources, the invention automatically optimizes the execution order of the tasks using the resource overlay graph and the network Gantt chart, and thereby the total execution time of the tasks. The technique captures the precise strong dependencies among tasks and schedules the tasks intelligently through further critical path analysis, so that resource contention is avoided, the limited computing resources are fully utilized, resource preemption at busy times is reduced, and resource utilization at idle times is improved. Specific conditions for moving a task are also provided.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
Fig. 1 is a flowchart of a task scheduling method based on airflow and yarn according to an embodiment of the present invention.
Fig. 2 is a resource overlay graph before task scheduling according to an embodiment of the present invention.
Fig. 3 is a network Gantt chart before task scheduling according to an embodiment of the present invention.
Fig. 4 is a resource overlay graph after task scheduling according to an embodiment of the present invention.
Fig. 5 is a network Gantt chart after task scheduling according to an embodiment of the present invention.
Fig. 6 is a block diagram of a task scheduling device based on airflow and yarn according to an embodiment of the present invention.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein merely illustrate the invention and do not restrict it. It should be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an airflow and yarn-based task scheduling method of the present invention, which includes:
a determination step S101 of determining, for each task $\mathrm{task}_i$ in a group of tasks, its dependencies $D_i$ on other tasks, the earliest time $TZ_i$ at which the task may be executed, and the deadline $TD_i$ by which the task must end; dynamically setting the time range $TP_i$ in which the task is allowed to run; and storing these values in a database as a data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$, wherein the group of tasks comprises N tasks and $1 \le i \le N$. A group typically contains at least several tens of tasks and in many cases hundreds. The records may be stored in a MySQL database, although other databases are possible.
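The data record of step S101 can be pictured with a minimal Python sketch. The field names, the use of integer minutes as the time unit, the reading of $TP_i$ as a (start, end) window inside $[TZ_i, TD_i]$, and the use of an in-memory SQLite database as a stand-in for MySQL are all assumptions made for illustration; the patent does not specify a schema.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class TaskRecord:
    """One (task_i, D_i, TZ_i, TD_i, TP_i) record; all field names are hypothetical."""
    task_id: str                 # task_i
    depends_on: Tuple[str, ...]  # D_i: ids of the tasks this task depends on
    earliest_start: int          # TZ_i, in minutes (assumed unit)
    deadline: int                # TD_i, in minutes
    window: Tuple[int, int]      # TP_i, assumed to be a (start, end) range

    def is_consistent(self) -> bool:
        # Assumption: the allowed window must lie between TZ_i and TD_i.
        lo, hi = self.window
        return self.earliest_start <= lo < hi <= self.deadline


def save_records(conn: sqlite3.Connection, records: Iterable[TaskRecord]) -> None:
    """Store the records in one table (SQLite stands in for MySQL here)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_record ("
        "task_id TEXT PRIMARY KEY, depends_on TEXT, earliest_start INTEGER, "
        "deadline INTEGER, window_start INTEGER, window_end INTEGER)"
    )
    conn.executemany(
        "INSERT INTO task_record VALUES (?, ?, ?, ?, ?, ?)",
        [
            (r.task_id, ",".join(r.depends_on), r.earliest_start,
             r.deadline, r.window[0], r.window[1])
            for r in records
        ],
    )
    conn.commit()


records = [
    TaskRecord("t1", (), 0, 60, (0, 60)),
    TaskRecord("t2", ("t1",), 10, 90, (10, 90)),
]
conn = sqlite3.connect(":memory:")
save_records(conn, records)
```

Storing the dependency list as a comma-separated string keeps the sketch short; a separate dependency table would be the more usual relational design.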
An Airflow configuration generation step S102 of generating a python file for each task from its data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ stored in the database and, once python files have been generated for all tasks, submitting them to the Airflow server, which executes the tasks.
Since python code follows its own set of code specifications, only the parameters required by the invention need to be added on top of them. Specifically, the information for a task is defined in the form of a template, as shown in the following code, where ${} encloses the name of a parameter; when the task code is generated, the whole ${} expression is simply replaced with the value of that parameter.
[Figure GDA0003320128380000081: python task template with ${} parameter placeholders; image not reproduced]
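Because the template itself survives only as an image, the following is a hypothetical stand-in rather than the patent's template. It shows how ${} placeholders could be filled using Python's standard `string.Template`, which happens to use the same ${name} syntax. Every parameter name (`dag_id`, `task_id`, `start_date`, `schedule`, `command`, `timeout_minutes`) is an assumption, and the generated file uses the Airflow 1.x import path for `BashOperator`.

```python
from string import Template

# Hypothetical template: the patent's own template is an image that is not
# reproduced here, so every parameter name below is an assumption.
DAG_TEMPLATE = Template('''\
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

dag = DAG(
    dag_id="${dag_id}",
    start_date=datetime.strptime("${start_date}", "%Y-%m-%d"),
    schedule_interval="${schedule}",
)

task = BashOperator(
    task_id="${task_id}",
    bash_command="${command}",
    execution_timeout=timedelta(minutes=${timeout_minutes}),
    dag=dag,
)
''')


def render_dag_file(params: dict) -> str:
    """Replace every ${name} with its value; a missing name raises KeyError."""
    return DAG_TEMPLATE.substitute(params)


rendered = render_dag_file({
    "dag_id": "etl_group",
    "task_id": "load_orders",
    "start_date": "2019-01-01",
    "schedule": "@daily",
    "command": "spark-submit load_orders.py",
    "timeout_minutes": 30,
})
```

Rendering only produces text, so Airflow does not need to be installed to generate the file. Parameter values containing double quotes would need escaping before substitution, which this sketch does not attempt.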
When the configuration changes, the historical task on Airflow is deleted, the python file is uploaded to the server, the new task is added, and after the task has been reset it is switched on to run.
Tasks are uploaded over SFTP, which Linux servers support, and the server is operated from Java code. After uploading, the configuration must be refreshed manually, because Airflow on its own takes a long time to read the uploaded file. Specifically, a crawler accesses http://{ip address}:{port}/admin/airflow/tree?dag_id={dag_id}, where the ip address and port are those of the Airflow server and dag_id is the id of the uploaded task group. Clearing a task is done in a similar way, through the interface http://{ip address}:{port}/admin/airflow/clear.
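The two URL patterns above can be assembled as in the sketch below. The function names are hypothetical, and the `?` separator before `dag_id` is a reconstruction of the garbled source text based on the Airflow 1.x web UI. The sketch only builds the URL strings and issues no request, since the source does not describe the authentication or request parameters the crawler uses.

```python
from urllib.parse import urlencode


def airflow_tree_url(host: str, port: int, dag_id: str) -> str:
    """URL of the tree view that the crawler visits to refresh a task group."""
    query = urlencode({"dag_id": dag_id})  # percent-encodes unusual characters
    return f"http://{host}:{port}/admin/airflow/tree?{query}"


def airflow_clear_url(host: str, port: int) -> str:
    """URL of the clear interface; its request parameters are not given in the source."""
    return f"http://{host}:{port}/admin/airflow/clear"


tree_url = airflow_tree_url("10.0.0.1", 8080, "etl_group")
clear_url = airflow_clear_url("10.0.0.1", 8080)
```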
A task running recording step S103 of using the crawler to access the YARN management interface, obtaining the running conditions of all tasks in the group and storing them in the database, wherein the running conditions comprise the resource occupation of each task and its actual start and end times, and generating a resource overlay graph based on the running conditions.
The URL rule for accessing the YARN interface is: http://{ip address}:{port}/cluster/apps. The running conditions of all tasks can be obtained from this page. Because this data is needed for analysis, the resource occupation and the start and end times of the tasks are stored in the MySQL database; other databases are of course also possible.
A task adjusting step S104 of using a crawler to access the Airflow server based on the group identifier dag_id of the group of tasks, obtaining the actual running time, consumed memory and dependencies of each task in the group, generating a network Gantt chart from them, and adjusting the execution order of the tasks based on the resource overlay graph and the network Gantt chart. The resource occupation is memory occupation or processor occupation.
In one embodiment, the abscissa of the resource overlay graph is time and the ordinate is the sum of the resources occupied by all tasks at each time point. Each rectangle in the network Gantt chart represents a task: the length of the rectangle is the running time of the task, its start and end correspond to the start and end times on the time axis, and its height represents the amount of resources the task consumes.
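As a minimal sketch of how the ordinate of the resource overlay graph could be computed from the recorded runs, the code below sums the resources of every task running in each minute. The `Run` type, the one-minute resolution and the assumption that a task occupies a constant amount of memory for its whole run are illustrative choices, not details from the patent.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Run(NamedTuple):
    """One recorded task run (hypothetical shape of a stored running condition)."""
    task_id: int
    start: int   # actual start time, in minutes
    end: int     # actual end time, in minutes (exclusive)
    memory: int  # resource occupied while running, assumed constant


def resource_overlay(runs: List[Run]) -> Dict[int, int]:
    """Map each minute to the total resources occupied by the tasks running in it."""
    overlay: Dict[int, int] = defaultdict(int)
    for run in runs:
        for minute in range(run.start, run.end):
            overlay[minute] += run.memory
    return dict(overlay)


runs = [Run(1, 0, 10, 100), Run(2, 5, 15, 200), Run(3, 12, 20, 50)]
overlay = resource_overlay(runs)
```

Plotting `overlay` with time on the abscissa gives the stacked profile described above; each `Run` corresponds to one rectangle of the network Gantt chart, with `end - start` as its length and `memory` as its height.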
In one embodiment, the execution order of the tasks is adjusted based on the resource overlay graph and the network Gantt chart as follows: the time point with the maximum resource consumption is calculated from the resource overlay graph; the identifiers of all tasks corresponding to that time point are determined; the corresponding rectangles in the network Gantt chart are located from those identifiers; and the order of all tasks corresponding to that time point is adjusted based on the data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ of each task.
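The first part of this operation, locating the peak and the tasks that make it up, might look like the sketch below. It assumes each task occupies a constant amount of resources while it runs; under that assumption the total load can only rise at a task's start time, so checking the start times is enough. The `Run` type and the function name are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Run(NamedTuple):
    task_id: int
    start: int   # minutes
    end: int     # minutes (exclusive)
    memory: int  # assumed constant while the task runs


def peak_and_tasks(runs: List[Run]) -> Tuple[int, int, List[int]]:
    """Return (peak time, peak load, ids of the tasks running at the peak)."""
    def load(t: int) -> int:
        return sum(r.memory for r in runs if r.start <= t < r.end)

    # The load is a step function that only increases when a task starts,
    # so its maximum is reached at one of the start times.
    candidates = sorted({r.start for r in runs})
    peak_time = max(candidates, key=load)  # earliest start time with maximal load
    task_ids = [r.task_id for r in runs if r.start <= peak_time < r.end]
    return peak_time, load(peak_time), task_ids


runs = [Run(1, 0, 10, 100), Run(2, 5, 15, 200), Run(3, 12, 20, 50)]
peak_time, peak_load, peak_task_ids = peak_and_tasks(runs)
```

The returned task ids are the rectangles to look up in the network Gantt chart when deciding which task to move.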
In one embodiment, the order of all tasks corresponding to the time point is adjusted based on the data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ of each task as follows: a critical path is calculated from the data records of the tasks in the group, the critical path being the logical path with the longest execution time from the start to the end of the group of tasks; no task is moved forward; when a task is moved, the tasks that depend on it are determined from the data records and moved correspondingly; a task to be moved that lies on the critical path is not moved; if, after the move, the end time of the task to be moved or of a task that depends on it would exceed its allowed latest end time, the move is not made; and if, after the move, the end time of the task to be moved or of a task that depends on it would be later than the end time of the latest-ending task before the move, the move is not made. In addition, the critical path must not change after a task is moved, and resource usage after the move should show no obvious peak. These two conditions also determine whether a task is moved and may be used together with the other conditions.
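The move conditions can be expressed as a single feasibility check, sketched below under several assumptions that the patent leaves open: "not moved forward" is read as "only delayed", dependent tasks are shifted by the same delay as the moved task, a move is also rejected when a dependent task lies on the critical path, and the critical path is supplied as a precomputed set. The `Slot` type and all names are hypothetical, and the "no obvious peak" condition is not modelled.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Slot:
    start: int       # scheduled start, in minutes
    end: int         # scheduled end, in minutes
    latest_end: int  # allowed latest end time of the task
    depends_on: Tuple[str, ...] = ()


def dependents_of(task_id: str, schedule: Dict[str, Slot]) -> Set[str]:
    """All tasks that depend on task_id, directly or transitively."""
    found: Set[str] = set()
    stack = [task_id]
    while stack:
        current = stack.pop()
        for name, slot in schedule.items():
            if current in slot.depends_on and name not in found:
                found.add(name)
                stack.append(name)
    return found


def try_move(task_id: str, delay: int, schedule: Dict[str, Slot],
             critical: Set[str]) -> Optional[Dict[str, Slot]]:
    """Return the schedule after delaying task_id, or None if a rule forbids it."""
    if delay <= 0:  # assumption: tasks are only ever delayed
        return None
    moved = {task_id} | dependents_of(task_id, schedule)
    if moved & critical:  # critical-path tasks stay put (extended to dependents)
        return None
    last_end_before = max(slot.end for slot in schedule.values())
    candidate = dict(schedule)
    for name in moved:
        slot = schedule[name]
        shifted = replace(slot, start=slot.start + delay, end=slot.end + delay)
        # Reject if a moved task would miss its latest end time or would
        # finish after the latest-ending task of the original schedule.
        if shifted.end > shifted.latest_end or shifted.end > last_end_before:
            return None
        candidate[name] = shifted
    return candidate


schedule = {
    "a": Slot(0, 10, 40),
    "b": Slot(10, 30, 40, ("a",)),
    "c": Slot(0, 5, 20),
    "d": Slot(5, 10, 20, ("c",)),
}
critical = {"a", "b"}
moved_ok = try_move("c", 10, schedule, critical)
moved_too_far = try_move("c", 15, schedule, critical)
moved_critical = try_move("a", 5, schedule, critical)
```

Returning a new schedule rather than mutating the old one makes it easy to compare the resource overlay before and after a candidate move.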
The following describes in detail how tasks are moved in a specific embodiment. Fig. 2 shows the overlay of resource consumption at each time point. Resource occupation peaks at about 25 minutes, which may cause resource preemption, while other times, such as about 33 minutes, are largely idle. The start times of tasks can therefore be moved accordingly.
A network Gantt chart is drawn from the actual running time, consumed memory and dependencies, as shown in Fig. 3. Each rectangle is a task. The start and end of the rectangle correspond to times on the time axis, so its length along the abscissa is the running time, and its height indicates the amount of resources the task consumes. The number at the start of the rectangle is the task ID. In Fig. 3, tasks 11, 14, 16, 17, 19, 20 and 24 lie on the critical path, and the remaining tasks are non-critical-path tasks. The critical path is the logical path with the longest execution time from the start to the end of a group of tasks. Optimizing the critical path is an effective way to speed up the execution of a group of tasks: in general, the execution time of a group of tasks depends on the path with the longest execution time and is independent of the other, shorter paths. The critical path may be recalculated iteratively during optimization until its delay can no longer be reduced.
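One common way to compute such a critical path is a longest-path pass over the dependency graph in topological order, sketched below. This is a standard technique and not necessarily the patent's procedure: it considers only durations and dependencies and ignores the earliest start times $TZ_i$. The function and variable names are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional, Tuple


def critical_path(durations: Dict[str, int],
                  depends_on: Dict[str, Tuple[str, ...]]) -> Tuple[List[str], int]:
    """Return the longest path through the dependency graph and its total duration."""
    graph = {task: depends_on.get(task, ()) for task in durations}
    finish: Dict[str, int] = {}                 # earliest finish time of each task
    best_pred: Dict[str, Optional[str]] = {}    # predecessor that finishes last
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(graph).static_order():
        pred = max(graph[task], key=lambda p: finish[p], default=None)
        best_pred[task] = pred
        finish[task] = durations[task] + (finish[pred] if pred is not None else 0)
    # Walk back from the task that finishes last.
    node: Optional[str] = max(finish, key=lambda t: finish[t])
    total = finish[node]
    path: List[str] = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return path[::-1], total


durations = {"a": 3, "b": 5, "c": 2, "d": 4}
depends_on = {"b": ("a",), "c": ("a",), "d": ("b", "c")}
path, total = critical_path(durations, depends_on)
```

In this example the path through `b` (3 + 5 + 4) is longer than the path through `c` (3 + 2 + 4), so only shortening `a`, `b` or `d` can shorten the whole group.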
The tasks to be moved are determined from the network Gantt chart and the resource overlay graph: it is chosen to move task 9 to task 12 for execution and then to move task 29 to task 11. The resource overlay graph after the move is shown in Fig. 4 and the network Gantt chart after the move in Fig. 5. Figs. 4 and 5 show that the peak resource consumption falls from 283623 to 156724 and the overall running time falls from 40 minutes to 37 minutes. The resource occupation of task 17 on the critical path rises slightly, from 80627 to 93372; because this critical-path task obtains more resources, its running time falls, which shortens the overall running time.
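For orientation, the relative size of these changes can be worked out from the reported figures alone; the helper name below is hypothetical and the patent states only the absolute values.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


peak_change = percent_change(283623, 156724)    # peak resource consumption
runtime_change = percent_change(40, 37)         # overall running time, minutes
task17_change = percent_change(80627, 93372)    # resource occupation of task 17
```

Rounded to one decimal place, the peak falls by about 44.7 percent and the running time by 7.5 percent, while the occupation of task 17 rises by about 15.8 percent.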
By collecting statistics on the execution time and memory usage of the tasks and combining critical path analysis with the actual usage of the system's computing resources, the method automatically optimizes the execution order of the tasks using the resource overlay graph and the network Gantt chart, and thereby the total execution time of the tasks. The technique captures the precise strong dependencies among tasks and schedules the tasks intelligently through further critical path analysis, so that resource contention is avoided, the limited computing resources are fully utilized, resource occupation at busy times is reduced, and resource utilization at idle times is improved. Specific conditions for moving a task are also provided.
Fig. 6 shows an airflow and yarn-based task scheduling apparatus of the present invention, which includes:
a determining unit 601 for determining, for each task $\mathrm{task}_i$ in a group of tasks, its dependencies $D_i$ on other tasks, the earliest time $TZ_i$ at which the task may be executed, and the deadline $TD_i$ by which the task must end; dynamically setting the time range $TP_i$ in which the task is allowed to run; and storing these values in a database as a data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$, wherein the group of tasks comprises N tasks and $1 \le i \le N$. A group typically contains at least several tens of tasks and in many cases hundreds. The records may be stored in a MySQL database, although other databases are possible.
An Airflow configuration generation unit 602, which generates a python file for each task from its data record $(\mathrm{task}_i, D_i, TZ_i, TD_i, TP_i)$ stored in the database and, once python files have been generated for all tasks, submits them to the Airflow server, which executes the tasks.
Since python code follows its own set of code specifications, only the parameters required by the invention need to be added on top of them. Specifically, the information for a task is defined in the form of a template, as shown in the following code, where ${} encloses the name of a parameter; when the task code is generated, the whole ${} expression is simply replaced with the value of that parameter.
[Figure GDA0003320128380000121: python task template with ${} parameter placeholders; image not reproduced]
When the configuration changes, the historical task on Airflow is deleted, the python file is uploaded to the server, the new task is added, and after the task has been reset it is switched on to run.
Tasks are uploaded over SFTP, which Linux servers support, and the server is operated from Java code. After uploading, the configuration must be refreshed manually, because Airflow on its own takes a long time to read the uploaded file. Specifically, a crawler accesses http://{ip address}:{port}/admin/airflow/tree?dag_id={dag_id}, where the ip address and port are those of the Airflow server and dag_id is the id of the uploaded task group. Clearing a task is done in a similar way, through the interface http://{ip address}:{port}/admin/airflow/clear.
A task running recording unit 603, which uses the crawler to access the YARN management interface, obtains the running conditions of all tasks in the group and stores them in the database, wherein the running conditions comprise the resource occupation of each task and its actual start and end times, and generates a resource overlay graph based on the running conditions.
The URL rule for accessing the YARN interface is: http://{ip address}:{port}/cluster/apps. The running conditions of all tasks can be obtained from this page. Because this data is needed for analysis, the resource occupation and the start and end times of the tasks are stored in the MySQL database; other databases are of course also possible.
The task adjusting unit 604 accesses the Airflow server by using the crawler based on the group identifier dag _ id of the group of tasks to obtain the actual running time, the consumed memory and the dependency relationship of each task in the group of tasks, generates a network gantt chart based on the actual running time, the consumed memory and the dependency relationship of each task, and adjusts the execution sequence of the tasks based on the resource overlay chart and the network gantt chart. The resource occupation condition is the condition of occupying a memory or a processor.
In one embodiment, the abscissa of the resource overlay graph is time, the ordinate is the sum of resources occupied by all tasks corresponding to each time point, each rectangle in the network gantt chart represents a task, the length of the rectangle is the running time of the task, the starting and ending time corresponds to time on a time axis, and the height of the rectangle frame represents the amount of resources consumed by the task.
In one embodiment, the operation of adjusting the execution sequence of the tasks based on the resource overlay graph and the network gantt chart is as follows: calculating the time point with the maximum resource consumption based on the resource overlay graph, determining the identifications of all tasks corresponding to the time point, determining all corresponding rectangles in the network Gantt graph based on the identifications of all tasks corresponding to the time point, and recording data (task) based on each taski、Di、TZi、TDi、TPi) And adjusting the sequence of all tasks corresponding to the time point.
In one embodiment, the operation of adjusting the sequence of all tasks corresponding to the time point based on the data record (task_i, D_i, TZ_i, TD_i, TP_i) of each task is as follows: a critical path is calculated based on the data record of each task in the group of tasks, the critical path being the logical path with the longest execution time from beginning to end in the group of tasks; no task is moved forward; when a task is moved, the tasks that depend on it are determined from the data records and are moved correspondingly; if the task to be moved is located on the critical path, it is not moved; it is judged whether the end time of the task to be moved and of the tasks that depend on it would, after the move, exceed the set allowed latest end time, and if so, the task is not moved; and it is judged whether the end time of the task to be moved and of the tasks that depend on it would, after the move, be later than the end time of the latest-ending task before the move, and if so, the task is not moved. In addition, the critical path must not change after a task is moved, and the resource usage after the move should show no obvious peak; these two conditions are also used to determine whether to move a task, and may be combined with the other conditions.
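The move conditions listed above can be illustrated with the following sketch. It is not the claimed implementation: the function names and data layout are hypothetical, "not moved forward" is assumed to mean that a task is only ever delayed, dependent tasks are assumed to be shifted by the same delay, and the two additional conditions (unchanged critical path, no obvious peak afterwards) are not checked here.

```python
def dependents_of(task_id, deps):
    """All tasks that depend, directly or transitively, on task_id.

    deps maps each task to the list of tasks it depends on (D_i).
    """
    found, stack = set(), [task_id]
    while stack:
        current = stack.pop()
        for tid, parents in deps.items():
            if current in parents and tid not in found:
                found.add(tid)
                stack.append(tid)
    return found


def try_move(task_id, delay, schedule, deps, deadline, critical_path):
    """Return the new schedule if the move is allowed, otherwise None.

    schedule: {task_id: (start, end)}; deadline: {task_id: TD_i}.
    """
    # Tasks are only delayed (assumption), and critical-path tasks are never moved.
    if delay <= 0 or task_id in critical_path:
        return None
    latest_end_before = max(end for _, end in schedule.values())
    new_schedule = dict(schedule)
    for tid in {task_id} | dependents_of(task_id, deps):
        start, end = schedule[tid]
        new_end = end + delay
        # Reject if the allowed latest end time or the previous overall end is exceeded.
        if new_end > deadline[tid] or new_end > latest_end_before:
            return None
        new_schedule[tid] = (start + delay, new_end)
    return new_schedule


# Illustrative group of four tasks; "a" -> "d" is the critical path.
schedule = {"a": (0, 10), "b": (0, 4), "c": (4, 7), "d": (10, 15)}
deps = {"a": [], "b": [], "c": ["b"], "d": ["a"]}
deadline = {"a": 15, "b": 12, "c": 12, "d": 15}
critical = ["a", "d"]

allowed = try_move("b", 3, schedule, deps, deadline, critical)
too_late = try_move("b", 6, schedule, deps, deadline, critical)
on_critical = try_move("a", 1, schedule, deps, deadline, critical)
```

In this example, delaying task "b" by 3 also shifts its dependent "c", while a delay of 6 is rejected because "c" would end after its deadline.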
The following describes in detail, in a specific embodiment, how tasks are moved. Fig. 2 shows the overlay of resource consumption at each time point. Resource occupation reaches its highest value at about 25 minutes, which may cause resource preemption problems, while at other times, for example at about 33 minutes, the resources are largely idle. The start times of tasks can therefore be moved according to this characteristic.
A network Gantt chart is drawn according to the actual running time, the consumed memory and the dependency relationship, as shown in fig. 3. Each rectangle is a task. The start and end of the rectangle correspond to times on the time axis, so the running time is the length of the rectangle along the abscissa, while the height of the rectangle indicates the amount of resources consumed by the task. The number at the beginning of each rectangle is the task ID. In fig. 3, tasks 11, 14, 16, 17, 19, 20 and 24 are located on the critical path, and the remaining tasks are non-critical-path tasks. The critical path is the logical path with the longest execution time from the beginning to the end of a group of tasks. Optimizing the critical path is an effective means of increasing the execution speed of a group of tasks. Generally, the execution time of a group of tasks depends on the path with the longest execution time and is independent of the other, shorter paths. The critical path may be calculated iteratively during optimization until the critical path delay cannot be reduced further.
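One common way to compute such a critical path is a longest-path pass over the dependency graph in topological order. The sketch below illustrates that general technique and is not necessarily the calculation used in this embodiment; the task names and durations are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (total time, task ids) of the longest path through the dependency graph.

    durations: {task_id: running time}; deps: {task_id: tasks it depends on}.
    """
    finish = {}    # earliest finish time of each task
    previous = {}  # predecessor on the longest path into each task
    for tid in TopologicalSorter(deps).static_order():
        parents = deps.get(tid, [])
        # The task can start only after its latest-finishing dependency.
        best = max(parents, key=lambda p: finish[p], default=None)
        previous[tid] = best
        finish[tid] = (finish[best] if best is not None else 0) + durations[tid]
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = previous[node]
    return total, path[::-1]


# Illustrative group of tasks only.
durations = {"a": 10, "b": 4, "c": 3, "d": 5}
deps = {"a": [], "b": [], "c": ["b"], "d": ["a", "c"]}
total_time, path = critical_path(durations, deps)
```

Here the path through "a" and "d" takes 15 time units, longer than the path through "b", "c" and "d" (12 units), so shortening "b" or "c" would not shorten the group as a whole.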
The tasks to be moved are determined based on the network Gantt chart of the tasks and the resource consumption overlay graph. In this example, task 9 is moved to task 12 for execution, and task 29 is then moved to task 11. The network Gantt chart after the move is shown in fig. 4, and the resource overlay graph after the move is shown in fig. 5. As can be seen from figs. 4-5, the peak resource consumption is reduced from 283623 to 156724, and the overall running time is reduced from 40 minutes to 37 minutes. The resource occupation of task 17 on the critical path increases slightly, from 80627 to 93372; because a task on the critical path obtains more resources, its running time decreases, and the overall running time decreases accordingly.
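For orientation, the figures quoted above correspond to a drop of roughly 45% in peak resource consumption, a 7.5% shorter overall running time, and an increase of roughly 16% in the resources of task 17. The short calculation below reproduces these ratios; the helper name is hypothetical and only the six numbers come from the text.

```python
def relative_change(before, after):
    """Relative change from before to after, as a fraction of before."""
    return (after - before) / before


peak_change = relative_change(283623, 156724)   # peak resource consumption
runtime_change = relative_change(40, 37)        # overall running time in minutes
task17_change = relative_change(80627, 93372)   # resource occupation of task 17
```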
The device collects statistics on the execution time and memory occupation of the tasks and combines critical path analysis with the actual usage of the system's computing resources to automatically optimize the execution sequence of the tasks, and thereby the total execution time of the tasks. The technique maintains accurate, strong dependency relationships among the tasks, schedules the tasks intelligently through critical path analysis, avoids resource contention, and makes full use of limited computing resources: resource occupation is reduced in busy periods and resource utilization is improved in idle periods. Specific conditions for moving tasks are also provided. This is an important inventive point of the invention.
For convenience of description, the above device is described as being divided into various units by function, which are described separately. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
Finally, it should be noted that although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the invention, and that all such modifications are intended to be covered by the appended claims.

Claims (7)

1. A task scheduling method based on airflow and yarn is characterized by comprising the following steps:
a determining step of determining, for each task task_i in a group of tasks, its dependency relationship D_i with other tasks, the earliest time TZ_i at which the task can be executed, and the deadline TD_i by which the task must end, and dynamically setting the time range TP_i in which each task is allowed to run, whereupon task_i, D_i, TZ_i, TD_i and TP_i form a data record (task_i, D_i, TZ_i, TD_i, TP_i) that is stored in a database, wherein the group of tasks comprises N tasks and 1 ≤ i ≤ N;
an Airflow configuration generation step of generating a python file for each task of the group of tasks based on its data record (task_i, D_i, TZ_i, TD_i, TP_i) stored in the database, and submitting the generated python file to an Airflow server for executing the task;
a task running recording step of using a crawler to access the yarn management interface to obtain the running conditions of all tasks of the group of tasks and storing them in the database, wherein the running conditions comprise the resource occupation condition of each task and the actual start and end time of each task, and generating a resource overlay graph based on the running conditions;
a task adjusting step of using the crawler to access the Airflow server based on the group identifier dag_id of the group of tasks to obtain the actual running time, the consumed memory and the dependency relationship of each task in the group of tasks, generating a network Gantt chart based on the actual running time, the consumed memory and the dependency relationship of each task, and adjusting the execution sequence of the tasks based on the resource overlay graph and the network Gantt chart;
wherein the operation of adjusting the execution sequence of the tasks based on the resource overlay graph and the network Gantt chart is as follows: calculating the time point with the maximum resource consumption based on the resource overlay graph, determining the identifiers of all tasks corresponding to that time point, determining all corresponding rectangles in the network Gantt chart based on those identifiers, and adjusting the sequence of all tasks corresponding to that time point based on the data record (task_i, D_i, TZ_i, TD_i, TP_i) of each task;
and wherein the operation of adjusting the sequence of all tasks corresponding to the time point based on the data record (task_i, D_i, TZ_i, TD_i, TP_i) of each task is as follows: calculating a critical path based on the data record of each task in the group of tasks, the critical path being the logical path with the longest execution time from beginning to end in the group of tasks; no task is moved forward; when a task is moved, the tasks that depend on it are determined from the data records and are moved correspondingly; if the task to be moved is located on the critical path, it is not moved; judging whether the end time of the task to be moved and of the tasks that depend on it would, after the move, exceed the set allowed latest end time, and if so, not moving the task; and judging whether the end time of the task to be moved and of the tasks that depend on it would, after the move, be later than the end time of the latest-ending task before the move, and if so, not moving the task.
2. The method of claim 1, wherein the resource occupation condition is the occupation of memory or of a processor.
3. The method according to claim 2, wherein the abscissa of the resource overlay graph is time, the ordinate is the sum of the resources occupied by all tasks at each time point, each rectangle in the network Gantt chart represents a task, the length of the rectangle is the running time of the task, its start and end correspond to times on the time axis, and the height of the rectangle represents the amount of resources consumed by the task.
4. An airflow and yarn based task scheduler, the apparatus comprising:
a determining unit for determining, for each task task_i in a group of tasks, its dependency relationship D_i with other tasks, the earliest time TZ_i at which the task can be executed, and the deadline TD_i by which the task must end, and dynamically setting the time range TP_i in which each task is allowed to run, whereupon task_i, D_i, TZ_i, TD_i and TP_i form a data record (task_i, D_i, TZ_i, TD_i, TP_i) that is stored in a database, wherein the group of tasks comprises N tasks and 1 ≤ i ≤ N;
an Airflow configuration generation unit for generating a python file for each task of the group of tasks based on its data record (task_i, D_i, TZ_i, TD_i, TP_i) stored in the database, and submitting the generated python file to an Airflow server for executing the task;
a task running recording unit for using a crawler to access the yarn management interface to obtain the running conditions of all tasks of the group of tasks and storing them in the database, wherein the running conditions comprise the resource occupation condition of each task and the actual start and end time of each task, and for generating a resource overlay graph based on the running conditions;
a task adjusting unit for using the crawler to access the Airflow server based on the group identifier dag_id of the group of tasks to obtain the actual running time, the consumed memory and the dependency relationship of each task in the group of tasks, generating a network Gantt chart based on the actual running time, the consumed memory and the dependency relationship of each task, and adjusting the execution sequence of the tasks based on the resource overlay graph and the network Gantt chart;
wherein the operation of adjusting the execution sequence of the tasks based on the resource overlay graph and the network Gantt chart is as follows: calculating the time point with the maximum resource consumption based on the resource overlay graph, determining the identifiers of all tasks corresponding to that time point, determining all corresponding rectangles in the network Gantt chart based on those identifiers, and adjusting the sequence of all tasks corresponding to that time point based on the data record (task_i, D_i, TZ_i, TD_i, TP_i) of each task;
and wherein the operation of adjusting the sequence of all tasks corresponding to the time point based on the data record (task_i, D_i, TZ_i, TD_i, TP_i) of each task is as follows: calculating a critical path based on the data record of each task in the group of tasks, the critical path being the logical path with the longest execution time from beginning to end in the group of tasks; no task is moved forward; when a task is moved, the tasks that depend on it are determined from the data records and are moved correspondingly; if the task to be moved is located on the critical path, it is not moved; judging whether the end time of the task to be moved and of the tasks that depend on it would, after the move, exceed the set allowed latest end time, and if so, not moving the task; and judging whether the end time of the task to be moved and of the tasks that depend on it would, after the move, be later than the end time of the latest-ending task before the move, and if so, not moving the task.
5. The apparatus of claim 4, wherein the resource occupation condition is the occupation of memory or of a processor.
6. The apparatus according to claim 5, wherein the abscissa of the resource overlay graph is time, the ordinate is the sum of the resources occupied by all tasks at each time point, each rectangle in the network Gantt chart represents a task, the length of the rectangle is the running time of the task, its start and end correspond to times on the time axis, and the height of the rectangle represents the amount of resources consumed by the task.
7. A computer-readable storage medium, characterized in that the storage medium has stored thereon computer program code which, when executed by a computer, performs the method of any of claims 1-3.
CN201910900859.0A 2019-09-23 2019-09-23 Task scheduling method, device and medium based on airflow and yarn Active CN110673939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910900859.0A CN110673939B (en) 2019-09-23 2019-09-23 Task scheduling method, device and medium based on airflow and yarn

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910900859.0A CN110673939B (en) 2019-09-23 2019-09-23 Task scheduling method, device and medium based on airflow and yarn

Publications (2)

Publication Number Publication Date
CN110673939A CN110673939A (en) 2020-01-10
CN110673939B true CN110673939B (en) 2021-12-28

Family

ID=69077270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910900859.0A Active CN110673939B (en) 2019-09-23 2019-09-23 Task scheduling method, device and medium based on airflow and yarn

Country Status (1)

Country Link
CN (1) CN110673939B (en)

Also Published As

Publication number Publication date
CN110673939A (en) 2020-01-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant