CN116069462A - Big data DAG task flow scheduling method, system and storage medium - Google Patents

Big data DAG task flow scheduling method, system and storage medium

Info

Publication number
CN116069462A
CN116069462A
Authority
CN
China
Prior art keywords
task
engine
scheduling
dag
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211562460.4A
Other languages
Chinese (zh)
Inventor
许佳裕
吴少华
吴江煌
连慧奇
宋峥晨
庄晓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meiya Yian Information Technology Co ltd
Original Assignee
Xiamen Meiya Yian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meiya Yian Information Technology Co ltd filed Critical Xiamen Meiya Yian Information Technology Co ltd
Priority to CN202211562460.4A priority Critical patent/CN116069462A/en
Publication of CN116069462A publication Critical patent/CN116069462A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a big data DAG task flow scheduling method, which specifically comprises the following steps: performing visual scheduling configuration of the DAG flow based on Nginx; an external system accesses the scheduling platform through the cross-platform interface scheduler-api, which transmits the instruction to the scheduling center scheduler-master for further processing; the scheduling center scheduler-master responds with a preset scheduling algorithm to complete DAG flow splitting, task storage, result callback and execution notification; and the execution center scheduler-worker loads the required task engine, executes the task, tracks and records its progress, and stores the records. Through the design of each link, high performance and high scalability are achieved. With the development of the Internet, the method effectively solves the difficult problems in modern enterprise digitalization of integrating old and new data, processing very large data volumes, and coping with continuous expansion. By constructing an all-round scheduling platform along multiple dimensions, it improves enterprise efficiency, saves a large amount of manpower, and reduces the investment enterprises must make in data maintenance in the era of data expansion.

Description

Big data DAG task flow scheduling method, system and storage medium
Technical Field
The invention belongs to the technical field of big data processing, and particularly relates to a big data DAG task flow scheduling method, a system and a storage medium.
Background
In the present era of data informatization and data explosion, the data structures of different enterprises differ, and the data produced in the large-scale business operation stage are mostly traditional structured data. These are essentially data from traditional pillar industries such as trade, commerce, logistics, insurance and stocks, which carry very high privacy and security requirements. Most data types of the Internet age, by contrast, are unstructured: social network data, e-commerce transaction data, picture positioning data, and unstructured or two-dimensional-code pixel data such as business intelligence reports, satellite remote sensing data and surveillance video.
Therefore, researching a big data task flow scheduling platform has important cost and operational significance for the task flows that enterprises build internally for real-time or offline synchronization, processing, cleaning, governance, circulation and persistence of large data volumes.
However, most task scheduling systems in the prior art are invasive, that is, they depend on a framework; if that framework is removed or replaced by another, the scheduling system must be reworked. Some scheduling systems are non-invasive, but their mechanisms are not designed to bear the rate of data change in the big data age, and they lack the flexibility, customizability and independence required by the high-speed development of enterprises. Existing systems and techniques therefore fail to address the sudden increase in enterprise data volumes in the current big data context.
In view of this, it is highly significant to propose a big data DAG task flow scheduling method, system and storage medium.
Disclosure of Invention
In order to solve the problem that existing systems and techniques cannot cope with the sudden increase of enterprise data volume in the current big data context, the invention provides a big data DAG task flow scheduling method, system and storage medium to remedy this technical defect.
In a first aspect, the present invention provides a big data DAG task flow scheduling method, which specifically includes:
performing visual scheduling configuration of the DAG flow based on Nginx;
an external system accesses the scheduling platform through the cross-platform interface scheduler-api, which transmits the instruction to the scheduling center scheduler-master for further processing;
the scheduling center scheduler-master responds with a preset scheduling algorithm to complete DAG flow splitting, task storage, result callback and execution notification;
the execution center scheduler-worker loads the required task engine, executes the task, tracks and records its progress, and stores the records.
Preferably, the scheduling algorithm is implemented by the following formula:
free = freeThread > 1 && cpu < threshold && mem < threshold && cpuLoad < threshold && systemLoad < threshold
wherein free represents the final determination of whether the node is idle and schedulable, freeThread represents the number of currently idle system threads, cpu represents the cpu utilization, threshold represents a configurable threshold, mem represents the system memory occupancy, cpuLoad represents the system cpu load level, and systemLoad represents the overall system load level.
Further preferably, the scheduling algorithm specifically includes:
when freeThread is greater than 1, cpu is less than threshold, mem is less than threshold, cpuLoad is less than threshold and systemLoad is less than threshold, free is true: the node is idle and can accept task scheduling;
the cpuLoad acquisition algorithm is:
cpuLoad = update * oldLoad + (1 - update) * newLoad
update = (2^idx - 1) / 2^idx, idx = 0,1,2,3,4
where oldLoad represents the old load and newLoad represents the newly sampled load;
the system load acquisition algorithm is:
load = (old * EXP_1 + new * (FIXED_1 - EXP_1)) >> 11
where old is the old load, EXP_1 represents the 1/exp(5sec/1min) fixed point, FIXED_1 represents the 1<<11 fixed point, and new is the newly calculated load.
Preferably, each task engine is provided with standard information comprising an engine name, an input standard and an output standard, and the task engines associate information with one another according to this standard information.
Further preferably, each task engine has a hot loading function, which specifically includes:
when a user uploads a task engine to a worker, the worker performs a consistency check on the uploaded task engine;
if the uploaded task engine is unchanged compared with the old task engine oldEngine, it is not processed; if it has changed, the worker marks the old engine oldEngine as deleted and points the pointer to the new engine newEngine, so that the next task is guaranteed to use the new engine;
an old engine that is still executing changes from marked-deleted to physically deleted after all of its tasks are completed.
Preferably, the scheduling platform first completes loading of the pre-designed communication protocol.
In a second aspect, an embodiment of the present invention further proposes a big data DAG task flow scheduling system, which specifically includes:
the Nginx visual scheduling configuration module, used for performing visual scheduling configuration of the DAG flow based on Nginx;
the scheduler-api scheduling unified entry module, used for connecting external systems to the scheduling platform and transmitting instructions to the scheduling center scheduler-master for further processing;
the scheduler-master scheduling center module, used for responding with a preset scheduling algorithm to complete DAG flow splitting, task storage, result callback and execution notification;
and the scheduler-worker execution center module, used for loading the required task engine, executing the task, tracking and recording its progress, and storing the records.
Further preferably, the system further comprises:
the communication protocol module, used for designing and loading the communication protocol required by the scheduling platform;
the scheduling storage module, used for storing the data of the whole scheduling platform execution process, including service data, cache data and engine resource medium data;
and the engine management module, used for completing the communication association between task engines according to the engine standard, and for executing engine detection, engine replacement and engine redirection tasks.
In a third aspect, an embodiment of the present invention provides an electronic device, including: one or more processors; and storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
(1) The technical scheme designs and implements a big data DAG task flow scheduling platform. Through the design of each link, high performance and high scalability are pushed to the extreme. With the development of the Internet, it effectively solves the difficult problems in modern enterprise digitalization of integrating old and new data, processing very large data volumes, and coping with continuous expansion. By constructing an all-round scheduling platform along multiple dimensions, it improves enterprise efficiency, saves a large amount of manpower, and reduces the investment enterprises must make in maintaining data in the era of data expansion.
(2) By virtue of its all-round, highly scalable, high-performance architecture design, the big data DAG task flow scheduling platform provided by the invention is sufficient for the task flows that enterprises build internally for real-time/offline synchronization, processing and the like of large data volumes, and the overall data processing and task scheduling capacity is greatly improved.
(3) The invention provides a big data DAG task flow scheduling platform scheme, mainly divided into user-side flow task configuration, big-data-side DAG flow task scheduling, task progress, task log tracing and other functions. It adopts a B/S front-end/back-end separated, decentralized architecture mode, supports task scheduling over extra-large data by means of cluster load balancing, and supports dynamic extension of task engines. It resolves the very complex task dependency relationships in big data processing and secures high-performance scheduling and stability in offline and real-time extra-large task flows. In addition, an alarm and message notification mechanism is implemented, so that processing losses can be reported in time when a task is abnormal.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the invention. Many of the intended advantages of the embodiments will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
FIG. 1 is an exemplary device framework diagram to which an embodiment of the present invention may be applied;
FIG. 2 is a flow chart of a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 3 is a framework diagram of the big data DAG task flow scheduling platform in the big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 4 is a communication protocol byte segmentation diagram in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating the engine communication framework in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 6 is a flow chart of an engine hot load mechanism in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 7 is a flow chart of a resource medium in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIGS. 8-1 and 8-2 are a DAG task flow chart and a DAG task operation flow chart, respectively, in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 9 is a flow chart of a callback plugin in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 10 is a flow chart of a callback procedure in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 11 is a signal triggering process flow in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a DAG task flow drag configuration in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 13-1 is a python task configuration diagram in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 13-2 is a schematic diagram illustrating task flow execution in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIGS. 14-1 and 14-2 are schematic diagrams illustrating task engine execution in a big data DAG task flow scheduling method according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of a big data DAG task flow scheduling system according to an embodiment of the present invention;
FIG. 16 is a schematic structural view of a computer device suitable for use in an electronic apparatus for implementing an embodiment of the present invention.
Detailed Description
In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which specific embodiments in which the invention may be practiced are shown by way of illustration. In this regard, directional terms such as "top", "bottom", "left", "right", "upper", "lower", and the like are used with reference to the orientation of the figures described. Because components of embodiments can be positioned in a number of different orientations, the directional terminology is used for purposes of illustration and is in no way limiting. It is to be understood that other embodiments may be utilized or logical changes may be made without departing from the scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 1 illustrates an exemplary system architecture 100 for a method of processing information or an apparatus for processing information to which embodiments of the present invention may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices with communication capabilities including, but not limited to, smartphones, tablet computers, laptop and desktop computers, and the like.
The server 105 may be a server that provides various services, such as a background information processing server that processes verification request information transmitted by the terminal devices 101, 102, 103. The background information processing server may analyze the received verification request information and obtain a processing result (for example, verification success information for characterizing that the verification request is a legal request).
It should be noted that, the method for processing information provided by the embodiment of the present invention is generally performed by the server 105, and accordingly, the device for processing information is generally disposed in the server 105. In addition, the method for transmitting information provided by the embodiment of the present invention is generally performed by the terminal devices 101, 102, 103, and accordingly, the means for transmitting information is generally provided in the terminal devices 101, 102, 103.
The server may be hardware or software. When the server is hardware, the server may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server. When the server is software, it may be implemented as a plurality of software or software modules (for example, to provide a distributed service), or may be implemented as a single software or a plurality of software modules, which are not specifically limited herein.
In the present era of data informatization and data explosion, the data structures of different enterprises differ, and the data produced in the large-scale business operation stage are mostly traditional structured data. These are essentially data from traditional pillar industries such as trade, commerce, logistics, insurance and stocks, which carry very high privacy and security requirements. Most data types of the Internet age, by contrast, are unstructured: social network data, e-commerce transaction data, picture positioning data, and unstructured or two-dimensional-code pixel data such as business intelligence reports, satellite remote sensing data and surveillance video.
Therefore, researching a big data task flow scheduling platform has important cost and operational significance for the task flows that enterprises build internally for real-time/offline synchronization, processing, cleaning, governance, circulation and persistence of large data volumes.
FIG. 2 shows a big data DAG task flow scheduling method disclosed by an embodiment of the present invention. As shown in FIG. 2, the method specifically includes:
S101, performing visual scheduling configuration of the DAG flow based on Nginx;
S102, an external system accesses the scheduling platform through the cross-platform interface scheduler-api, which transmits the instruction to the scheduling center scheduler-master for further processing;
S103, the scheduling center scheduler-master responds with a preset scheduling algorithm to complete DAG flow splitting, task storage, result callback and execution notification;
S104, the execution center scheduler-worker loads the required task engine, executes the task, tracks and records its progress, and stores the records.
Further, the following details the big data DAG task flow scheduling platform scheme proposed by the present invention, i.e., the technical research and application of the big data DAG task flow scheduling platform. The overall architecture design is shown in FIG. 3.
The whole architecture comprises eight parts: nginx, scheduler-api, scheduler-master, scheduler-worker, the task engines, the relational library, the cache library and the resource library.
Each part acts as follows:
nginx: mainly serves the user-facing scheduling web pages, providing user page operations such as draggable flows and task execution tracking.
scheduler-api: mainly provides the cross-platform interface through which external systems access the scheduling platform, and passes instructions into the scheduler-master for processing.
scheduler-master: provides the scheduling algorithm, and performs DAG flow splitting, task storage, result callback, execution notification and the like.
scheduler-worker: responsible for task execution, progress tracking, stopping, log recording, output storage, engine hot updating, engine management and other operations; the execution process of a task engine is completely recorded so that the user pages can display the results.
Task engines: composed of task engines of different types, which communicate and associate with one another through the engine standard and transmit signals through the signal mechanism.
Relational library, cache library and resource library: store the data of the execution process of the whole scheduling platform, including business data, cache data and engine resource medium data.
The framework fully considers big data scenarios in its design: a decentralized architecture is used to build the whole scheduling cluster, and the whole task flow system is built on the DAG (directed acyclic graph). Its core is to simplify the real-time/offline big data processing flows of an enterprise, with all component modules cooperating to serve the business. The key technical points used are discussed below.
First, a communication protocol needs to be designed. The traditional http protocol does not meet the requirements; the present invention requires an efficient and stable communication protocol that solves the various problems of packet loss, packet sticking, disconnection and reconnection, message retransmission and the like in communication. A specific communication protocol is therefore designed as follows. The protocol header is 15 bytes, laid out as:
proto flag: the first 4 bytes are the protocol identification, which should be unique.
Real body size: the next 4 bytes represent the real data size, such as: size before data compression, size before encryption, etc
Body size: the next 4 bytes represent this data size
Encrypt flag: next 1 byte represents the encrypted identification, the encryption/compression algorithm used for the identification data
Retain flag: next 1 byte is reserved field, and the reserved post-expansion protocol needs
Version: the next 1 byte identifies the protocol version number
Body: all the following bytes are communication data, note: possibly containing the data of the next packet
This protocol secures the stability and extensibility of transmission: the proto flag ensures the protocol is not tampered with, the real body size and body size fields support unpacking, sticky-packet handling, compression and encryption, and the encrypt flag can support custom protocol encryption algorithms and the like. The overall communication protocol is shown in FIG. 4.
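As a concrete illustration, the following Java sketch frames and de-frames messages with this 15-byte header. Only the layout (4 + 4 + 4 + 1 + 1 + 1 bytes followed by the body) is taken from the description above; the magic value, class name and big-endian encoding are illustrative assumptions.

```java
import java.nio.ByteBuffer;

public final class ProtocolCodec {
    static final int HEADER_LEN = 15;
    static final int PROTO_FLAG = 0x53434845; // assumed magic; the text only requires uniqueness

    // Frame one message: 15-byte header followed by the on-wire body.
    static ByteBuffer encode(byte[] body, int realBodySize, byte encryptFlag, byte version) {
        ByteBuffer buf = ByteBuffer.allocate(HEADER_LEN + body.length);
        buf.putInt(PROTO_FLAG);    // proto flag: protocol identification
        buf.putInt(realBodySize);  // real body size: size before compression/encryption
        buf.putInt(body.length);   // body size: size of the data actually sent
        buf.put(encryptFlag);      // encrypt flag: which encryption/compression algorithm
        buf.put((byte) 0);         // retain flag: reserved for protocol extension
        buf.put(version);          // version: protocol version number
        buf.put(body);             // body: communication data
        buf.flip();
        return buf;
    }

    // De-frame: returns null until a complete frame has arrived, which is how the
    // body-size field resolves partial and sticky packets on a byte stream.
    static byte[] decode(ByteBuffer in) {
        if (in.remaining() < HEADER_LEN) return null;
        in.mark();
        if (in.getInt() != PROTO_FLAG) throw new IllegalStateException("tampered proto flag");
        int realBodySize = in.getInt();
        int bodySize     = in.getInt();
        byte encryptFlag = in.get();
        byte retainFlag  = in.get();
        byte version     = in.get();
        if (in.remaining() < bodySize) { in.reset(); return null; } // wait for more bytes
        byte[] body = new byte[bodySize];
        in.get(body);              // any bytes left in `in` belong to the next packet
        return body;               // caller decrypts/decompresses per encryptFlag to realBodySize
    }
}
```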
Further, the task engine is designed. First, the necessity and importance of the task engine in the whole framework should be explained: a task engine is a unit, and may be composed of different modules, of different programs, and even of different development languages.
As shown in FIG. 5, each task engine has an external handshake mechanism (input and output). To achieve this, a task engine standard must be defined. Each task engine must be provided with standard information such as an engine name, an input standard and an output standard; only then can the handshake mechanism be established and the engines gain the capabilities of information exchange and information analysis. The definition of the standard is explained taking an sql engine written in the java language as an example. An sql engine must be able to receive a data source and execute actions against it. Its standard definition is as follows:
Engine name: engine-sql
Engine description: connects according to the data source and executes the corresponding sql statement
Input standard:
the sql statement to be executed
data source configuration information, including database, database driver, database link, user, password, etc.
execution mode: query, non-query
database type, including MYSQL, ORACLE, etc.
Output standard: a non-query statement has no output; a query statement outputs data in the standard form format
Taking the function of importing an excel sheet into a database as an example, the python engine and the sql engine together illustrate the engine communication process. First, the python engine receives an excel file whose content is a table; the python engine parses the table, converts it into the sql engine's standard input format and transmits it to the sql engine, which processes it and writes it into the database.
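The standard above amounts to a small contract that every engine implements, so that a worker holding only the contract can wire any two engines together (e.g., feed the python engine's output into engine-sql's input). A minimal sketch, with illustrative type names not taken from the text:

```java
import java.util.Map;

// Engine standard: name + description + input standard + output standard.
interface TaskEngine {
    String name();          // engine name, e.g. "engine-sql"
    String description();   // engine description
    EngineOutput execute(EngineInput input) throws Exception;
}

// Input standard: for engine-sql this would carry the sql statement, the data
// source configuration, the execution mode (query / non-query) and database type.
record EngineInput(Map<String, Object> fields) {}

// Output standard: empty for non-query statements, tabular data for queries.
record EngineOutput(Map<String, Object> fields) {}
```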
Further, a task engine hot loading mechanism is introduced in this embodiment. After a task engine is developed and put into use, update iterations of that engine may follow. To ensure that online service is not affected and to speed up upgrades, a hot loading mechanism becomes important, and modularization of the task engines, each with a clear responsibility, is very necessary. The design of the hot loading mechanism is discussed below.
As shown in FIG. 6, the user may upload a task engine (a new engine or an iteration of an existing one). On upload, the worker performs a consistency check (md5/hash) on the task engine uploaded by the user. If the uploaded engine is found to be unchanged from the previous engine, it is not processed. If it has changed, the worker marks the old engine (oldEngine) as deleted and points the pointer to the new engine (newEngine), so that the next task uses the new engine; an old engine that is still executing moves from marked-deleted to physically deleted after all of its tasks are completed.
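A minimal sketch of this swap, assuming illustrative class names: the worker keeps an atomic pointer to the current engine, compares digests on upload, and retires the old engine only when its in-flight task count drains to zero.

```java
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class LoadedEngine {
    final String digest;
    private final AtomicInteger runningTasks = new AtomicInteger();
    private volatile boolean markedDeleted;

    LoadedEngine(String digest) { this.digest = digest; }

    void taskStarted()  { runningTasks.incrementAndGet(); }
    void taskFinished() {
        if (runningTasks.decrementAndGet() == 0 && markedDeleted) physicalDelete();
    }
    void markDeleted() {
        markedDeleted = true;
        if (runningTasks.get() == 0) physicalDelete();   // nothing in flight: delete now
    }
    private void physicalDelete() { /* remove the engine's files from disk */ }
}

final class EngineSlot {
    private final AtomicReference<LoadedEngine> current = new AtomicReference<>();

    // Consistency check on upload: an identical digest means "unchanged, ignore".
    void upload(byte[] engineBytes) throws Exception {
        String digest = HexFormat.of()
                .formatHex(MessageDigest.getInstance("MD5").digest(engineBytes));
        LoadedEngine old = current.get();
        if (old != null && old.digest.equals(digest)) return;  // unchanged
        current.set(new LoadedEngine(digest));  // pointer now targets newEngine
        if (old != null) old.markDeleted();     // oldEngine: mark-delete, physical delete deferred
    }

    LoadedEngine engineForNextTask() { return current.get(); }
}
```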
Furthermore, the handshake mechanism alone does not suffice to guarantee the independence and extensibility of the task engine. Consider the following scene: an engine needs to parse file content as its input. If the engine itself has to fetch the file, the process is convoluted and inefficient, because the engine does not know where the file comes from. Someone must help the task engine prepare the "materials" it requires; for this situation, the invention proposes the concept of the resource medium.
When a task engine needs resources, and especially several resources from different places, fetching them itself leaves the engine without extensibility: it becomes bound to the resources, and every later change of a resource forces the engine to be reworked. Hence the resource medium mechanism: it integrates the different resources so that the task engine only needs to know where the resources are and reads them from there in file form, and adding another resource medium later affects no task engine.
Resources can be of various types: databases, files, S3, hdfs and so on. To integrate them, the media carrier concept is needed: different resources are abstracted into different media carriers, and the input and output of each media carrier are governed by standards, which help the scheduling platform prepare in advance the materials a task engine needs.
As shown in FIG. 7, the user uploads resources through the medium entry without caring what type of medium the uploaded resource is; the resources are uploaded uniformly to the scheduling platform, whose resource medium center performs management, classification and other operations. When a task engine needs resources, they are exported to it through the resource medium, so the engine is automatically supplied with its materials; likewise the engine does not care which resource medium is current and simply uses the resources directly.
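The media carrier thus reduces to a narrow interface. A minimal sketch with assumed names, in which every resource, whatever its origin, is exported to the engine as a local file:

```java
import java.io.InputStream;
import java.nio.file.Path;

// Illustrative interface: each resource type (database, file, S3, hdfs, ...) is
// wrapped as a media carrier with a standard input (upload) and output (export).
interface ResourceMedium {
    String type();                                               // e.g. "s3", "hdfs", "database"
    void upload(String resourceId, InputStream data) throws Exception;
    Path exportToLocalFile(String resourceId) throws Exception;  // engines just read this file
}
```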
Specifically, the reason the scheduling system needs a DAG (directed acyclic graph) structure is that tasks are normally isolated, with no correlation between them, which limits the usage scenarios of a task scheduling system: task ordering, task relevance, task flow control and the like cannot be realized. How to establish the relations between tasks is the difficult problem.
As shown in FIG. 8-1, there are 9 tasks in total, each with relevance and order. We next explain how the accuracy of task execution is guaranteed throughout the whole process.
As can be seen from FIG. 8-1, one node represents the task itself; each node may have multiple edges, and each edge has two nodes, a front node and a back node (beforeNode and afterNode).
First, the tasks without dependencies (those whose front node set is empty) are taken and scheduled preferentially; in the figure these are A, G and I. When G and I complete, the back node of their edges, D, is obtained, but since its front node B is not yet complete, D waits for B. When A completes, the back node of its edge, B, is obtained; all of B's front nodes are now complete, so task B starts executing. When B completes, its back nodes C, D and E are obtained; the front nodes of C and D are all complete, so C and D start executing, while E still has the incomplete front node D and therefore waits. When C and D complete, all front nodes of E are complete and E starts executing. When E completes, its back nodes H and F are obtained and executed. The overall execution sequence is shown in FIG. 8-2.
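The dispatch rule just described is the classic indegree-driven (Kahn-style) traversal. The sketch below replays it over the edge set read off the walkthrough above; the class name is illustrative, and a real scheduler would run ready tasks concurrently rather than draining them one by one.

```java
import java.util.*;

final class DagRunner {
    static List<String> executionOrder(Map<String, List<String>> edges,
                                       Collection<String> nodes) {
        Map<String, Integer> indegree = new HashMap<>();
        nodes.forEach(n -> indegree.put(n, 0));
        edges.values().forEach(tos -> tos.forEach(t -> indegree.merge(t, 1, Integer::sum)));

        // Tasks whose front node set is empty are scheduled first (A, G, I).
        Deque<String> ready = new ArrayDeque<>();
        indegree.forEach((n, d) -> { if (d == 0) ready.add(n); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String done = ready.poll();                  // "execute" the task
            order.add(done);
            for (String next : edges.getOrDefault(done, List.of()))
                if (indegree.merge(next, -1, Integer::sum) == 0)
                    ready.add(next);                     // all front nodes finished: schedule it
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
            "A", List.of("B"), "G", List.of("D"), "I", List.of("D"),
            "B", List.of("C", "D", "E"), "C", List.of("E"), "D", List.of("E"),
            "E", List.of("H", "F"));
        System.out.println(executionOrder(edges,
            List.of("A", "B", "C", "D", "E", "F", "G", "H", "I")));
        // e.g. [A, G, I, B, C, D, E, H, F]
    }
}
```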
Further, under the DAG structure, how should tasks be split, and, given a whole cluster, how can tasks be distributed uniformly without causing task pile-up? For this, a scheduling algorithm for the scheduling platform must be implemented.
In this embodiment, the implemented scheduling algorithm is a variant of a dynamic load balancing algorithm, making its decision on several metrics: memory, cpu, task count, thread count, system load and cpu load. Whether scheduling is currently possible is judged by these metrics, and the final formula is:
free = freeThread > 1 && cpu < threshold && mem < threshold && cpuLoad < threshold && systemLoad < threshold
where free is the final determination of whether the node is idle and schedulable, freeThread is the number of currently idle system threads, cpu is the cpu utilization, threshold is a configurable threshold, mem is the system memory occupancy, cpuLoad is the system cpu load level, and systemLoad is the overall system load level.
When freeThread is greater than 1, cpu is less than threshold, mem is less than threshold, cpuLoad is less than threshold and systemLoad is less than threshold, free is true: the node is idle and can accept task scheduling.
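Transcribed directly, the check is a one-liner. The sketch below leaves the metric sources abstract (e.g., /proc or a JMX MXBean) and, like the formula, applies a single configurable threshold to every metric; per-metric thresholds would be a natural refinement.

```java
// Idle/schedulable check, a direct transcription of the free formula above.
final class LoadGate {
    static boolean free(int freeThread, double cpu, double mem,
                        double cpuLoad, double systemLoad, double threshold) {
        return freeThread > 1
            && cpu < threshold
            && mem < threshold
            && cpuLoad < threshold
            && systemLoad < threshold;
    }
}
```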
The cpuLoad acquisition algorithm is:
cpuLoad = update * oldLoad + (1 - update) * newLoad
update = (2^idx - 1) / 2^idx, idx = 0,1,2,3,4
It counts the average time tasks spend runnable on the CPU run queue (rq), sampled over different periods according to idx. Here oldLoad is the old load and newLoad is the newly sampled load.
The system load acquisition algorithm (for 1 minute) is:
load = (old * EXP_1 + new * (FIXED_1 - EXP_1)) >> 11
where old is the old load, EXP_1 is the 1/exp(5sec/1min) fixed point, FIXED_1 is the 1<<11 fixed point, and new is the newly calculated load.
For the 5-minute and 15-minute calculations, EXP_1 in the formula is simply replaced by EXP_5 or EXP_15.
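Both formulas can be transcribed in integer fixed-point form; the sketch below does so, mirroring the Linux kernel's cpu_load decay and load-average macro, from which the definitions above appear to be drawn. The EXP_5/EXP_15 constants are the kernel's values and are an assumption beyond the text.

```java
final class LoadCalc {
    static final long FIXED_1 = 1L << 11;  // 1<<11 fixed point
    static final long EXP_1   = 1884;      // 1/exp(5sec/1min)  in fixed point
    static final long EXP_5   = 2014;      // 1/exp(5sec/5min)  (assumed kernel value)
    static final long EXP_15  = 2037;      // 1/exp(5sec/15min) (assumed kernel value)

    // cpuLoad = update*oldLoad + (1-update)*newLoad with update = (2^idx - 1) / 2^idx,
    // computed in integers; idx = 0,1,2,3,4 selects the sampling period.
    static long cpuLoad(long oldLoad, long newLoad, int idx) {
        long scale = 1L << idx;
        return (oldLoad * (scale - 1) + newLoad) / scale;
    }

    // system load: load = (old*exp + new*(FIXED_1 - exp)) >> 11;
    // pass EXP_1 for the 1-minute average, EXP_5 / EXP_15 for the others.
    static long systemLoad(long old, long newLoad, long exp) {
        return (old * exp + newLoad * (FIXED_1 - exp)) >> 11;
    }
}
```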
Judging across these multiple dimensions gives every machine in the cluster the same load, and when one machine in the cluster fails, tasks do not pile up: they queue and wait, being consumed slowly or waiting for the machine to recover before being scheduled.
Furthermore, the embodiment of the invention provides the callback mechanism of the scheduling system. Generally, after task execution completes, the task initiator must be notified whether the task succeeded or failed. Invasive scheduling realizes this feature easily, by simply calling a callback function defined by the user in the current address space, but that reintroduces the extensibility problem of invasiveness, so designing a new callback scheduling mechanism is very necessary.
The following problems are to be solved: inconsistent callback modes across business systems, callback failure handling, callback performance and the like.
So that the callback mechanism is not limited to one fixed type and is stripped from the business system, a unified callback registration interface is designed: callback modes are registered as plug-ins, further callback modes can be added by extension, and the user need not care about the callback implementation, only registering the callback logic when submitting a task. The design is shown in FIG. 9.
The callback process is shown in FIG. 10. The user first defines the callback logic, then submits the task and waits for the scheduling platform to execute it; on success or failure the callback mechanism fires, with retry on failure. The whole callback process is asynchronous and does not block the business. Retries are limited in number, and once the limit is reached the system waits for a period of time before retrying again.
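A sketch of the plug-in registry with bounded asynchronous retry follows; the plug-in names, the retry count of 3 and the 30-second back-off are illustrative assumptions, not values from the text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

interface CallbackPlugin {
    String type();                                   // e.g. "http", "mq" (illustrative)
    void deliver(String taskId, boolean success) throws Exception;
}

final class CallbackCenter {
    private final Map<String, CallbackPlugin> plugins = new ConcurrentHashMap<>();
    private final ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);

    // Unified registration interface: new callback modes plug in without touching callers.
    void register(CallbackPlugin plugin) { plugins.put(plugin.type(), plugin); }

    // Fired by the platform when a task ends; asynchronous, so the business is not blocked.
    void fire(String type, String taskId, boolean success) {
        CallbackPlugin plugin = plugins.get(type);
        if (plugin != null) attempt(plugin, taskId, success, 3);   // assumed retry limit
    }

    private void attempt(CallbackPlugin plugin, String taskId, boolean ok, int attemptsLeft) {
        pool.execute(() -> {
            try {
                plugin.deliver(taskId, ok);
            } catch (Exception e) {
                if (attemptsLeft > 1)               // wait a period, then retry (30 s assumed)
                    pool.schedule(() -> attempt(plugin, taskId, ok, attemptsLeft - 1),
                                  30, TimeUnit.SECONDS);
            }
        });
    }
}
```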
Furthermore, consider stopping a task during execution. Invasive scheduling implements this very easily, by calling a stop function defined by the user in the current space. Under non-invasive scheduling, the task engine process certainly cannot simply be killed outright: reasonably, some engines must perform release operations when stopped. A communication mechanism between the two processes is therefore required. Using, say, tcp long connections, with a long connection established between the worker and the task engine to receive stop-task messages, is overly complicated and brings a growing connection count and instability. Hence a scheduling system signal mechanism is introduced.
As described above, to solve the communication problem a better communication mechanism between the worker and the task engine is designed: the operating system is known to provide the concept of signals, so the signal mechanism is used to stop tasks.
As shown in FIG. 11, the task engine defines a signal processing function (per the standard) that releases the resources the engine needs to release and then normally performs an end-of-process exit. When the user triggers a stop, the worker sends an interrupt signal; the current process is interrupted into system space, the system kernel functions do_signal() and handle_signal() are called (on linux), and control transfers to the signal processing function in task engine space (on windows, the handler registered via SetConsoleCtrlHandler). The task engine then performs its cleanup, and when the signal processing function finishes it calls sigreturn() to complete the post-processing in system space and return to task engine space to continue the logic that was interrupted.
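A minimal engine-side sketch of such a handler, using the JDK's sun.misc.Signal API as a stand-in (a C engine would install the handler with sigaction on linux or SetConsoleCtrlHandler on windows); the cleanup body is a placeholder.

```java
import sun.misc.Signal;

public final class EngineProcess {
    public static void main(String[] args) throws InterruptedException {
        // Signal processing function defined by the engine, per the standard:
        // release resources, then perform a normal end-of-process exit.
        Signal.handle(new Signal("INT"), sig -> {   // worker sends the interrupt signal
            releaseResources();
            System.exit(0);
        });
        while (true) Thread.sleep(1000);            // the engine's actual work loop
    }

    static void releaseResources() { /* close files, connections, temporary data ... */ }
}
```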
The big data DAG task flow scheduling platform scheme above has been implemented in code. It is mainly divided into user-side flow task configuration, big-data-side DAG flow task scheduling, task progress, task log tracing and other functions; it adopts a B/S front-end/back-end separated, decentralized architecture mode, supports task scheduling over extra-large data by means of cluster load balancing, and supports dynamic extension of task engines. It resolves the very complex task dependency relationships in big data processing and secures high-performance scheduling and stability in offline and real-time extra-large task flows. In addition, an alarm and message notification mechanism is implemented, so that processing losses can be reported in time when a task is abnormal.
As an example DAG flow configuration, we implemented the python task engine and constructed a flow from the FIG. 8-1 DAG structure described above, with dependencies between tasks, as shown in FIG. 12.
As shown in FIG. 13-1, each task is configured to output its own task name; for example, "task A" outputs "task A" downstream.
The execution of the tasks is shown in FIG. 13-2; by the "start execution time" the order is seen to be the same as in FIG. 8-2.
Next we single out the two tasks with the more complex dependencies, task D and task E, to verify that their dependencies and the passing of upstream task outputs are correct, as shown in FIGS. 14-1 and 14-2.
From these it can be seen that the upstream tasks of task D are G, I and B, and the upstream tasks of task E are C, D and B; the results are shown correctly in the logs of the figures above.
This example of the big data DAG flow scheduling operation shows that flow configuration uses a greatly simplified drag-and-drop mode, tasks execute in the proper order, the log of the whole execution process is recorded, and each task engine has its own input/output standard and can obtain the output of its upstream tasks. The whole platform is very user-friendly: tasks run with the expected results after nothing more than interface configuration.
In summary, by virtue of its all-round, highly scalable, high-performance architecture design, the big data DAG task flow scheduling platform designed in this research is sufficient for the real-time/offline synchronization, processing and other task flows that enterprises build internally for large data volumes, and the overall data processing and task scheduling capacity is greatly improved.
The technical scheme designs and implements a big data DAG task flow scheduling platform. Through the design of each link, high performance and high scalability are achieved; with the development of the Internet, it effectively solves the difficult problems in modern enterprise digitalization of processing and integrating old and new data (usually extra-large volumes) under continuous expansion. By constructing an all-round scheduling platform along multiple dimensions, it improves enterprise efficiency, saves a large amount of manpower, and reduces the investment enterprises must make in maintaining data in the era of data expansion.
In a second aspect, the embodiment of the present invention further discloses a big data DAG task flow scheduling system, as shown in fig. 15, where the system specifically includes: the Nginx visual schedule configuration module 151, a scheduler-api schedule unified entry module 152, a scheduler-master scheduling hub module 153, a scheduler-worker execution hub module 154, a communication protocol module 155, a schedule storage module 156, and an engine management module 157.
Specifically, the Nginx visual scheduling configuration module 151 is configured to perform visual scheduling configuration of the DAG flow based on Nginx; the scheduler-api scheduling unified entry module 152 is configured to connect external systems to the scheduling platform and transmit instructions to the scheduling center scheduler-master for further processing; the scheduler-master scheduling center module 153 is configured to respond with a preset scheduling algorithm to complete DAG flow splitting, task storage, result callback and execution notification; the scheduler-worker execution center module 154 is configured to load the required task engine, execute the task, track and record its progress, and store the records; the communication protocol module 155 is configured to design and load the communication protocol required by the scheduling platform; the scheduling storage module 156 is configured to store the data of the whole scheduling platform execution process, including service data, cache data and engine resource medium data; and the engine management module 157 is configured to complete the communication association between task engines according to the engine standard, and to execute engine detection, engine replacement and engine redirection tasks.
Referring now to FIG. 16, there is illustrated a schematic diagram of a computer apparatus 1600 suitable for use in an electronic device (e.g., the server or terminal device illustrated in FIG. 1) for implementing an embodiment of the present invention. The electronic device shown in fig. 16 is merely an example, and should not impose any limitation on the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 16, the computer device 1600 includes a Central Processing Unit (CPU) 1601 and a Graphics Processor (GPU) 1602, which can perform various appropriate actions and processes according to programs stored in a Read Only Memory (ROM) 1603 or programs loaded from a storage portion 1609 into a Random Access Memory (RAM) 1604. The RAM 1604 also stores the various programs and data required for the operation of the device 1600. The CPU 1601, GPU 1602, ROM 1603, and RAM 1604 are connected to each other by a bus 1605. An input/output (I/O) interface 1606 is also connected to the bus 1605.
The following components are connected to I/O interface 1606: an input portion 1607 including a keyboard, a mouse, and the like; an output portion 1608 including a speaker, a Liquid Crystal Display (LCD), and the like; a storage portion 1609 including a hard disk or the like; and a communication section 1610 including a network interface card such as a LAN card, a modem, or the like. The communication section 1610 performs communication processing via a network such as the internet. The drive 1611 may also be connected to the I/O interface 1606 as needed. A removable medium 1612 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1611 as necessary, so that a computer program read therefrom is mounted into the storage section 1609 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communications portion 1610, and/or installed from a removable medium 1612. The above-described functions defined in the method of the present invention are performed when the computer program is executed by a Central Processing Unit (CPU) 1601 and a Graphics Processor (GPU) 1602.
It should be noted that the computer readable medium according to the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor apparatus, device, or means, or a combination of any of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution apparatus, device, or system. In the present invention, a computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution apparatus, device, or system. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based devices which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method steps described in the first aspect of the invention.
The above description is only illustrative of the preferred embodiments of the present invention and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the invention is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example solutions in which the above features are replaced with technical features of similar function disclosed in (but not limited to) the present invention.

Claims (10)

1. A big data DAG task flow scheduling method, characterized by specifically comprising:
performing visual scheduling configuration of the DAG flow based on Nginx;
an external system accesses the scheduling platform through the cross-platform interface scheduler-api, which transmits the instruction to the scheduling center scheduler-master for further processing;
the scheduling center scheduler-master responds with a preset scheduling algorithm to complete DAG flow splitting, task storage, result callback and execution notification;
the execution center scheduler-worker loads the required task engine, executes the task, tracks and records its progress, and stores the records.
2. The big data DAG task flow scheduling method of claim 1, wherein the scheduling algorithm is implemented by the formula:
free = freeThread > 1 && cpu < threshold && mem < threshold && cpuLoad < threshold && systemLoad < threshold
wherein free represents the final determination of whether the node is idle and schedulable, freeThread represents the number of currently idle system threads, cpu represents the cpu utilization, threshold represents a configurable threshold, mem represents the system memory occupancy, cpuLoad represents the system cpu load level, and systemLoad represents the overall system load level.
3. The big data DAG task flow scheduling method according to claim 2, wherein the scheduling algorithm specifically comprises:
when freeThread is greater than 1, cpu is less than threshold, mem is less than threshold, cpuLoad is less than threshold and systemLoad is less than threshold, free is true: the node is idle and can accept task scheduling;
the cpuLoad acquisition algorithm is:
cpuLoad = update * oldLoad + (1 - update) * newLoad
update = (2^idx - 1) / 2^idx, idx = 0,1,2,3,4
where oldLoad represents the old load and newLoad represents the newly sampled load;
the system load acquisition algorithm is:
load = (old * EXP_1 + new * (FIXED_1 - EXP_1)) >> 11
where old is the old load, EXP_1 represents the 1/exp(5sec/1min) fixed point, FIXED_1 represents the 1<<11 fixed point, and new is the newly calculated load.
4. The big data DAG task flow scheduling method of claim 1, wherein each task engine is provided with standard information comprising an engine name, an input standard and an output standard, and the task engines associate information with one another according to this standard information.
5. The big data DAG task flow scheduling method of claim 4, wherein each task engine has a hot loading function, specifically comprising:
when a user uploads a task engine to a worker, the worker performs a consistency check on the uploaded task engine;
if the uploaded task engine is unchanged compared with the old task engine oldEngine, it is not processed; if it has changed, the worker marks the old engine oldEngine as deleted and points the pointer to the new engine newEngine, so that the next task is guaranteed to use the new engine;
an old engine that is still executing changes from marked-deleted to physically deleted after all of its tasks are completed.
6. The big data DAG task flow scheduling method of claim 1, further comprising: the scheduling platform first completes loading of the pre-designed communication protocol.
7. A big data DAG task flow scheduling system, characterized by specifically comprising:
the Nginx visual scheduling configuration module, used for performing visual scheduling configuration of the DAG flow based on Nginx;
the scheduler-api scheduling unified entry module, used for connecting external systems to the scheduling platform and transmitting instructions to the scheduling center scheduler-master for further processing;
the scheduler-master scheduling center module, used for responding with a preset scheduling algorithm to complete DAG flow splitting, task storage, result callback and execution notification;
and the scheduler-worker execution center module, used for loading the required task engine, executing the task, tracking and recording its progress, and storing the records.
8. The big data DAG task flow scheduling system of claim 7, further comprising:
the communication protocol module, used for designing and loading the communication protocol required by the scheduling platform;
the scheduling storage module, used for storing the data of the whole scheduling platform execution process, including service data, cache data and engine resource medium data;
and the engine management module, used for completing the communication association between task engines according to the engine standard, and for executing engine detection, engine replacement and engine redirection tasks.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1 to 6.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202211562460.4A 2022-12-07 2022-12-07 Big data DAG task flow scheduling method, system and storage medium Pending CN116069462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211562460.4A CN116069462A (en) 2022-12-07 2022-12-07 Big data DAG task flow scheduling method, system and storage medium

Publications (1)

Publication Number Publication Date
CN116069462A true CN116069462A (en) 2023-05-05

Family

ID=86169098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211562460.4A Pending CN116069462A (en) 2022-12-07 2022-12-07 Big data DAG task flow scheduling method, system and storage medium

Country Status (1)

Country Link
CN (1) CN116069462A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117610320A (en) * 2024-01-23 2024-02-27 中国人民解放军国防科技大学 Directed acyclic graph workflow engine cyclic scheduling method, device and equipment
CN117610320B (en) * 2024-01-23 2024-04-02 中国人民解放军国防科技大学 Directed acyclic graph workflow engine cyclic scheduling method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination