CN109684053B - Task scheduling method and system for big data

Info

Publication number: CN109684053B
Application number: CN201811308063.8A
Authority: CN
Inventors: 方秋水; 刘强; 何建兵; 陈卫国; 吴金成; 罗鸣鸣; 冷梦甜
Original assignee: Guangdong Lingnanpass Co ltd
Current assignee: Guangdong Lingnanpass Co ltd
Priority date: 2018-11-05
Filing date: 2018-11-05
Publication date: 2023-08-01
Anticipated expiration: 2038-11-05
Also published as: CN109684053A

Abstract

The invention discloses a big data task scheduling method comprising the following steps: creating a task template according to the task type; selecting a task template and a task scheduling service number to create a task and form a configuration file, wherein the created task comprises a task name, task content and a task execution period, the task content is configured as key-value (kv) pairs, and the dependency relationships between tasks are established in a kv file format; and reading the task, generating a task instance, and acquiring task execution process information. The invention also discloses a big data task scheduling system. By combining the distinctive characteristics of scheduling, distribution and execution in big data scenarios with those of different data types, the invention emphasizes type separation, multiple triggers, policy-based scheduling, data-lineage dependencies and similar features, builds an internal ecosystem for big data, and manages big data task scheduling under different policies according to task type.

Description

Task scheduling method and system for big data

Technical Field

The invention relates to the technical field of big data task scheduling management, in particular to a big data task scheduling method and a big data task scheduling system.

Background

In business applications of big data, as business indicators are iterated and become more complex, managing the related big data applications becomes a headache: problems such as dependency-based job scheduling, monitoring of task execution and detection of anomalies complicate daily work.

In big data analysis systems, some scripts or execution units must be started at a specific time, and others only after certain conditions are met. This is difficult to achieve by manual effort alone. Some systems do provide configuration for timed tasks, but that configuration is troublesome to manage, and in some cases it must intrude into the system of the executing machine, which introduces considerable hidden risk.

Disclosure of Invention

To overcome the defects of the prior art, the first purpose of the invention is to provide a big data task scheduling method which, by combining the distinctive characteristics of scheduling, distribution and execution in big data scenarios with those of different data types, emphasizes type separation, multiple triggers, policy-based scheduling, data-lineage dependencies and similar features, builds an internal ecosystem for big data, and manages big data task scheduling under different policies according to task type.

The second purpose of the invention is to provide a big data task scheduling system that better solves the problem of manual configuration management and that likewise combines the distinctive characteristics of scheduling, distribution and execution in big data scenarios with those of different data types, emphasizes type separation, multiple triggers, policy-based scheduling, data-lineage dependencies and similar features, builds an internal ecosystem for big data, and manages big data task scheduling under different policies according to task type.

One of the purposes of the invention is realized by adopting the following technical scheme:

a task scheduling method of big data comprises the following steps:

creating a task template according to the task type;

selecting a task template and a task scheduling service number to create a task, forming a configuration file, wherein the created task comprises a task name, task content and a task execution period, the task content is configured in a kv value mode, and a dependency relationship between tasks is established in a kv file format;

and reading the task, generating a task instance, and acquiring task execution process information.

Further, the creating a task template according to the task type includes:

setting a template name, or/and automatically generating a template ID;

generating template data items, and inputting corresponding data item values and attributes for each template data item according to the task type.

Further, the selecting a task template and creating a task by a task scheduling service number includes:

creating a task name, and selecting a scheduling type and a task execution period;

creating a task, selecting a task template according to the task type, and configuring task content in a kv value mode; and establishing a dependency relationship between tasks through a kv file format.

Further, the reading the task, generating a task instance, and obtaining task execution process information includes:

generating a task execution list according to the task instance;

and monitoring the task execution list, and executing the task when the task meets the trigger condition.

Further, the generating a task execution list according to the task instance includes:

reading the configuration file, and obtaining the interval time of task inspection and the task generation time range;

finding all tasks whose state is "to be checked" and for which the interval between the next execution time and the current time lies within the task generation time range, together with their task execution times, wherein a task to be checked is a task for which no task execution list has yet been generated;

generating a task execution list according to the task name, the task content and the task execution period of the task to be checked, wherein the task execution list comprises the task name, the task execution time and the task priority;

updating the state of the task to be checked to "task job generated";

the monitoring of the task execution list, and executing the task when the task meets the trigger condition, includes:

detecting a task to be checked according to the interval time of task checking to generate a task execution list;

and circularly checking task execution time in the task execution list at preset intervals, and if the current time meets the task execution time, performing:

creating a task execution sub-thread, calling different classes according to task types of task contents, and reading the task contents; decomposing parameters of task content to generate a task instance, and executing a target task according to the task instance; the target task is a task of which the current time meets the task execution time.

The second purpose of the invention is realized by adopting the following technical scheme:

a big data task scheduling system, comprising:

the task template creation module is used for creating a task template according to the task type;

the task scheduling management module is used for selecting a task template and a task scheduling service number to create a task to form a configuration file, wherein the created task comprises a task name, task content and a task execution period, the task content is configured in a kv value mode, and a dependency relationship between tasks is established in a kv file format;

and the task execution module is used for reading the task, generating a task instance and acquiring task execution process information.

Further, the task template creation module includes:

a setting unit for setting a template name, or/and automatically generating a template ID;

the first generation unit is used for generating template data items, and inputting corresponding data item values and attributes for each template data item according to the task type.

Further, the task scheduling management module includes:

the first creating unit is used for creating a task name, selecting a scheduling type and a task execution period;

the second creating unit is used for creating a task, selecting a task template according to the task type and configuring task content in a kv value mode; and establishing a dependency relationship between tasks through a kv file format.

Further, the task execution module includes:

the second generating unit is used for generating a task execution list according to the task instance;

and the triggering unit is used for monitoring the task execution list, and executing the task when the task meets the triggering condition.

Further, the second generating unit includes:

the reading subunit is used for reading the configuration file and acquiring the interval time of task inspection and the task generation time range;

the detection subunit is used for finding all tasks whose state is "to be checked" and for which the interval between the next execution time and the current time lies within the task generation time range, together with their task execution times, wherein a task to be checked is a task for which no task execution list has yet been generated;

the first generation subunit is used for generating a task execution list according to the task name, the task content and the task execution period of the task to be checked, wherein the task execution list comprises the task name, the task execution time and the task priority;

the updating subunit is used for updating the state of the task to be checked to "task job generated";

the trigger unit includes:

the second generation subunit is used for detecting the task to be checked according to the interval time of task checking so as to generate a task execution list;

a judging subunit, configured to cyclically check the task execution time in the task execution list at preset intervals, and if the current time meets the task execution time, then:

the execution sub-unit is used for creating a task execution sub-thread, calling different classes according to the task type of the task content and reading the task content; decomposing parameters of task content to generate a task instance, and executing a target task according to the task instance; the target task is a task of which the current time meets the task execution time.

Compared with the prior art, the invention has the beneficial effects that:

1. Timed task-plan triggering: flexible trigger points (daily, weekly, hourly, etc.) are set according to task type, and computation tasks are decomposed by time period and executed in parallel as far as possible, which shortens execution time and widens the overall time window available for executing tasks.

2. Flexible dependencies between tasks: any task can serve as the parent of another task for dependency triggering; task executions can depend on one another, and if a predecessor task fails, the tasks that depend on it are not executed.

3. Flexible and varied alarm rules: task failures raise timely, effective alarms, which eases the work of operations and maintenance personnel. Besides failure alarms, rules such as a timeout for a task that has not finished and a timeout for a task that has not started are supported.

4. A complete, easy-to-use Web user interface: users configure, submit, query and monitor tasks and their dependencies through it.

5. Complete log records: the standard output and standard error produced while a task runs are collected and recorded and made available over HTTP, so a user can conveniently read a task's run log by visiting the log URL corresponding to that task.
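The three alarm rules named in item 3 (failure, not finished in time, not started in time) can be illustrated with a minimal Python sketch. The record fields, the function name and the timeout parameters are all hypothetical; the patent does not specify how the rules are evaluated.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RunRecord:
    """Hypothetical record of one scheduled run of a task."""
    name: str
    scheduled_at: datetime
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    failed: bool = False


def alarms(record: RunRecord, now: datetime,
           start_timeout: timedelta, run_timeout: timedelta) -> list[str]:
    """Return the alarm rules that currently fire for this run."""
    fired = []
    if record.failed:
        fired.append("task failed")
    # Rule: the task was scheduled but has not started within the timeout.
    if record.started_at is None and now - record.scheduled_at > start_timeout:
        fired.append("task not started in time")
    # Rule: the task started but is still running after the timeout.
    still_running = (record.started_at is not None
                     and record.finished_at is None
                     and not record.failed)
    if still_running and now - record.started_at > run_timeout:
        fired.append("task not finished in time")
    return fired


t0 = datetime(2018, 11, 5, 8, 0, 0)
limits = dict(start_timeout=timedelta(minutes=5), run_timeout=timedelta(minutes=30))
never_started = alarms(RunRecord("clean", t0), t0 + timedelta(minutes=6), **limits)
running_long = alarms(RunRecord("analyse", t0, started_at=t0),
                      t0 + timedelta(minutes=31), **limits)
```

In this sketch `never_started` holds the not-started alarm and `running_long` holds the not-finished alarm; delivering the alarm to operations staff is left out.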

Drawings

FIG. 1 is a flow chart of a task scheduling method of big data according to the present invention;

FIG. 2 is a flow chart of creating a task template in accordance with the present invention;

FIG. 3 is a flow chart of the present invention for scheduling task management;

FIG. 4 is a flow chart of task execution of the present invention;

FIG. 5 is a block diagram of a big data task scheduling system according to the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings and detailed description, wherein it is to be understood that, on the premise of no conflict, the following embodiments or technical features may be arbitrarily combined to form new embodiments.

According to the big data task scheduling method, a big data task scheduling platform (implemented in software and/or hardware) is constructed, which provides fast, efficient and flexible task scheduling according to the different scheduling characteristics and task-type characteristics of the data acquisition, data cleaning and data analysis stages. Referring to fig. 1, the method includes the following steps:

110. Creating a task template according to the task type.

To execute scheduling tasks more effectively, different task templates are customized according to task type and stored in a database; when a scheduling task is created, a template can be retrieved to create it quickly.

Task templates are created mainly by configuring different parameters for different task types and different scheduling tasks, which removes the need to configure scheduling tasks manually and repeatedly. A task template defines the rules of data processing tasks such as data acquisition, data cleaning and data analysis, sets the parameters of those tasks, and serves as the basis for creating them. The process of creating a task template is shown in fig. 2: different task templates are customized according to task type and stored in a database, from which they can be retrieved for selection when a scheduling task is created. Specifically, a template name is set and/or a template ID is generated automatically; template data items are generated, and the corresponding data item value and attributes are entered for each template data item according to the task type. The template data items, their values and their attributes form the template content, and a template state is also entered. A created task template therefore consists mainly of three entered values (template name, template content and template state) plus an automatically generated template number (template ID).

Each task template is an operator. The operator is the smallest unit in the platform, and each operator carries out one piece of business logic, that is, it maps to one actual operation such as executing a script or calling an interface. The big data task scheduling platform contains operators of various types, which differ in function and applicability.
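As a rough illustration of the template structure described above (a name, a content made of data items with values and attributes, a state, and an automatically generated ID), a minimal Python sketch follows. The class names, field names, the counter-based ID generator and the example path are assumptions made for illustration; the patent does not prescribe a data model.

```python
import itertools
from dataclasses import dataclass, field

# Hypothetical ID source: the patent only says the template ID is generated automatically.
_template_ids = itertools.count(1)


@dataclass
class DataItem:
    name: str        # template data item, e.g. "FilePath"
    value: str       # data item value (a default)
    attribute: str   # data item attribute, e.g. the kind of value expected


@dataclass
class TaskTemplate:
    name: str               # template name, entered by the user
    task_type: str          # e.g. "crawler", "cleaning", "analysis"
    items: list[DataItem]   # data items that make up the template content
    state: str = "enabled"  # template state, entered by the user
    template_id: int = field(default_factory=lambda: next(_template_ids))

    def content(self) -> dict[str, str]:
        """Return the template content as key-value pairs."""
        return {item.name: item.value for item in self.items}


crawler_template = TaskTemplate(
    name="web-crawler",
    task_type="crawler",
    items=[DataItem("FilePath", "/opt/spiders/default.py", "path")],
)
```

A real platform would persist the template to its database rather than keep it in memory; that step is omitted here.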

120. Selecting a task template and a task scheduling service number to create a task and form a configuration file, wherein the created task comprises a task name, task content and a task execution period, the task content is configured as kv pairs, and the dependency relationships between tasks are established in a kv file format.

Different scheduling tasks are created from the task templates; a scheduling task comprises a scheduling name, a scheduling period, a scheduling type and the scheduled task itself. The platform can set the parameters of various tasks according to task type to realize different kinds of task scheduling management. During task configuration the execution servers of a task can be restricted, so that tasks are classified by task object and executed on the designated servers. Task scheduling management mainly sets up different scheduling tasks from different templates for the different stages of data acquisition, data cleaning, data analysis and so on. Each scheduling task contains one or more operators chosen from the operators that already exist. These operators may be used alone or combined in series to accomplish a scheduling task.
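The idea of combining operators in series can be sketched in Python as below. Modelling an operator as a function from a context dictionary to a new context dictionary is an assumption made for this example, and the three toy operators are hypothetical stand-ins for real acquisition, cleaning and analysis steps.

```python
from typing import Callable

# Assumption: an operator takes a context dict and returns an updated one.
Operator = Callable[[dict], dict]


def collect(ctx: dict) -> dict:
    """Stand-in for a data acquisition operator."""
    return {**ctx, "rows": [" a ", "b", ""]}


def clean(ctx: dict) -> dict:
    """Stand-in for a data cleaning operator: trim values and drop empty ones."""
    return {**ctx, "rows": [r.strip() for r in ctx["rows"] if r.strip()]}


def analyse(ctx: dict) -> dict:
    """Stand-in for a data analysis operator: count the cleaned rows."""
    return {**ctx, "count": len(ctx["rows"])}


def run_in_series(operators: list[Operator], ctx: dict) -> dict:
    """Run the operators one after another, feeding each the previous result."""
    for operator in operators:
        ctx = operator(ctx)
    return ctx


result = run_in_series([collect, clean, analyse], {})
```

A scheduling task that uses a single operator is simply the one-element case of the same list.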

Referring to fig. 3, the implementation process is as follows: different scheduling tasks are created from the task templates, and the previously created and initialized task scheduling information table is obtained (before step 110, the scheduling service numbers and service names are initialized; these data identify the service to which a task belongs). The corresponding task template and task scheduling service number are selected, and the task content, task scheduling plan and other information are filled in to create a task plan.

The created task (i.e. the scheduling task) comprises information such as the task name, task content and task execution period. These data are stored in a database, and the task scheduling program in the data acquisition, data cleaning and data analysis modules reads them and executes the task according to the configured content. Functionally, the system supports task scheduling management in the three stages of data acquisition, data cleaning and data analysis, and it is extensible: other scheduling tasks can be configured flexibly according to service requirements. The created tasks are processed using spark-based python scripts, with one or more python files per task. The platform defines a kv file format to establish the dependency relationships between tasks. Any task can serve as the parent of another task for dependency triggering; task executions can depend on one another, and if a predecessor task fails, the tasks that depend on it are not executed. The task content can carry parameters of different types as kv pairs, which keeps configuration flexible. Setting a task execution period means that tasks are scheduled by time period: one-off and periodic tasks can both be configured flexibly in a single step, and the task scheduling program executes them once or periodically according to the configured parameters, so that tasks run automatically.
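The rule that dependents of a failed task are not executed can be sketched as follows. The patent does not give the syntax of its kv dependency file, so the `child = parent1, parent2` line format used here is purely an assumption, as are the function names.

```python
def parse_dependencies(kv_text: str) -> dict[str, set[str]]:
    """Parse lines of the assumed form 'child = parent1, parent2'."""
    deps: dict[str, set[str]] = {}
    for line in kv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        child, _, parents = line.partition("=")
        deps[child.strip()] = {p.strip() for p in parents.split(",") if p.strip()}
    return deps


def blocked_by_failure(deps: dict[str, set[str]], failed: set[str]) -> set[str]:
    """Return the tasks that must not run because a direct or indirect parent failed."""
    blocked: set[str] = set()
    changed = True
    while changed:  # repeat until no further task becomes blocked
        changed = False
        for task, parents in deps.items():
            if task not in blocked and parents & (failed | blocked):
                blocked.add(task)
                changed = True
    return blocked


KV_FILE = """
# child = parents (hypothetical syntax)
clean = collect
analyse = clean
export =
report = analyse, export
"""

deps = parse_dependencies(KV_FILE)
blocked = blocked_by_failure(deps, failed={"collect"})
```

Here the failure of `collect` blocks `clean`, `analyse` and `report`, while `export`, which has no parent, can still run.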

130. Reading the task, generating a task instance, and acquiring task execution process information.

Task scheduling is separated from the task execution scripts to achieve low coupling, so modifying the specific content of a task does not affect task scheduling.

Referring to fig. 4, task execution mainly includes three parts of task generation and monitoring, log processing, and history data processing:

wherein the task generation and monitoring further comprises:

A1. Generating an execution task list according to the different task types.

The system configuration file is read to obtain the task check interval and the task generation time range. All tasks whose state is "to be checked" and for which the interval between the next execution time and the current time lies within the task generation time range are found, together with their job execution times. A task execution schedule is generated, comprising the task number, task execution time, task priority and so on. The task scheduling program then updates the task state in the task basic information table to "task job generated".
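The list generation step can be sketched in Python under several stated assumptions: the configuration key names are invented, a task that is already overdue is treated as outside the time range, and a larger priority number is taken to mean a higher priority. None of these details is specified in the patent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

TO_BE_CHECKED = "to be checked"        # no execution-list entry generated yet
JOB_GENERATED = "task job generated"   # an entry has already been generated

# Hypothetical configuration keys; the patent does not name them.
CONFIG = {"check_interval_seconds": 60, "generation_range_seconds": 300}


@dataclass
class Task:
    number: int
    name: str
    next_run: datetime
    priority: int
    state: str = TO_BE_CHECKED


@dataclass
class ExecutionEntry:
    task_number: int
    task_name: str
    run_at: datetime
    priority: int


def generate_execution_list(tasks: list[Task], now: datetime,
                            config: dict = CONFIG) -> list[ExecutionEntry]:
    """Create entries for tasks to be checked whose next run falls inside the range."""
    window = timedelta(seconds=config["generation_range_seconds"])
    entries = []
    for task in tasks:
        # Assumption: overdue tasks (negative interval) are not picked up here.
        in_window = timedelta(0) <= task.next_run - now <= window
        if task.state == TO_BE_CHECKED and in_window:
            entries.append(ExecutionEntry(task.number, task.name,
                                          task.next_run, task.priority))
            task.state = JOB_GENERATED  # mark so the task is not generated twice
    # Earliest first; among equal times, larger priority number first (assumption).
    return sorted(entries, key=lambda e: (e.run_at, -e.priority))


now = datetime(2018, 11, 5, 8, 0, 0)
tasks = [
    Task(1, "collect", now + timedelta(minutes=2), priority=5),
    Task(2, "clean", now + timedelta(minutes=10), priority=5),
    Task(3, "analyse", now + timedelta(minutes=1), priority=9, state=JOB_GENERATED),
]
entries = generate_execution_list(tasks, now)
```

Only `collect` yields an entry: `clean` lies outside the five-minute range and `analyse` already has a generated job.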

A2. Monitoring the generated planned task list. Task monitoring has two functions: the first runs the task execution list generation function at the task check interval; the second checks every 1 second (configurable) whether a task execution time in the task execution schedule has been reached and, if so, performs "task execution".
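The second monitoring function, polling the execution list at a fixed interval, might look like the sketch below. It covers only the polling loop, not the periodic regeneration of the list. The function names are hypothetical, and the clock and sleep functions are passed in as parameters so the loop can be exercised without real waiting.

```python
import time
from datetime import datetime, timedelta


def pop_due(entries: list[tuple[datetime, str]], now: datetime) -> list[str]:
    """Remove and return the names of entries whose execution time has been reached."""
    due = [name for run_at, name in entries if run_at <= now]
    entries[:] = [(run_at, name) for run_at, name in entries if run_at > now]
    return due


def monitor(entries, execute, clock=datetime.now, sleep=time.sleep,
            poll_seconds: float = 1.0) -> None:
    """Poll until the list is empty, executing each entry once its time arrives."""
    while entries:
        for name in pop_due(entries, clock()):
            execute(name)
        if entries:
            sleep(poll_seconds)  # 1 second by default, configurable


# Demonstration with a simulated clock so that no real time passes.
start = datetime(2018, 11, 5, 8, 0, 0)
simulated_now = [start]


def fake_sleep(seconds: float) -> None:
    simulated_now[0] += timedelta(seconds=seconds)


executed: list[str] = []
pending = [(start + timedelta(seconds=2), "clean"), (start, "collect")]
monitor(pending, executed.append, clock=lambda: simulated_now[0], sleep=fake_sleep)
```

With the simulated clock, `collect` runs on the first poll and `clean` two polls later.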

A3. Executing the scheduling tasks in the planned task list. When a scheduling task meets its trigger condition, the task execution module creates a task execution sub-thread, reads the task content to execute the task, and updates the task information once execution is complete. The task execution module first reads the task content, calls different classes according to the task type given in the task content, decomposes the parameters, generates a task instance and executes the task. After the task has executed, the last run time and next run time in the task basic information table are updated and the task state is set to "task list not generated". The historical data management module is then called to write the task and job execution results to the task execution record table and the job execution record table.
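A minimal Python sketch of the execution step follows: a sub-thread is created, a class is chosen by task type, the kv parameters are decomposed into constructor arguments, and the run times and state are updated afterwards. The class names, the dictionary layout of a task and the type-to-class registry are assumptions for illustration, and the `run` methods only return a description instead of launching anything.

```python
import threading
from datetime import datetime, timedelta


class CrawlerTask:
    """Hypothetical task class for the web crawler task type."""
    def __init__(self, FilePath: str):
        self.script = FilePath

    def run(self) -> str:
        return f"would run crawler script {self.script}"


class SparkTask:
    """Hypothetical task class for spark-based cleaning and analysis tasks."""
    def __init__(self, pyfilepath: str):
        self.script = pyfilepath

    def run(self) -> str:
        return f"would submit {self.script} to Spark"


# Assumption: a registry that maps the task type to the class to call.
TASK_CLASSES = {"crawler": CrawlerTask, "spark": SparkTask}


def execute_task(task: dict, results: dict) -> None:
    """Build a task instance from the kv content, run it, then update the task."""
    task_class = TASK_CLASSES[task["type"]]
    instance = task_class(**task["content"])    # decompose the kv parameters
    results[task["name"]] = instance.run()
    task["last_run"] = task["next_run"]
    task["next_run"] = task["next_run"] + task["period"]
    task["state"] = "task list not generated"   # eligible for list generation again


task = {
    "name": "crawl-news",
    "type": "crawler",
    "content": {"FilePath": "/opt/spiders/news.py"},
    "next_run": datetime(2018, 11, 5, 8, 0, 0),
    "period": timedelta(days=1),
    "state": "task job generated",
}
results: dict[str, str] = {}
worker = threading.Thread(target=execute_task, args=(task, results))
worker.start()
worker.join()
```

Writing to the task and job execution record tables would follow the state update and is not shown.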

History data processing mainly consists of writing task execution records to the task execution history data table and managing the handling of each stage's history files according to the configuration files.

Log processing mainly consists of writing the various logs produced while the system runs, using the log classes in the common code.

The task scheduling management for data acquisition mainly comprises structured data acquisition and network data acquisition.

Structured data collection mainly collects the streaming data in a database. The task scheduling program schedules and executes tasks to complete the data acquisition, and the structured data acquisition scheduling task is implemented by configuring a sqoop task on top of the scheduler.

When a task is created, task templates of different types are created according to service requirements to implement the data grabbing task.

The web crawler scheduling task runs by calling a python spider script from java, so the task template of the web crawler contains the path of the python script. All task templates are defined as:

{FilePath:defaultvalue}
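As an illustration of how a template carrying a `FilePath` value could drive script execution while capturing standard output and standard error for the task log, a small sketch follows. The patent launches the script from java; this sketch uses Python's standard `subprocess` module as a stand-in, and the function name and the temporary demonstration script are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(template: dict[str, str]) -> dict:
    """Run the script named by FilePath and capture its output for the task log."""
    completed = subprocess.run(
        [sys.executable, template["FilePath"]],
        capture_output=True,   # collect standard output and standard error
        text=True,
    )
    return {
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Demonstration with a throwaway script standing in for a real spider.
with tempfile.TemporaryDirectory() as workdir:
    script = Path(workdir) / "spider.py"
    script.write_text(
        "import sys\n"
        "print('fetched 3 pages')\n"
        "print('1 page skipped', file=sys.stderr)\n"
    )
    outcome = run_script({"FilePath": str(script)})
```

A non-zero `exit_code` would mark the task as failed, at which point dependent tasks are held back and a failure alarm is raised.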

Task scheduling management for data cleaning and data analysis: spark-based python scripts are used for processing, with one or more python files per task. The task template sets its parameters in a format similar to JSON, as follows:

{pyfilepath:pathvalue}
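Because the template format is only described as similar to JSON, with unquoted keys and values, the standard `json` module cannot read it directly. The sketch below parses such a string and overlays task-specific values on the template defaults. The comma separator for multiple pairs, the function names and the example path are assumptions.

```python
def parse_kv_template(text: str) -> dict[str, str]:
    """Parse a template such as '{pyfilepath:pathvalue}' into a dict."""
    body = text.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("template must be wrapped in braces")
    params: dict[str, str] = {}
    for pair in body[1:-1].split(","):   # assumption: pairs are comma separated
        if not pair.strip():
            continue
        key, separator, value = pair.partition(":")
        if not separator:
            raise ValueError(f"missing ':' in {pair!r}")
        params[key.strip()] = value.strip()
    return params


def build_task_content(template_text: str, overrides: dict[str, str]) -> dict[str, str]:
    """Overlay task-specific values on the template defaults."""
    params = parse_kv_template(template_text)
    unknown = set(overrides) - set(params)
    if unknown:
        raise ValueError(f"parameters not in template: {sorted(unknown)}")
    return {**params, **overrides}


defaults = parse_kv_template("{pyfilepath:pathvalue}")
content = build_task_content("{pyfilepath:pathvalue}",
                             {"pyfilepath": "/jobs/clean_trips.py"})
```

Rejecting parameters that the template does not declare is a design choice of this sketch; it keeps task content aligned with its template.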

The big data task scheduling platform predefines various task templates and generates tasks by calling a template and configuring its parameters; a scheduler then obtains task information at intervals to generate a task list, and tasks are executed automatically according to their execution periods. Scheduling tasks are thus created from flexible template configuration and the dependencies among tasks are managed, supporting the whole life cycle of big data acquisition, cleaning and analysis. This removes the burden of repeatedly configuring scheduling tasks by hand, provides flexible, efficient and stable scheduling task management, and supports improving the performance of the big data system as a whole.

Big data task scheduling plays an overarching role in data ETL: the production, delivery and consumption of all data pass through it. Big data task scheduling management must therefore start from the characteristics of task scheduling and satisfy the requirements of the architecture and of the business scenarios that use big data, so as to build a highly available, efficient and flexible big data scheduling platform.

The big data task scheduling platform provides a batch workflow task scheduler for running a set of jobs and flows in a particular order within a workflow. The big data task scheduling system defines a KV file format to establish the dependency relationships between tasks and provides an easy-to-use web user interface for maintaining and tracking the configuration, management and monitoring of scheduling tasks.

The big data task scheduling platform can receive workflows submitted by users, communicate with the metadata, and save information such as the configuration, dependencies, run history, resource configuration and alarm configuration of scheduling tasks. It is responsible for the unified configuration maintenance, triggering, scheduling and monitoring of tasks, executes the work tasks submitted by users, monitors workflows, and stores the information, states and logs of all workflows.
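Running jobs in dependency order while keeping independent jobs parallel can be illustrated with Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later). This is a substitute technique chosen for the sketch, not the scheduler described in the patent, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group jobs into batches; every job in a batch has all its parents finished."""
    sorter = TopologicalSorter(deps)   # maps each job to the jobs it depends on
    sorter.prepare()                   # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # jobs whose parents are all done
        batches.append(ready)
        sorter.done(*ready)                  # mark the whole batch as finished
    return batches


workflow = {
    "clean": {"collect_db", "collect_web"},
    "analyse": {"clean"},
    "report": {"analyse"},
}
batches = parallel_batches(workflow)
```

Each inner list is a group of jobs that could run in parallel; the two collection jobs form the first batch and the remaining jobs follow one per batch.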

Embodiment two

A big data task scheduling system, which is the virtual-device counterpart of the big data task scheduling method of the first embodiment; referring to fig. 5, it includes:

a task template creation module 510, configured to create a task template according to a task type;

the task scheduling management module 520 is configured to select a task template and a task scheduling service number to create a task, form a configuration file, wherein the created task comprises a task name, task content and a task execution period, the task content is configured in a kv value mode, and a dependency relationship between tasks is established in a kv file format;

the task execution module 530 is configured to read the task, generate a task instance, and obtain task execution process information.

Wherein the task template creation module 510 includes:

The task schedule management module 520 includes:

The task execution module 530 includes:

Further, the second generating unit includes:

the trigger unit includes:

The above embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, but any insubstantial changes and substitutions made by those skilled in the art on the basis of the present invention are intended to be within the scope of the present invention as claimed.

Claims

1. The task scheduling method for big data is characterized by comprising the following steps:

creating a task template according to the task type;

reading the task, generating a task instance, and acquiring task execution process information;

the reading the task, generating a task instance, and obtaining task execution process information, including:

generating a task execution list according to the task instance;

monitoring the task execution list, and executing the task when the task meets the trigger condition;

the generating a task execution list according to the task instance comprises the following steps:

creating a task execution sub-thread, calling different task type modules according to task types of task contents, and reading the task contents; decomposing parameters of task content to generate a task instance, and executing a target task according to the task instance; the target task is a task of which the current time meets the task execution time.

2. The big data task scheduling method of claim 1, wherein the creating a task template according to a task type includes:

setting a template name, or/and automatically generating a template ID;

3. The big data task scheduling method of claim 1, wherein the selecting a task template and a task scheduling service number creates a task, comprising:

4. A big data task scheduling system, characterized by comprising:

the task execution module is used for reading the task, generating a task instance and acquiring task execution process information;

the task execution module includes:

the trigger unit is used for monitoring the task execution list, and executing the task when the task meets the trigger condition;

the second generation unit includes:

the trigger unit includes:

the execution sub-unit is used for creating a task execution sub-thread, calling different task type modules according to the task types of the task content and reading the task content; decomposing parameters of task content to generate a task instance, and executing a target task according to the task instance; the target task is a task of which the current time meets the task execution time.

5. The big data task scheduling system of claim 4, wherein the task template creation module includes:

6. The big data task scheduling system of claim 4, wherein the task scheduling management module includes: