CN110019144A

CN110019144A - Method and system for operation and maintenance of big data platform data

Info

Publication number: CN110019144A
Application number: CN201810630557.1A
Authority: CN
Inventors: 张翔
Original assignee: Hangzhou Shulan Technology Co Ltd
Current assignee: Hangzhou Shulan Technology Co Ltd
Priority date: 2018-06-19
Filing date: 2018-06-19
Publication date: 2019-07-16

Abstract

The invention discloses a method and system for data operation and maintenance (O&M) on a big data platform. A system for data O&M includes: an execution agent for executing instances of tasks that process data; a scheduler for allocating the instances of the tasks to the execution agent; a database storing information on the instances and the task scheduling information related to the instances; and a data source storing the data to be processed when an instance runs.

Description

Method and system for operation and maintenance of big data platform data

Technical Field

The invention relates to computer technology, and in particular to a method and a system for efficient data operation and maintenance on a big data platform.

Background

As enterprises accumulate more data and attach increasing importance to it, they need to mine that data continuously, at scale, and effectively. The demand for processing business data grows accordingly, and so does the demand for developing the corresponding data jobs. In addition, there are complex dependencies between jobs. When a cluster fails, jobs need to be repaired in batches; because there are many jobs, repairing each one manually is inefficient. Data repair is likewise required when a single job fails or its code has been modified.

In addition, a big data platform necessarily runs a large number of data jobs. When a bug in job code causes a job to fail, or when a cluster failure causes a large number of jobs to fail or terminate, the failed data jobs need to be maintained. There is therefore a need for an efficient and convenient method and system for the operation and maintenance of data jobs.

The traditional approach to data operation and maintenance is either to operate directly on the server back end or to click through a simple visual interface. Maintenance performed directly on the server back end requires operation and maintenance personnel who have operating authority on the server and experience with it, so dedicated staff are needed. Such operations are also complex, inefficient, and slow to respond. Maintenance performed through conventional click operations is not flexible enough, does not suit the variety of job repair scenarios, and does not support intelligent repair of batch jobs well. One example of a prior-art data operation and maintenance method is the bank data processing job scheduling system and method taught by Chinese patent application publication No. CN 106156956A; that technology is only used for batch processing of large amounts of data and cannot solve the problem of data job maintenance on a big data platform.

Disclosure of Invention

One aspect of the invention discloses a system for data operation and maintenance, which can comprise: an execution agent to execute an instance of a task to process data; a scheduler for allocating instances of tasks to the execution agent; a database storing information of the instance and task scheduling information related to the instance; and a data source storing data to be processed when the instance runs.

The scheduler is capable of performing at least one of the following functions: a complement data function for specifying an arbitrary date range and generating an instance of a task for each day; a set-success function for changing the state of an instance of a failed task to success; and a rerun function for rerunning an instance of a task when that instance fails.

The complement data function is capable of running the instances of tasks serially, in date order, within the specified date range.

The complement data function is capable of running the instances of tasks in parallel within the specified date range, regardless of date order.

For a single failed instance of a task, the rerun function is capable of rerunning that instance. For a failed instance of a task and the failed instances of its downstream tasks, the rerun function is capable of rerunning the instance of the task and the instances of the downstream tasks.

The rerun function is capable of performing the following functions: (1) for a failed instance of a task and the failed instances of its downstream tasks, rerunning the instance of the task and the instances of the downstream tasks; and (2) automatically identifying whether an external parent node on the link related to the task has failed and, if the external parent node has failed, not performing function (1).

The rerun function is capable of rerunning an instance of a task and the instances of its downstream tasks and, during the rerun, skipping instances of tasks that have already executed successfully and rerunning only instances of tasks that have failed.

One aspect of the invention discloses a method for data operation and maintenance on a big data platform, wherein the big data platform comprises an execution agent for executing an instance of a task for processing data and a data source for storing data to be processed when the instance of the task runs, the method comprises the following steps: storing information of an instance of a task and task scheduling information associated with the instance; and assigning instances of the task to the execution agents.

The step of assigning an instance of the task to the execution agent further comprises at least one of: a data complement step for specifying an arbitrary date range and generating an instance of a task for each day; a set-success step for changing the state of an instance of a failed task to success; and a rerun step for rerunning an instance of a task when that instance fails.

The data complement step runs the instances of tasks serially, in date order, within the specified date range.

The data complement step runs the instances of tasks in parallel within the specified date range, regardless of date order.

The rerun step reruns a single failed instance of a task.

For a failed instance of a task and the failed instances of its downstream tasks, the rerun step reruns the instance of the task and the instances of the downstream tasks.

The rerun step comprises the following steps: (1) for a failed instance of a task and the failed instances of its downstream tasks, rerunning the instance of the task and the instances of the downstream tasks; and (2) automatically identifying whether an external parent node on the link related to the task has failed and, if the external parent node has failed, not performing step (1).

The rerun step reruns an instance of a task and the instances of its downstream tasks and, during the rerun, skips instances of tasks that have already executed successfully and reruns only instances of tasks that have failed.

One aspect of the invention discloses a computer readable medium, on which computer readable instructions are stored, which when executed by a computer, can perform the method for the operation and maintenance of big data platform data.

The operation and maintenance method of the invention can perform intelligent, effective operation and maintenance on a single job, multiple jobs, or jobs at large scale, efficiently and conveniently.

Drawings

FIG. 1 illustrates a data scheduling operation and maintenance architecture diagram according to an embodiment of the present invention.

FIG. 2A illustrates a self-dependent complement data architecture diagram according to an embodiment of the invention.

FIG. 2B illustrates a parallel-execution complement data architecture diagram according to an embodiment of the invention.

FIG. 3 illustrates a job set-success architecture diagram according to an embodiment of the present invention.

FIG. 4A illustrates a single-task rerun architecture diagram according to an embodiment of the present invention.

FIG. 4B illustrates a rerun downstream architecture diagram according to an embodiment of the invention.

FIG. 4C illustrates automatic identification of whether an external parent node has failed in a rerun downstream architecture according to an embodiment of the present invention.

FIG. 4D illustrates a repair rerun architecture diagram according to an embodiment of the present invention.

Detailed Description

The content of the invention will now be discussed with reference to a number of exemplary embodiments. It is to be understood that these examples are discussed only to enable those of ordinary skill in the art to better understand and thus implement the teachings of the present invention, and are not meant to imply any limitations on the scope of the invention.

As used herein, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to". The term "based on" is to be read as "based, at least in part, on". The terms "one embodiment" and "an embodiment" are to be read as "at least one embodiment". The term "another embodiment" is to be read as "at least one other embodiment".

Different embodiments of the invention allow the time period for a job's data repair to be chosen freely; execution can be self-dependent (serial) or independent and parallel along the time dimension; a failed job that does not affect the overall workflow can be set to success so that downstream jobs are not held up; a failed job can be rerun; and the jobs downstream of a job can be rerun in batches. Embodiments of the invention can thus provide the different functions that operation and maintenance requires in different scenarios.

In the present disclosure, "task" and "job" are concepts that can be substituted for each other.

The system architecture of one embodiment of the present invention is shown in FIG. 1 and includes instances, a database (a MySQL database), a scheduler, an execution agent, and a data source (e.g., the HDFS storage system of Hadoop). The database is a repository, built on a computer storage device, that organizes, stores, and manages data according to a data structure; the MySQL database is used to store information about instances. The data source stores the data used for computation (such as user behavior log data, transaction data, and the like), and the data is stored as files in the Hadoop Distributed File System (HDFS). The components of the architecture are as follows:

● instances are typically generated when a user submits a job, with each submission of a job generating a corresponding instance;

● the scheduler constructs a directed acyclic graph (DAG) from the dependency relationships between tasks and schedules the tasks to the execution agent according to a certain algorithm; here, scheduling means passing the information of the job to be run to the execution agent (which acts as an executor), and the execution agent allocates run resources for the job according to the job information it receives;

● the database stores the information of instances and the task scheduling information;

● the execution agent performs the computation of the instances;

● the data source stores data and supplies the data for job computation.
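The scheduler's role in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the function name `schedule_order`, the dictionary format for dependencies, and the choice of Kahn's topological sort are all hypothetical, since the disclosure only says that tasks are scheduled from a DAG "according to a certain algorithm". The sketch also assumes that every task, including tasks with no dependencies, appears as a key.

```python
from collections import deque


def schedule_order(dependencies):
    """Return task names so that every task comes after its upstream tasks.

    `dependencies` maps each task to the list of tasks it depends on
    (hypothetical format; every task must appear as a key).
    """
    # Count unfinished upstream tasks and record the downstream links.
    pending = {task: len(ups) for task, ups in dependencies.items()}
    downstream = {task: [] for task in dependencies}
    for task, ups in dependencies.items():
        for up in ups:
            downstream[up].append(task)

    ready = deque(sorted(t for t, n in pending.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        # A real scheduler would hand the instance to an execution agent here.
        order.append(task)
        for child in sorted(downstream[task]):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(dependencies):
        raise ValueError("dependency graph contains a cycle")
    return order
```

For a chain in which C depends on B and B depends on A, the function yields A, then B, then C; a cyclic graph is rejected because it is not a DAG.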

According to an embodiment of the invention, the job is the minimum unit of operation of the platform, and 12 job types such as Shell, Hive, Spark, MapReduce, Presto, Flink and the like are currently supported.

In the present disclosure, an instance is an executable object created from a job. Each run of a task produces a new instance. A normally running instance usually goes through three stages: waiting to run, running, and ended. For a given task, the instances generated by two different runs have different IDs.

In this disclosure, scheduling refers to the method of allocating instances to run. For example, scheduling supports three granularities: day, week, and month. There may be dependencies between different tasks. According to one embodiment of the present invention, the dependency relationship is: task B is an upstream task of task A if scheduling task A requires task B to be completed. In this disclosure, the scheduling time is the time at which a task starts running. When a task reaches its scheduling time, it is not scheduled if any upstream task is unfinished; only after all upstream tasks are completed is the task scheduled and run. In this disclosure, a scheduler is a program that allocates instances to execution agents.
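The notion of an instance can be illustrated with a small Python sketch. The class name `Instance`, its fields, the integer ID counter, and the state names are assumptions made for illustration; the disclosure only states that each run creates a new instance with its own ID and that an instance normally passes through three stages.

```python
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)  # hypothetical ID generator; any unique ID scheme would do

STAGES = ["waiting", "running", "ended"]  # assumed names for the three stages


@dataclass
class Instance:
    task: str
    instance_id: int = field(default_factory=lambda: next(_ids))
    state: str = "waiting"

    def advance(self):
        """Move the instance to the next stage of its normal life cycle."""
        position = STAGES.index(self.state)
        if position == len(STAGES) - 1:
            raise ValueError("instance has already ended")
        self.state = STAGES[position + 1]
```

Creating two `Instance` objects for the same task gives them different `instance_id` values, mirroring the statement that two runs of one task produce instances with different IDs.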

In addition, in the present disclosure, interdependencies are supported between scheduling periods of different granularity. This can be understood with reference to the following example: suppose job A runs daily, job B runs every Monday, and job C runs on the 1st of every month. If job C depends on job B and job B depends on job A, then job A is scheduled first, job B is scheduled after job A finishes, and job C is scheduled after job B finishes. If a given day is neither a Monday nor the 1st of the month, only job A actually runs: job B runs empty after job A has run, and job C runs empty after job B has been scheduled (an empty run means that the job is scheduled but does not run its script, so no computation is carried out; an empty run can also be understood as merely going through the scheduling process). If a day is both a Monday and the 1st of the month, job A runs first, then job B, and finally job C.
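The A/B/C example can be expressed as a short Python sketch. The function names `run_mode` and `plan_chain`, the granularity labels, and the return values "run" and "empty" are hypothetical; the sketch only assumes, as in the example, that the weekly job fires on Mondays and the monthly job on the 1st.

```python
from datetime import date


def run_mode(granularity, day):
    """Return "run" if a job of this granularity executes its script on `day`,
    or "empty" if it is scheduled but only runs empty."""
    if granularity == "daily":
        return "run"
    if granularity == "weekly":   # assumed to fire on Mondays, as in the example
        return "run" if day.weekday() == 0 else "empty"
    if granularity == "monthly":  # assumed to fire on the 1st of the month
        return "run" if day.day == 1 else "empty"
    raise ValueError(f"unknown granularity: {granularity}")


def plan_chain(day):
    """Plan jobs A (daily) -> B (weekly) -> C (monthly) in dependency order."""
    chain = [("A", "daily"), ("B", "weekly"), ("C", "monthly")]
    return [(name, run_mode(granularity, day)) for name, granularity in chain]
```

Every job in the chain is still scheduled each day, so the dependency order A, B, C is preserved; the empty run simply lets downstream jobs proceed without computation.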

In addition, according to an embodiment of the present invention, the scheduler allocation algorithm is a distributed scheduling method based on a Directed Acyclic Graph (DAG), while supporting dependent scheduling and timing scheduling.

Here, "dependent scheduling" refers to scheduling each task according to its dependency relationships. For example, if task A depends on task B, task A is scheduled only after task B has been scheduled and has run successfully (this dependency usually arises because task A uses data generated by task B, so running task A makes sense only after task B has run successfully).

Here, "timing scheduling" refers to scheduling a certain task at a specified time. For example, if the configuration is to schedule task A at 2:00, then task A will be scheduled at 2:00.

The phrase "simultaneously supporting dependent scheduling and timing scheduling" means that the same task can be configured with both a scheduled time and dependencies. For example, task A is configured with both a scheduled time (e.g., 2:00 every day) and a dependency (e.g., on task B). Task A then runs only after 2:00 and after task B has run successfully: if task B finishes before 2:00, task A runs on time at 2:00; if task B finishes after 2:00, task A runs once task B has finished; and if task B fails, task A is automatically placed in a failed state and is not run.

Embodiments of the present invention can use scheduling rules. A scheduling rule may be, for example, "task A, run at 8:00 every day"; this rule may be stored as task scheduling information, for example in the form "00/000/08100".
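The combined rule for a task that has both a scheduled time and upstream dependencies can be sketched as a single decision function. The name `decide`, the state labels, and the return values are assumptions for illustration only.

```python
from datetime import time


def decide(now, scheduled_at, upstream_states):
    """Decide what to do with a task that has a scheduled time and upstream tasks.

    `upstream_states` holds one of "success", "running", "waiting" or "failed"
    per upstream task (hypothetical labels). Returns "run", "wait" or "failed".
    """
    if any(state == "failed" for state in upstream_states):
        return "failed"  # an upstream failure propagates; the task is not run
    if now < scheduled_at:
        return "wait"    # the scheduled time has not been reached yet
    if all(state == "success" for state in upstream_states):
        return "run"     # time reached and every upstream task succeeded
    return "wait"        # time reached, but an upstream task is unfinished
```

With a scheduled time of 2:00, the function returns "wait" at 1:30 even if task B has already succeeded, "run" at 2:00 once task B has succeeded, and "failed" whenever task B has failed.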

In this disclosure, an execution agent refers to a program that submits instances to a storage and compute cluster for execution. According to one embodiment of the invention, the execution agent receives the task information of the instance to be run, parses the task type of the instance, and submits the instance to the specific environment in which it runs. Each received task carries its own task information (e.g., the task type (Hive or Spark), the name of the task, whether the task has tasks that it depends on and/or tasks that depend on it, the parameter variables of the task, etc.). For example, the execution agent submits a Hive-type task to the Hive environment and a Spark-type task to the Spark environment.

The operation and maintenance of the embodiment of the invention is carried out through instances: either an already generated instance or a newly generated instance can be invoked. Each time a user submits a job, an instance is generated, stored in a database (e.g., a MySQL database), and retained for a certain period of time.
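The routing performed by the execution agent can be sketched as a lookup from task type to a runner. Everything named here (`submit`, `RUNNERS`, the `type` and `name` keys) is hypothetical, and the runners are stand-ins that return a string rather than contacting a real Hive or Spark environment.

```python
def submit(task_info, runners):
    """Route a task to the runner registered for its task type."""
    task_type = task_info["type"].lower()
    try:
        runner = runners[task_type]
    except KeyError:
        raise ValueError(f"unsupported task type: {task_type}") from None
    return runner(task_info)


# Stand-in runners; a real agent would submit to a Hive or Spark environment here.
RUNNERS = {
    "hive": lambda info: f"submitted {info['name']} to Hive",
    "spark": lambda info: f"submitted {info['name']} to Spark",
}
```

Supporting a further job type in this sketch would only require registering another entry in the mapping.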

The embodiment of the invention can realize the following functions:

1. Data complement function

The complement data function can specify any date range and generate a job instance for each day, where the date range is chosen freely on a visual date selection interface. These instances can run in two ways: the self-dependent complement method (as shown in FIG. 2A) and the parallel-execution complement method (as shown in FIG. 2B). The self-dependent method automatically generates a serial dependency relationship in the order of the job dates, so that the instance for the next day can execute only after the instance for the previous day has finished. The parallel-execution method creates no dependency relationship: the daily job instances are independent and execute in parallel.

As can be seen from FIG. 2A, instances 1 to n have different date labels. The self-dependent complement method creates dependency relationships according to the order in which the instances were generated. That is, instance 1 is executed first, instance 2 is executed after instance 1 has finished, and so on until instance n is executed after instance n-1 has finished.

As can be seen from FIG. 2B, even though instances 1 through n have different date labels, the parallel-execution complement method executes the instances simultaneously, without creating any dependency between them.
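The two complement modes differ only in whether each daily instance depends on the previous day's instance. A minimal Python sketch, in which the function name `complement_instances`, the `self_dependent` flag, and the dictionary layout of an instance are all assumptions:

```python
from datetime import date, timedelta


def complement_instances(task, start, end, self_dependent):
    """Generate one instance per day in [start, end] for `task`.

    Each instance records its date and the date of the instance it depends on
    (None when it has no dependency).
    """
    if end < start:
        raise ValueError("end date precedes start date")
    instances = []
    day = start
    while day <= end:
        previous = day - timedelta(days=1)
        # Self-dependent mode chains each day to the day before it;
        # parallel mode creates no dependency at all.
        depends_on = previous if self_dependent and day > start else None
        instances.append({"task": task, "date": day, "depends_on": depends_on})
        day += timedelta(days=1)
    return instances
```

In self-dependent mode the instances form a serial chain, so a scheduler that honors `depends_on` runs them in date order; in parallel mode every `depends_on` is `None`, so all instances can be dispatched at once.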

2. Set-success function

The set-success function sets a failed job instance to the success state (as shown in FIG. 3). When a job fails but the failure has no actual effect on the overall job flow, the job instance may be set to the success state so that downstream jobs can execute smoothly. The instances downstream of the job may then transition from the failed-stop state to the running or waiting state.
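The effect of setting a failed instance to success can be sketched as a state update followed by a check of the directly downstream instances. The function name `set_success`, the state labels (including "stopped" for the failed-stop state), and the mapping formats are hypothetical.

```python
def set_success(states, upstream, task):
    """Mark a failed `task` as successful and release the downstream
    instances it was blocking.

    `states` maps task -> state; `upstream` maps task -> list of upstream tasks.
    Returns a new state mapping and leaves `states` unchanged.
    """
    if states[task] != "failed":
        raise ValueError(f"{task} is not in the failed state")
    new_states = dict(states)
    new_states[task] = "success"
    for child, parents in upstream.items():
        if task in parents and new_states[child] == "stopped":
            # Run if every upstream task has now succeeded; otherwise keep waiting.
            ready = all(new_states[p] == "success" for p in parents)
            new_states[child] = "running" if ready else "waiting"
    return new_states
```

A downstream instance whose other upstream tasks have all succeeded moves to running, while one that still has an unfinished upstream task moves to waiting.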

3. Rerun function

The rerun function takes three forms: a single-task rerun function (e.g., FIG. 4A), a rerun downstream function (e.g., FIGS. 4B and 4C), and a repair rerun function (e.g., FIG. 4D).

When a job fails, the job may be rerun; that is, the scheduler retrieves the instance of the failed job (the single instance that was already generated) from the MySQL database, reschedules it, and reruns it. This is the single-task rerun function.

When a job fails and its downstream jobs also fail, the job (parent node) and its downstream jobs (child nodes) can be rerun together, i.e., the job is rerun first and then its child-node jobs are rerun. The scheduler again schedules the instances that these jobs have already generated. This is the rerun downstream function.

According to one embodiment of the invention, the rerun downstream function can also automatically identify whether an external parent node of a node on the task's dependency link has failed. For example, as shown in FIG. 4C, node P is an external parent node of node E, which lies on the link within the circle. If the external parent node has failed, the rerun downstream function is not performed: the node needs the data of all of its parent nodes, so rerunning it makes no sense while an external parent upstream of it has failed. In the example of FIG. 4C, if the rerun downstream function is applied to node A while node P has also failed, the rerun downstream function is not executed.

When a failed downstream job also depends on other upstream jobs, and those other upstream jobs have also failed, rerun downstream cannot be performed. In that case the repair rerun function may be used: the entire workflow is run step by step starting from the most upstream job. A job that has already executed successfully is skipped; a job that has failed is rerun. The scheduler again schedules the instances already generated by these jobs and executes the workflow according to the dependencies between the instances.
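The external-parent check can be sketched as two passes over the dependency graph: collect the link (the starting node and everything downstream of it), then look for a failed parent that lies outside the link. The function name `rerun_downstream` and the mapping formats are assumptions, and the node names in the usage note are borrowed from the FIG. 4C description rather than from the figure itself. The sketch also treats the starting node's own parents as external parents, which is an interpretation rather than something the disclosure states.

```python
def rerun_downstream(start, upstream, states):
    """Return the set of tasks to rerun, or an empty set if an external
    parent of any node on the link has failed.

    `upstream` maps task -> list of parent tasks;
    `states` maps task -> "success" or "failed".
    """
    # Collect `start` and every task downstream of it (the link).
    link = {start}
    frontier = [start]
    while frontier:
        task = frontier.pop()
        for child, parents in upstream.items():
            if task in parents and child not in link:
                link.add(child)
                frontier.append(child)

    # An external parent is a parent of a link node that is not on the link.
    for node in link:
        for parent in upstream.get(node, []):
            if parent not in link and states.get(parent) == "failed":
                return set()  # rerunning would be pointless; do nothing
    return link
```

With A upstream of B, B and P upstream of E, and P outside the link, applying the function to A returns A, B and E while P is successful, and returns an empty set once P has failed. The order in which the returned tasks run is left to the DAG scheduler.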
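The repair rerun can be sketched as a single walk over the workflow in upstream-first order. The function name `repair_rerun`, the `run` callback, and the decision to stop at the first job that fails again are assumptions; the disclosure only specifies that successful jobs are skipped and failed jobs are rerun.

```python
def repair_rerun(order, states, run):
    """Walk the workflow from the most upstream job, rerunning only failed jobs.

    `order` lists jobs upstream-first; `states` maps job -> "success" or
    "failed" and is updated in place; `run` is a callable that executes a job
    and returns True on success. Returns the jobs that were actually rerun.
    """
    rerun = []
    for job in order:
        if states[job] == "success":
            continue  # already executed successfully: skip
        rerun.append(job)
        states[job] = "success" if run(job) else "failed"
        if states[job] == "failed":
            break  # assumption: stop here, since downstream jobs need this one
    return rerun
```

For a workflow ordered P, A, E in which P and E have failed and A has succeeded, only P and E are rerun, and A is skipped.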

The method and system of the embodiments of the present invention can be implemented as pure software (e.g., a software program written in Java language), as pure hardware (e.g., a dedicated ASIC chip or FPGA chip) as needed, or as a system combining software and hardware (e.g., a firmware system storing fixed code).

Another aspect of the invention is a computer-readable medium having computer-readable instructions stored thereon that, when executed, perform a method of embodiments of the invention.

While various embodiments of the present invention have been described above, the above description is intended to be illustrative, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The scope of the claimed subject matter is limited only by the attached claims.

Claims

1. A system for data operation and maintenance, comprising:

an execution agent to execute an instance of a task to process data;

a scheduler for allocating instances of tasks to the execution agents;

a database storing information of the instance and task scheduling information related to the instance; and

a data source storing data to be processed when the instance runs.

2. The system of claim 1, wherein the scheduler is capable of performing at least one of the following functions:

a complement data function for specifying an arbitrary date range and generating an instance of a task for each day;

a set-success function for changing the state of an instance of a failed task to success; and

a rerun function for rerunning an instance of a task when that instance fails.

3. The system of claim 2, wherein the complement data function is capable of running the instances of tasks serially, in date order, within the specified date range.

4. The system of claim 2, wherein the complement data function is capable of running the instances of tasks in parallel within the specified date range, regardless of date order.

5. The system of claim 2, wherein the rerun function is capable of rerunning a single failed instance of a task.

6. The system of claim 2, wherein the rerun function is capable of, for a failed instance of a task and the failed instances of its downstream tasks, rerunning the instance of the task and the instances of the downstream tasks.

7. The system of claim 2, wherein the rerun function is capable of performing the following functions:

(1) for a failed instance of a task and the failed instances of its downstream tasks, rerunning the instance of the task and the instances of the downstream tasks; and

(2) automatically identifying whether an external parent node of a link related to the task has failed and, if the external parent node has failed, not performing function (1).

8. The system of claim 2, wherein the rerun function is capable of rerunning an instance of a task and the instances of its downstream tasks and, during the rerun, skipping instances of tasks that have already executed successfully and rerunning only instances of tasks that have failed.

9. A method for data operations and maintenance on a big data platform, wherein the big data platform includes an execution agent for executing an instance of a task that processes data and a data source for storing data to be processed when an instance of a task is run, the method comprising:

storing information of an instance of a task and task scheduling information associated with the instance; and

assigning an instance of the task to the execution agent.

10. The method of claim 9, wherein the step of assigning the instance of the task to the execution agent further comprises at least one of:

a data complement step for specifying an arbitrary date range and generating an instance of a task for each day;

a set-success step for changing the state of an instance of a failed task to success; and

a rerun step for rerunning an instance of a task when that instance fails.

11. The method of claim 10, wherein the data complement step runs the instances of tasks serially, in date order, within the specified date range.

12. The method of claim 10, wherein the data complement step runs the instances of tasks in parallel within the specified date range, regardless of date order.

13. The method of claim 10, wherein the rerun step reruns a single failed instance of a task.

14. The method of claim 10, wherein, for a failed instance of a task and the failed instances of its downstream tasks, the rerun step reruns the instance of the task and the instances of the downstream tasks.

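15. The method of claim 10, wherein the rerun step comprises the steps of:

(1) for a failed instance of a task and the failed instances of its downstream tasks, rerunning the instance of the task and the instances of the downstream tasks; and

(2) automatically identifying whether an external parent node of a link related to the task has failed and, if the external parent node has failed, not performing step (1).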

16. The method of claim 10, wherein the rerun step reruns an instance of a task and the instances of its downstream tasks and, during the rerun, skips instances of tasks that have already executed successfully and reruns only instances of tasks that have failed.

17. A computer readable medium having computer readable instructions stored thereon which, when executed by a computer, are capable of performing the method of any one of claims 9-16.