CN110362390B - Distributed data integration job scheduling method and device - Google Patents

Distributed data integration job scheduling method and device

Info

Publication number
CN110362390B
CN110362390B
Authority
CN
China
Prior art keywords
job
scheduling
information
module
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910489422.2A
Other languages
Chinese (zh)
Other versions
CN110362390A (en)
Inventor
李建元
刘飞黄
王超群
刘兴田
贾建涛
温晓岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yinjiang Technology Co.,Ltd.
Original Assignee
Enjoyor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Enjoyor Co Ltd filed Critical Enjoyor Co Ltd
Priority to CN201910489422.2A priority Critical patent/CN110362390B/en
Publication of CN110362390A publication Critical patent/CN110362390A/en
Application granted granted Critical
Publication of CN110362390B publication Critical patent/CN110362390B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/485Task life-cycle, e.g. stopping, restarting, resuming execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a distributed data integration job scheduling method and device aimed at the special scenarios that data integration may face. A job scheduling device is responsible for issuing data integration jobs to a job running device; the job running device receives the scheduled tasks, starts job execution, sends job running state information to a job management module, feeds back working-node computing resources to a resource scheduling module, and feeds back loss-of-connection or fault information to a job preloading module. The invention combines the following characteristics: (1) high availability, fault tolerance and weak consistency; (2) low latency for quasi-real-time job scheduling; (3) multi-tenant concurrency control for cloud service applications; (4) computing-resource isolation and parallel scheduling of multiple jobs; (5) a priority scheduling mechanism.

Description

Distributed data integration job scheduling method and device
Technical Field
The invention relates to the technical field of big data infrastructure, and in particular to a distributed data integration job scheduling method and device.
Background
With the evolution of the digital economy, service digitization has developed rapidly across industries, and digital services have gradually become a new center of gravity. However, service digitization produces a large number of data silos, which have become a common pain point in delivering digital services. Industries therefore urgently need data integration to break down and avoid data silos and to integrate and manage data resources, so that the associative value among data can be exploited effectively.
Data integration often involves scheduling thousands of jobs, including job types such as data exchange and data preprocessing, and the design of a scheduling system must consider various complex scenarios. For example, some scenarios not only have a large number of jobs but also require parallel processing; some scenarios require quasi-real-time jobs, so priority scheduling must be considered; some jobs occupy more computing resources, or occupy them for longer, so resource isolation must be considered to avoid affecting other jobs; some jobs may suffer failures such as downtime of data sources and targets, network interruption, or downtime of running nodes, so a fault-tolerance mechanism is needed; in a multi-tenant scenario, concurrency control must be handled; and so on.
The prior art fails to meet these complex data integration job scheduling requirements. For example, the traditional LTS scheduling system isolates jobs by thread: if the execution thread of one job exhausts all the memory of the current process, every job in that process fails. It lacks the ability to schedule data integration jobs and is better suited to lightweight tasks. Chinese patent CN201610800080 discloses a distributed task scheduling system and method for reducing the amount of code developers must write and the heavy development burden of parallel computing programs; it is essentially scheduling for a single large-scale distributed computing job and does not consider parallel scheduling of multiple tasks. Chinese patent CN201610197298 discloses a task scheduling method, apparatus and system that provides multi-channel, multi-task distributed scheduling and solves the starvation of other jobs caused by a single task occupying the scheduler for too long, but it does not consider low-latency access to job metadata or how to ensure job metadata consistency. Chinese patent CN201410748604 discloses a distributed task scheduling system and method that ensures the reliability of the system itself, supports independent or associated tasks, and supports rollback of task distribution; however, it is not suitable for complex data integration scenarios, where task rollback is not a key concern and data integration tasks need not be associated with one another, and it does not consider the high risk of node downtime under large-scale, complex data integration workloads, the need for high availability, how to reduce latency as much as possible when quasi-real-time tasks exist, or how to ensure consistency of job metadata.
Disclosure of Invention
Aimed at the special scenarios that data integration may face, in the invention a job scheduling device is responsible for issuing data integration jobs to a job running device; the job running device receives the scheduled tasks, starts job execution, sends job running state information to a job management module, feeds back working-node computing resources to a resource scheduling module, and feeds back loss-of-connection or fault information to a job preloading module. The invention combines the following characteristics: (1) high availability, fault tolerance and weak consistency; (2) low latency for quasi-real-time job scheduling; (3) multi-tenant concurrency control for cloud service applications; (4) computing-resource isolation and parallel scheduling of multiple jobs; (5) a priority scheduling mechanism.
The invention achieves the aim through the following technical scheme: a distributed data integration job scheduling method comprises the following steps:
(1) the job scheduling device issues the data integration job to the job running device, wherein the job scheduling device comprises a job management module, a job preloading module and a resource scheduling module: (1.1) the job management module receives, caches and stores job-related meta-information and performs concurrency control;
(1.2) the job preloading module acquires the jobs to be processed from the job management module and determines the scheduling priority order;
(1.3) the resource scheduling module completes resource allocation and schedule distribution by acquiring the job preloading information and the computing-resource information of the job running device;
(2) the job running device receives the scheduled task, starts job execution, feeds back job running state information to the job management module, feeds back working-node computing resources to the resource scheduling module, and feeds back loss-of-connection or fault information to the job preloading module.
Preferably, the job management module includes an information receiving unit, an information caching unit, a persistent storage unit and a concurrency control unit, which operate as follows:
(i) the information receiving unit receives job submissions, job meta-information modifications and scheduling policy updates; receives information on jobs with unallocated resources fed back by the resource scheduling module and updates job states; and receives job state information fed back by the job running device and updates job states;
(ii) the information caching unit locally caches job meta-information and state information and supports frequent real-time queries;
(iii) the persistent storage unit persists job metadata from the cached content and maintains data consistency between the cache layer and the storage layer;
(iv) the concurrency control unit assigns a read-write lock to the access of each job resource.
Preferably, in step (iii), the data consistency between the cache layer and the storage layer is maintained as follows:
(a) fault-tolerant storage is used: updated data is written to a local file and written back to the storage layer after the network returns to normal;
(b) a circuit breaker (fuse) is used: when the fault-tolerance mechanism is triggered up to a preset threshold, the breaker opens, the service degrades, and no new task is scheduled;
(c) the job state interface of the job running device is encapsulated; each time a job scheduling node starts, the job running states are obtained and audited to ensure that the job states in the job running device are consistent with the states in the metadata storage layer.
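As an illustration only, a minimal Java sketch of items (a) and (b) could look like the following; the class and method names (MetadataPersistenceGuard, MetadataStore, and so on) are hypothetical and not part of the patent. Persistence failures are appended to a local fallback file, a failure counter drives the breaker, and once the breaker opens the scheduler degrades and stops accepting new tasks until the buffered records are replayed.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the "fuse" (circuit breaker) guarding metadata persistence. */
public class MetadataPersistenceGuard {
    private final Path fallbackFile;          // local fault-tolerant storage, replayed later
    private final int openThreshold;          // preset threshold that opens the breaker
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean open = false;    // open breaker => degraded, no new scheduling

    public MetadataPersistenceGuard(Path fallbackFile, int openThreshold) {
        this.fallbackFile = fallbackFile;
        this.openThreshold = openThreshold;
    }

    /** Try to persist one job-metadata record; fall back to the local file on failure. */
    public void persist(String record, MetadataStore store) {
        try {
            store.write(record);                       // write-through to the storage layer
            failures.set(0);                           // healthy again, reset the counter
        } catch (Exception networkOrDbError) {
            appendToLocalFile(record);                 // (a) fault-tolerant local storage
            if (failures.incrementAndGet() >= openThreshold) {
                open = true;                           // (b) breaker opens, service degrades
            }
        }
    }

    /** The scheduler checks this before accepting a new task. */
    public boolean acceptingNewTasks() {
        return !open;
    }

    /** Replay the locally buffered records once the network/storage recovers. */
    public void replay(MetadataStore store) throws Exception {
        if (!Files.exists(fallbackFile)) return;
        for (String line : Files.readAllLines(fallbackFile, StandardCharsets.UTF_8)) {
            store.write(line);
        }
        Files.delete(fallbackFile);
        failures.set(0);
        open = false;                                  // close the breaker after recovery
    }

    private void appendToLocalFile(String record) {
        try {
            Files.writeString(fallbackFile, record + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (Exception ignored) { /* last resort: rely on the startup audit in (c) */ }
    }

    /** Hypothetical storage-layer abstraction (relational database, MongoDB, ...). */
    public interface MetadataStore {
        void write(String record) throws Exception;
    }
}
```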
Preferably, the job preloading module includes a real-time query unit, a job preloading unit and a fault processing unit, which operate as follows:
(I) the real-time query unit queries the job metadata cache in real time to acquire unlocked jobs waiting to be scheduled;
(II) the job preloading unit adds the unlocked jobs to be scheduled to a bounded ordered queue, sorted by job scheduling time and job priority;
(III) the fault processing unit receives fault information from the job running device and performs fault-tolerance processing; fault-tolerance processing means that when a working node goes down or its network is disconnected for a long time while a job is running, the job running device notifies the job preloading module, which connects to the working node to check whether the connection is unavailable; if it is unavailable, the job is put directly back into the queue, and the lost job is eventually scheduled to another available node; if the disconnected node later recovers, the job running device directly kills the job process, ensuring that the same job never runs on two nodes of one job running device at the same time.
Preferably, the bounded ordered blocking queue is used to load all jobs waiting to be scheduled, where bounded means that an upper limit on the number of jobs is guaranteed and the upper-limit parameter can be set from a scenario evaluation; ordered means that jobs with earlier trigger times and higher priorities are placed toward the front of the queue for preferential scheduling; when a job is stopped or deleted, removal of the specified waiting job from the queue is supported; a producer-consumer model is adopted, and a thread blocking-wakeup approach is used to reduce CPU load.
Preferably, the resource scheduling module includes a resource obtaining unit, a resource allocation unit and a scheduling/distributing unit, which operate as follows:
1) the resource obtaining unit obtains the computing resources of the job running cluster and caches them in memory;
2) the resource allocation unit takes all jobs from the bounded ordered queue and allocates computing resources to each job in order of job priority;
3) the scheduling/distributing unit assigns an executor to each scheduled job and sends the job meta-information, the allocated computing resources and the executor configuration to the job running cluster.
Preferably, in step (2), the job running device includes a master control node and working nodes; the master control node is responsible for management and coordination, and the working nodes are responsible for executing the data integration jobs. The master control node receives the job meta-information, job resource allocation information and job executor information distributed by the resource scheduling module and starts job execution. The agent program on a working node collects job state information and sends it to the master control node, which sends it to the job management module; the agent collects working-node computing resources and sends them to the master control node, which sends them to the resource scheduling module; the agent sends heartbeat information to the master control node, and the master control node sends loss-of-connection or fault information to the job preloading module. The executor on a working node has a retry mechanism: once the data-flow source or target goes down or loses its connection, the executor performs timed retries so that normal operation can continue after the source and target recover.
Preferably, the job running device performs distributed resource management based on a Mesos cluster system: the master control node provides low-latency local metadata management using a RAM + WAL log approach, maintains job-state synchronization across a large number of working nodes using the Paxos algorithm, and pushes specific physical resources to the job scheduling system based on the master control node's unified cluster physical-resource management interface and a specific resource-sharing policy. The job running cluster provides two modes, a multi-language driver package and JSON RPC, for the job scheduling system to register and to obtain specific callback events. The agent program is responsible for collecting working-node resources, running specific scheduled tasks through the executor, and returning the executor's execution results and task states to the master control node, which then forwards them to the job scheduling device.
A distributed data integration job scheduling apparatus comprises a job scheduling device and a job running device that exchange information with each other. The job scheduling device comprises a job management module, a job preloading module and a resource scheduling module: the job management module receives, caches and stores job-related meta-information and performs concurrency control; the job preloading module acquires the jobs to be processed from the job management module and determines the scheduling priority order; the resource scheduling module completes resource allocation and schedule distribution by acquiring the job preloading information and the computing-resource information of the job running device. The job running device comprises a master control node responsible for management and coordination and working nodes responsible for executing the data integration jobs.
Preferably, both the job scheduling device and the job running device are registered in ZooKeeper. The job scheduling device adopts an active-standby mode: once the active device goes down, ZooKeeper elects the standby device to take over the job scheduling work. The master control nodes in the job running device likewise adopt an active-standby mode: once the active master control node goes down, ZooKeeper elects the standby master control node to take over the management and coordination work.
Preferably, the job scheduling device audits and maintains its state based on the job state information fed back by the job running device, and the job metadata database maintains consistency with the job metadata cache of the job scheduling device. When the job scheduling device fails, the standby job scheduling device, upon taking over, interacts with the job metadata database once to rebuild the job metadata cache, and then audits and maintains the cached metadata by receiving job-state feedback from the job running device, so that the metadata cache stays consistent across the distributed system.
The invention has the following beneficial effects: (1) in terms of the CAP theorem, the method satisfies the two properties of high availability and partition (fault) tolerance, and adopts mechanisms that keep job metadata as consistent as possible; (2) multi-tenant concurrency control is realized with distributed read-write locks, providing multi-tenant data integration services in a cloud-service mode; (3) because data integration jobs require frequent scheduling, a caching mechanism is used for job metadata, which effectively reduces the latency and interruption risks caused by frequent metadata access.
Drawings
FIG. 1 is a schematic flow diagram of the apparatus of the present invention;
FIG. 2 is a schematic diagram of the high availability mechanism of the apparatus of the present invention;
FIG. 3 is a schematic flow diagram of the method of the present invention;
FIG. 4 is a schematic flow diagram of a job management module of the present invention;
FIG. 5 is a schematic representation of the operation of the job management module of the present invention;
FIG. 6 is a process flow diagram of a job preloading module of the present invention;
FIG. 7 is a schematic flow diagram of a resource scheduling module of the present invention;
FIG. 8 is a schematic flow diagram of the job running device of the present invention.
Detailed Description
The invention will be further described with reference to specific examples, but the scope of the invention is not limited thereto:
Example: as shown in fig. 1, a distributed data integration job scheduling apparatus is composed of a job scheduling device and a job running device, which exchange information with each other. The job scheduling device comprises a job management module, a job preloading module and a resource scheduling module: the job management module receives, caches and stores job-related meta-information and performs concurrency control; the job preloading module acquires the jobs to be processed from the job management module and determines the scheduling priority order; the resource scheduling module completes resource allocation and schedule distribution by acquiring the job preloading information and the computing-resource information of the job running device. The job running device comprises a master control node responsible for management and coordination and working nodes responsible for executing the data integration jobs.
As shown in fig. 2, both the job scheduling device and the job running device are registered in ZooKeeper. The job scheduling device adopts an active-standby mode: once the active device goes down, ZooKeeper elects the standby device to take over the job scheduling work. The master control nodes in the job running device likewise adopt an active-standby mode: once the active master control node goes down, ZooKeeper elects the standby master control node to take over the management and coordination work.
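The patent does not name a client library for this election; as one possible sketch, Apache Curator's LeaderLatch recipe can implement the active-standby election against ZooKeeper described above. The connection string, latch path and instance id below are illustrative.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

/** Illustrative active-standby election for the job scheduling device via ZooKeeper. */
public class SchedulerElection {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",           // hypothetical ZooKeeper ensemble
                new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Every scheduler instance registers under the same election path;
        // ZooKeeper grants leadership to exactly one of them at a time.
        LeaderLatch latch = new LeaderLatch(zk, "/scheduler/leader", "scheduler-1");
        latch.start();

        latch.await();                                  // blocks until this node is elected
        // From here on this instance is the active job scheduling device:
        // rebuild the job-metadata cache from the metadata database, then audit it
        // against job-state feedback from the job running device before scheduling.
        System.out.println("Elected as active scheduler: " + latch.getId());
    }
}
```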
The job scheduling device audits and maintains its state based on the job state information fed back by the job running device, and the job metadata database maintains consistency with the job metadata cache of the job scheduling device. When the job scheduling device fails, the standby job scheduling device, upon taking over, interacts with the job metadata database once to rebuild the job metadata cache, and then audits and maintains the cached metadata by receiving job-state feedback from the job running device, so that the metadata cache stays consistent across the distributed system; weak consistency is thereby ensured.
As shown in fig. 3, a distributed data integration job scheduling method includes the following steps:
s100: the job scheduling device issues the data integration job to the job running device, and the job scheduling device consists of a job management module, a job preloading module and a resource scheduling module, and specifically comprises the following parts:
s101: and the operation management module receives, caches and stores the operation related meta information and performs concurrency control. The job management module consists of an information receiving unit, a storage processing unit and a concurrency control unit. As shown in fig. 4, the specific operations are as follows:
(1) the information receiving unit S101-1 is responsible for receiving job submission, job meta-information modification and scheduling policy update; receiving unallocated resource job information fed back by the resource scheduling module, and updating job states; receiving operation state information fed back by the operation running device and updating the operation state;
(2) the information caching unit S101-2 is responsible for locally caching the operation meta information and the state information and supporting frequent real-time query;
(3) the persistent storage unit S101-3 maintains the data consistency of the cache layer and the storage layer according to the metadata information of the persistent job of the cache content;
(4) the concurrency control unit S101-4 is responsible for assigning a read-write lock to access of each job resource.
Specifically, as shown in fig. 5, the job management module of the invention is responsible for receiving jobs and maintaining the job state machine; the main job states may include: not started, waiting to be scheduled, suspended, running, stopped, abnormal and finished. The job management module provides job-state operation interfaces, such as interfaces for stopping a running job, suspending a job, scheduling a job, normally stopping a job and abnormally stopping a job. It maintains various job scheduling policies, such as repeated jobs, timed jobs, cron jobs and one-off jobs, and internally maintains a history storage module that records all scheduling history. A read-write lock is added to operations on the cache layer and the metadata persistence layer to realize concurrency control: if a concurrent thread performs a write operation, the lock is taken in exclusive mode and other threads cannot acquire it; conversely, if the concurrent thread performs a read operation, the lock is taken in shared mode and other threads can acquire it concurrently. The job management module adds a cache layer on top of the job metadata storage layer to serve frequent metadata queries and calls and to support the frequent access and frequent scheduling of quasi-real-time data integration tasks; at the implementation level the cache layer is abstracted into an SPI interface, supporting cache implementations such as Caffeine, JDK, Guava and Redis. The persistence layer is likewise abstracted into an SPI interface and supports relational databases, MongoDB and other databases. Because writing job data to the persistence layer may lead to data inconsistency due to network problems and other instabilities, the job management module uses a "triple insurance" at the implementation level to keep the metadata as consistent as possible: (1) fault-tolerant storage: updated data is written to a local database/file and written back to the storage layer after the network returns to normal; (2) a circuit breaker (fuse): when the fault-tolerance mechanism is triggered up to a certain threshold, the breaker opens, the service degrades, and no new task is scheduled; (3) the job state interface of the job running system is encapsulated, and each time a job scheduling node starts, the job running states are obtained and audited to ensure that the job states in the job running system are consistent with the states in the metadata storage layer.
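A minimal sketch of the per-job-resource read-write locking described above, assuming a registry keyed by job id (the class JobLockRegistry and its methods are illustrative, not from the patent): reads take the shared lock, writes take the exclusive lock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Illustrative per-job-resource read-write locking for concurrency control. */
public class JobLockRegistry {
    private final ConcurrentHashMap<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReentrantReadWriteLock lockFor(String jobId) {
        return locks.computeIfAbsent(jobId, id -> new ReentrantReadWriteLock());
    }

    /** Reads take the shared lock: concurrent readers may hold it together. */
    public <T> T read(String jobId, Supplier<T> readOp) {
        ReentrantReadWriteLock.ReadLock lock = lockFor(jobId).readLock();
        lock.lock();
        try {
            return readOp.get();                 // e.g. query job meta-information from cache
        } finally {
            lock.unlock();
        }
    }

    /** Writes take the exclusive lock: no other reader or writer can hold it. */
    public void write(String jobId, Runnable writeOp) {
        ReentrantReadWriteLock.WriteLock lock = lockFor(jobId).writeLock();
        lock.lock();
        try {
            writeOp.run();                       // e.g. update job state in cache and storage
        } finally {
            lock.unlock();
        }
    }
}
```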
S102: as shown in fig. 6, the job preloading module acquires the jobs to be processed from the job management module and determines the scheduling priority order. The job preloading module consists of a real-time query unit, a job preloading unit and a fault processing unit. The specific operations are as follows:
(1) the real-time query unit S102-1 is responsible for querying the job metadata cache in real time and acquiring unlocked jobs waiting to be scheduled;
(2) the job preloading unit S102-2 is responsible for adding the unlocked jobs to be scheduled to the bounded ordered queue and sorting them by job scheduling time and job priority;
(3) the fault processing unit S102-3 is responsible for receiving fault information from the job running cluster and performing fault-tolerance processing.
Specifically, a bounded ordered blocking queue is built to load all jobs waiting to be scheduled. Bounded means that an upper limit on the number of jobs is guaranteed and the upper-limit parameter can be set from a scenario evaluation; ordered means that jobs with earlier trigger times and higher priorities are placed toward the front of the queue for preferential scheduling. When a job is stopped or deleted, removal of the specified waiting job from the queue is supported. A producer-consumer model is adopted, and a thread blocking-wakeup approach is used to reduce CPU load.
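One way the bounded ordered blocking queue could be realized, sketched in Java with hypothetical names: a capacity-bounded priority queue ordered by trigger time and then priority, with blocking put/take following the producer-consumer model so threads sleep instead of spinning, plus removal of a specific job when it is stopped or deleted.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative bounded, ordered, blocking queue of jobs waiting to be scheduled. */
public class BoundedOrderedJobQueue {

    /** Minimal job descriptor; fields are hypothetical. */
    public static final class PendingJob {
        final String jobId;
        final long triggerTimeMillis;   // earlier trigger time => scheduled first
        final int priority;             // higher priority => scheduled first
        public PendingJob(String jobId, long triggerTimeMillis, int priority) {
            this.jobId = jobId;
            this.triggerTimeMillis = triggerTimeMillis;
            this.priority = priority;
        }
    }

    private static final Comparator<PendingJob> ORDER =
            Comparator.comparingLong((PendingJob j) -> j.triggerTimeMillis)
                      .thenComparingInt(j -> -j.priority);   // larger priority sorts earlier

    private final PriorityQueue<PendingJob> queue = new PriorityQueue<>(ORDER);
    private final int capacity;                  // "bounded": upper limit from scenario evaluation
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notEmpty = lock.newCondition();
    private final Condition notFull = lock.newCondition();

    public BoundedOrderedJobQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Producer side: the job preloading unit adds an unlocked job, blocking if the queue is full. */
    public void put(PendingJob job) throws InterruptedException {
        lock.lockInterruptibly();
        try {
            while (queue.size() >= capacity) {
                notFull.await();                           // thread blocks instead of spinning
            }
            queue.offer(job);
            notEmpty.signal();                             // wake one waiting consumer
        } finally {
            lock.unlock();
        }
    }

    /** Consumer side: the resource scheduling module takes the head job, blocking if empty. */
    public PendingJob take() throws InterruptedException {
        lock.lockInterruptibly();
        try {
            while (queue.isEmpty()) {
                notEmpty.await();
            }
            PendingJob head = queue.poll();
            notFull.signal();
            return head;
        } finally {
            lock.unlock();
        }
    }

    /** Stop/delete support: remove a specific job that is still waiting to be scheduled. */
    public boolean remove(String jobId) {
        lock.lock();
        try {
            boolean removed = queue.removeIf(j -> j.jobId.equals(jobId));
            if (removed) {
                notFull.signal();
            }
            return removed;
        } finally {
            lock.unlock();
        }
    }
}
```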
Fault-tolerance processing means that when a working node goes down or its network is disconnected for a long time while a job is running, the job running device notifies the job preloading module, which connects to the working node to check whether the connection is unavailable; if it is unavailable, the job is put directly back into the queue, and the lost job is eventually scheduled to another available node. If the disconnected node later recovers, the job running device directly kills the job process, ensuring that the same job never runs on two nodes of one job running device at the same time.
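A compact sketch of that fault-tolerance path, reusing the BoundedOrderedJobQueue sketched above (again with hypothetical names, and with a plain TCP probe standing in for whatever connectivity check is actually used): on a loss-of-connection report, the preloading module probes the worker once more, and only if it is truly unreachable does the job go back into the queue to be rescheduled elsewhere.

```java
/** Illustrative handling of a reported worker failure in the job preloading module. */
public class FailureHandler {
    private final BoundedOrderedJobQueue queue;

    public FailureHandler(BoundedOrderedJobQueue queue) {
        this.queue = queue;
    }

    /**
     * Called when the job running device reports that a worker node is down or its
     * network has been disconnected for a long time while a job was running on it.
     */
    public void onWorkerLost(String workerAddress, BoundedOrderedJobQueue.PendingJob lostJob)
            throws InterruptedException {
        if (!isReachable(workerAddress)) {
            // The connection really is unavailable: requeue the job so it is eventually
            // scheduled onto another available node. If the lost worker later recovers,
            // the job running device kills the stale process so the same job never runs
            // on two nodes of one job running device at the same time.
            queue.put(lostJob);
        }
    }

    /** Hypothetical connectivity probe (a TCP connect with a timeout). */
    private boolean isReachable(String workerAddress) {
        try (java.net.Socket socket = new java.net.Socket()) {
            String[] hostPort = workerAddress.split(":");
            socket.connect(new java.net.InetSocketAddress(hostPort[0],
                    Integer.parseInt(hostPort[1])), 3000);
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
```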
S103: the resource scheduling module completes resource allocation and schedule distribution by acquiring the job preloading information and the computing-resource information of the job running device. The resource scheduling module consists of a resource obtaining unit, a resource allocation unit and a scheduling/distributing unit, as shown in fig. 7:
(1) the resource obtaining unit S103-1 is responsible for obtaining the computing resources of the job running cluster and caching them in memory;
(2) the resource allocation unit S103-2 is responsible for taking all jobs from the bounded ordered queue and allocating computing resources to each job in order of job priority;
(3) the scheduling/distributing unit S103-3 is responsible for assigning an executor to each scheduled job and sending the job meta-information, the allocated computing resources and the executor configuration to the job running cluster.
The Executor may be implemented as a Linux container executor, a Docker executor or another executor; such container executors achieve isolation of computing resources. A combined sketch of one scheduling round over these three units is given below.
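The following is a hedged sketch only, with all types and fields hypothetical: jobs already ordered by trigger time and priority are matched against cached node resources and dispatched with their executor configuration, while jobs that cannot be placed are reported back so the job management module can update their state.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative resource scheduling round: allocate resources to jobs and dispatch them. */
public class ResourceScheduler {

    /** Cached view of one worker node's resources, refreshed from the job running device. */
    public static final class NodeResources {
        final String node;
        double freeCpus;
        long freeMemMb;
        NodeResources(String node, double freeCpus, long freeMemMb) {
            this.node = node; this.freeCpus = freeCpus; this.freeMemMb = freeMemMb;
        }
    }

    /** What a job asks for; names are hypothetical. */
    public record JobRequest(String jobId, double cpus, long memMb, String executorImage) {}

    private final Map<String, NodeResources> resourceCache = new ConcurrentHashMap<>();

    /** Resource obtaining unit: cache the cluster's computing resources in memory. */
    public void updateNodeResources(NodeResources r) {
        resourceCache.put(r.node, r);
    }

    /** Resource allocation unit: pick the first node with enough spare CPU and memory. */
    private Optional<NodeResources> allocate(JobRequest job) {
        return resourceCache.values().stream()
                .filter(n -> n.freeCpus >= job.cpus() && n.freeMemMb >= job.memMb())
                .findFirst();
    }

    /** One scheduling round over jobs already ordered by trigger time and priority. */
    public void scheduleRound(List<JobRequest> jobsInPriorityOrder, Dispatcher dispatcher) {
        for (JobRequest job : jobsInPriorityOrder) {
            Optional<NodeResources> node = allocate(job);
            if (node.isPresent()) {
                NodeResources n = node.get();
                n.freeCpus -= job.cpus();               // reserve the allocated resources
                n.freeMemMb -= job.memMb();
                // Scheduling/distributing unit: send meta-info, allocation and executor config.
                dispatcher.dispatch(job, n.node);
            } else {
                // No resources available: feed back to the job management module so the
                // job state is updated and the job is retried in a later round.
                dispatcher.reportUnallocated(job);
            }
        }
    }

    /** Hypothetical hook to the job running cluster / job management module. */
    public interface Dispatcher {
        void dispatch(JobRequest job, String node);
        void reportUnallocated(JobRequest job);
    }
}
```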
S200: the job running device receives the scheduled task, starts job execution, sends job running state information to the job management module, feeds back working-node computing resources to the resource scheduling module, and feeds back loss-of-connection or fault information to the job preloading module.
The job running device has a master control node and working nodes; the master control node is responsible for management and coordination, and the working nodes are responsible for executing the data integration jobs. The master control node receives the job meta-information, job resource allocation information and job executor information distributed by the resource scheduling module and starts job execution. The agent program on a working node collects job state information and sends it to the master control node, which forwards it to the job management module; the agent collects working-node computing resources and sends them to the master control node, which forwards them to the resource scheduling module; the agent sends heartbeat information to the master control node, and the master control node sends loss-of-connection or fault information to the job preloading module. The executor on a working node has a retry mechanism: once the data-flow source or target goes down or loses its connection, the executor performs timed retries so that normal operation can continue after the source and target recover.
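The executor's timed-retry behaviour when a data-flow source or target goes down might be sketched as follows (the Endpoint abstraction and the retry interval are assumptions, not from the patent): retry at a fixed interval and resume the job once both ends are reachable again.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative timed-retry loop inside an executor when a data source/target is lost. */
public class DataFlowRetry {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    /** Hypothetical endpoint abstraction for the data-flow source or target. */
    public interface Endpoint {
        boolean isReachable();
    }

    /**
     * Called when the source or target goes down or loses its connection mid-job.
     * Retries every retrySeconds and resumes the job once both ends are back.
     */
    public void retryUntilRecovered(Endpoint source, Endpoint target,
                                    long retrySeconds, Runnable resumeJob) {
        timer.scheduleWithFixedDelay(() -> {
            if (source.isReachable() && target.isReachable()) {
                resumeJob.run();        // both ends recovered: continue normal operation
                timer.shutdown();       // stop the timed retries
            }
            // otherwise: do nothing and let the next scheduled attempt fire
        }, retrySeconds, retrySeconds, TimeUnit.SECONDS);
    }
}
```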
As shown in fig. 8, the job running device may perform distributed resource management based on a Mesos cluster system: the master control node provides low-latency local metadata management using a RAM + WAL log approach, maintains job-state synchronization across a large number of working nodes using the Paxos algorithm, and pushes specific physical resources to the job scheduling system based on the master control node's unified cluster physical-resource management interface and a specific resource-sharing policy. The job running cluster provides two modes, a multi-language driver package and JSON RPC, for the job scheduling system to register and to obtain specific callback events. The agent program is responsible for collecting working-node resources, running specific scheduled tasks through the executor, and returning the executor's execution results and task states to the master control node, which then forwards them to the job scheduling device.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A distributed data integration job scheduling method, characterized by comprising the following steps: (1) a job scheduling device issues the data integration job to a job running device, wherein the job scheduling device comprises a job management module, a job preloading module and a resource scheduling module:
(1.1) the job management module receives, caches and stores job-related meta-information and performs concurrency control; the job management module comprises an information receiving unit, an information caching unit, a persistent storage unit and a concurrency control unit, which operate as follows:
(1.1.1) the information receiving unit receives job submissions, job meta-information modifications and scheduling policy updates; receives information on jobs with unallocated resources fed back by the resource scheduling module and updates job states; and receives job state information fed back by the job running device and updates job states;
(1.1.2) the information caching unit locally caches job meta-information and state information and supports frequent real-time queries;
(1.1.3) the persistent storage unit persists job metadata from the cached content and maintains data consistency between the cache layer and the storage layer; the data consistency between the cache layer and the storage layer is maintained as follows:
(a) fault-tolerant storage is used: updated data is written to a local file and written back to the storage layer after the network returns to normal;
(b) a circuit breaker (fuse) is used: when the fault-tolerance mechanism is triggered up to a preset threshold, the breaker opens, the service degrades, and no new task is scheduled;
(c) the job state interface of the job running device is encapsulated; each time a job scheduling node starts, the job running states are obtained and audited to ensure that the job states in the job running device are consistent with the states in the metadata storage layer;
(1.1.4) the concurrency control unit assigns a read-write lock to the access of each job resource;
(1.2) the job preloading module acquires the jobs to be processed from the job management module and determines a bounded ordered queue and scheduling priorities; the bounded ordered queue is used to load all jobs waiting to be scheduled, where bounded means that an upper limit on the number of jobs is guaranteed and the upper-limit parameter is set from a scenario evaluation; ordered means that jobs with earlier trigger times and higher priorities are placed toward the front of the queue for preferential scheduling; when a job is stopped or deleted, removal of the specified waiting job from the queue is supported; a producer-consumer model is adopted, and a thread blocking-wakeup approach is used to reduce CPU load;
(1.3) the resource scheduling module completes resource allocation and schedule distribution by acquiring the job preloading information and the computing-resource information of the job running device;
(2) the job running device receives the scheduled task, starts job execution, feeds back job running state information to the job management module, feeds back working-node computing resources to the resource scheduling module, and feeds back loss-of-connection or fault information to the job preloading module.
2. The distributed data integration job scheduling method according to claim 1, wherein the job preloading module comprises a real-time query unit, a job preloading unit and a fault processing unit, and step (1.2) operates as follows:
(I) the real-time query unit queries the job metadata cache in real time to acquire unlocked jobs waiting to be scheduled;
(II) the job preloading unit adds the unlocked jobs to be scheduled to the bounded ordered queue, sorted by job scheduling time and job priority;
(III) the fault processing unit receives fault information from the job running device and performs fault-tolerance processing; fault-tolerance processing means that when a working node goes down or its network is disconnected for a long time while a job is running, the job running device notifies the job preloading module, which connects to the working node to check whether the connection is unavailable; if it is unavailable, the job is put directly back into the queue, and the lost job is eventually scheduled to another available node; if the disconnected node later recovers, the job running device directly kills the job process, ensuring that the same job never runs on two nodes of one job running device at the same time.
3. The distributed data integration job scheduling method according to claim 1, wherein the resource scheduling module comprises a resource obtaining unit, a resource allocation unit and a scheduling/distributing unit, and step (1.3) operates as follows:
1) the resource obtaining unit obtains the computing resources of the job running cluster and caches them in memory;
2) the resource allocation unit takes all jobs from the bounded ordered queue and allocates computing resources to each job in order of job priority;
3) the scheduling/distributing unit assigns an executor to each scheduled job and sends the job meta-information, the allocated computing resources and the executor configuration to the job running cluster.
4. The distributed data integration job scheduling method according to claim 1, wherein in step (2), the job running device comprises a master control node and working nodes; the master control node is responsible for management and coordination, and the working nodes are responsible for executing the data integration jobs; the master control node receives the job meta-information, job resource allocation information and job executor information distributed by the resource scheduling module and starts job execution; the agent program on a working node collects job state information and sends it to the master control node, which sends it to the job management module; the agent program collects working-node computing resources and sends them to the master control node, which sends them to the resource scheduling module; the agent program sends heartbeat information to the master control node, and the master control node sends loss-of-connection or fault information to the job preloading module; the executor on a working node has a retry mechanism, so that once the data-flow source or target goes down or loses its connection, the executor performs timed retries to ensure that normal operation continues after the source and target recover.
5. The distributed data integration job scheduling method according to claim 1, wherein the job running device performs distributed resource management based on a Mesos cluster system; the master control node provides low-latency local metadata management using a RAM + WAL log approach, maintains job-state synchronization across the working nodes using the Paxos algorithm, and pushes specific physical resources to the job scheduling system based on a unified cluster physical-resource management interface and a specific resource-sharing policy; the job running cluster provides two modes, a multi-language driver package and JSON RPC, for the job scheduling system to register and to obtain specific callback events; the agent program is responsible for collecting working-node resources, running specific scheduled tasks through the executor, and returning the executor's execution results and task states to the master control node, which then forwards them to the job scheduling device.
6. A distributed data integration job scheduling apparatus to which the method of claim 1 is applied, comprising a job scheduling device and a job running device that exchange information with each other; the job scheduling device comprises a job management module, a job preloading module and a resource scheduling module; the job management module is used to receive, cache and store job-related meta-information and perform concurrency control; the job preloading module is used to acquire the jobs to be processed from the job management module and determine the scheduling priorities; the resource scheduling module is used to complete resource allocation and schedule distribution by acquiring the job preloading information and the computing-resource information of the job running device; the job running device comprises a master control node responsible for management and coordination and working nodes responsible for executing the data integration jobs.
7. The distributed data integration job scheduling apparatus according to claim 6, wherein both the job scheduling device and the job running device are registered in ZooKeeper; the job scheduling device adopts an active-standby mode, and once the active device goes down, ZooKeeper elects the standby device to take over the job scheduling work; the master control nodes in the job running device adopt an active-standby mode, and once the active master control node goes down, ZooKeeper elects the standby master control node to take over the management and coordination work.
8. The distributed data integration job scheduling apparatus according to claim 7, wherein the job scheduling device audits and maintains its state based on job state information fed back by the job running device, and the job metadata database maintains consistency with the job metadata cache of the job scheduling device; when the job scheduling device fails, the standby job scheduling device, upon taking over, interacts with the job metadata database once to rebuild the job metadata cache, and then audits and maintains the cached metadata by receiving job-state feedback from the job running device, so that the metadata cache stays consistent across the distributed system.
CN201910489422.2A 2019-06-06 2019-06-06 Distributed data integration job scheduling method and device Active CN110362390B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910489422.2A CN110362390B (en) 2019-06-06 2019-06-06 Distributed data integration job scheduling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910489422.2A CN110362390B (en) 2019-06-06 2019-06-06 Distributed data integration job scheduling method and device

Publications (2)

Publication Number Publication Date
CN110362390A CN110362390A (en) 2019-10-22
CN110362390B true CN110362390B (en) 2021-09-07

Family

ID=68215696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910489422.2A Active CN110362390B (en) 2019-06-06 2019-06-06 Distributed data integration job scheduling method and device

Country Status (1)

Country Link
CN (1) CN110362390B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111045802B (en) * 2019-11-22 2024-01-26 中国联合网络通信集团有限公司 Redis cluster component scheduling system and method and platform equipment
CN111124806B (en) * 2019-11-25 2023-09-05 山东鲁软数字科技有限公司 Method and system for monitoring equipment state in real time based on distributed scheduling task
CN111338770A (en) * 2020-02-12 2020-06-26 咪咕文化科技有限公司 Task scheduling method, server and computer readable storage medium
CN111580990A (en) * 2020-05-08 2020-08-25 中国建设银行股份有限公司 Task scheduling method, scheduling node, centralized configuration server and system
CN112200534A (en) * 2020-09-24 2021-01-08 中国建设银行股份有限公司 Method and device for managing time events
CN112328383A (en) * 2020-11-19 2021-02-05 湖南智慧畅行交通科技有限公司 Priority-based job concurrency control and scheduling algorithm
CN112131318B (en) * 2020-11-30 2021-03-16 北京优炫软件股份有限公司 Pre-written log record ordering system in database cluster
CN112527488A (en) * 2020-12-21 2021-03-19 浙江百应科技有限公司 Distributed high-availability task scheduling method and system
CN112835717A (en) * 2021-02-05 2021-05-25 远光软件股份有限公司 Integrated application processing method and device for cluster
CN113778676B (en) * 2021-09-02 2023-05-23 山东派盟网络科技有限公司 Task scheduling system, method, computer device and storage medium
CN113986507A (en) * 2021-11-01 2022-01-28 佛山技研智联科技有限公司 Job scheduling method and device, computer equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6490572B2 (en) * 1998-05-15 2002-12-03 International Business Machines Corporation Optimization prediction for industrial processes
CN101309208A (en) * 2008-06-21 2008-11-19 华中科技大学 Job scheduling system suitable for grid environment and based on reliable expense
CN101599026A (en) * 2009-07-09 2009-12-09 浪潮电子信息产业股份有限公司 A kind of cluster job scheduling system with resilient infrastructure
US9141433B2 (en) * 2009-12-18 2015-09-22 International Business Machines Corporation Automated cloud workload management in a map-reduce environment
CN104317650A (en) * 2014-10-10 2015-01-28 北京工业大学 Map/Reduce type mass data processing platform-orientated job scheduling method
CN104462370A (en) * 2014-12-09 2015-03-25 北京百度网讯科技有限公司 Distributed task scheduling system and method
CN109327509A (en) * 2018-09-11 2019-02-12 武汉魅瞳科技有限公司 A kind of distributive type Computational frame of the lower coupling of master/slave framework

Also Published As

Publication number Publication date
CN110362390A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110362390B (en) Distributed data integration job scheduling method and device
US11561841B2 (en) Managing partitions in a scalable environment
US10817478B2 (en) System and method for supporting persistent store versioning and integrity in a distributed data grid
US8799248B2 (en) Real-time transaction scheduling in a distributed database
CN104793988B (en) The implementation method and device of integration across database distributed transaction
Lin et al. Towards a non-2pc transaction management in distributed database systems
JP5214105B2 (en) Virtual machine monitoring
US20040158549A1 (en) Method and apparatus for online transaction processing
US20090172142A1 (en) System and method for adding a standby computer into clustered computer system
EP3198430A1 (en) System and method for supporting dynamic thread pool sizing in a distributed data grid
US9128895B2 (en) Intelligent flood control management
JP2015506523A (en) Dynamic load balancing in a scalable environment
EP2673711A1 (en) Method and system for reducing write latency for database logging utilizing multiple storage devices
US11550820B2 (en) System and method for partition-scoped snapshot creation in a distributed data computing environment
CN114064414A (en) High-availability cluster state monitoring method and system
US9703634B2 (en) Data recovery for a compute node in a heterogeneous database system
CN111580951A (en) Task allocation method and resource management platform
CN113342511A (en) Distributed task management system and method
JP2010218159A (en) Management device, database management method and program
Lev-Ari et al. Quick: a queuing system in cloudkit
JPH01112444A (en) Data access system
Martin et al. Near real-time analytics with ibm db2 analytics accelerator
WO2016048831A1 (en) System and method for supporting dynamic thread pool sizing in a distributed data grid
CN117931302A (en) Parameter file saving and loading method, device, equipment and storage medium
CN114416372A (en) Request processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Patentee after: Yinjiang Technology Co.,Ltd.

Address before: 310012 1st floor, building 1, 223 Yile Road, Hangzhou City, Zhejiang Province

Patentee before: ENJOYOR Co.,Ltd.