CN110134533B - System and method capable of scheduling data in batches - Google Patents

System and method capable of scheduling data in batches

Info

Publication number
CN110134533B
Authority
CN
China
Prior art keywords
node
scheduling
layer
nodes
project
Prior art date
Legal status
Active
Application number
CN201910399131.4A
Other languages
Chinese (zh)
Other versions
CN110134533A (en)
Inventor
黄清明
Current Assignee
Chongqing Tianpeng Network Co ltd
Original Assignee
Chongqing Tianpeng Network Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Tianpeng Network Co ltd filed Critical Chongqing Tianpeng Network Co ltd
Priority to CN201910399131.4A priority Critical patent/CN110134533B/en
Publication of CN110134533A publication Critical patent/CN110134533A/en
Application granted granted Critical
Publication of CN110134533B publication Critical patent/CN110134533B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/54 Indexing scheme relating to G06F9/54
    • G06F2209/541 Client-server

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of big data processing, and particularly relates to a system and a method capable of scheduling data in batches. The system comprises: a framework building unit for building a three-layer framework of the system; a project creating unit for acquiring project creation information from a user's secondary development and deploying multilevel scheduling nodes based on the project creation information; and an operation scheduling unit for performing load-balanced scheduling of batch tasks through the multilevel scheduling nodes. The invention can not only schedule data in batches but also allow manual intervention through settings; the load is balanced during scheduling and the scheduling control strategy is complete.

Description

System and method capable of scheduling data in batches
Technical Field
The invention belongs to the technical field of big data processing, and particularly relates to a system and a method capable of scheduling data in batches.
Background
In the big data era, data is gold: it is an important asset of society as a whole and of every enterprise and organization, so managing data well and using data well are important propositions for the whole of society. Before data can be used well, it must first be managed well, and batch scheduling automation is an important guarantee of good data management. In data warehouses, data marts and data pools of every size, batch scheduling automation is used to carry out, in an orderly and efficient way, tasks such as the ingestion, storage, cleaning, filtering, coarse processing and fine processing of large volumes of data.
Currently, the existing Azkaban scheduling tool can handle relatively complex scheduling tasks based on timed jobs, time intervals and dependency relations. However, the scale at which Azkaban can schedule is limited, and it suffers from drawbacks such as inflexible manual intervention, unbalanced scheduling load and an incomplete scheduling control strategy.
Disclosure of Invention
In view of the defects in the prior art, the invention provides a system and a method capable of scheduling data in batches, which can not only schedule data in batches but also allow manual intervention through settings, with balanced load during scheduling and a complete scheduling control strategy.
In a first aspect, the present invention provides a system capable of scheduling data in batches, including:
a framework building unit for building a three-layer framework of the system;
a project creating unit for acquiring project creation information from a user's secondary development and deploying the multilevel scheduling nodes based on the project creation information;
and an operation scheduling unit for performing load-balanced scheduling of batch tasks through the multilevel scheduling nodes.
The three-layer architecture comprises an application layer, a control layer and a target layer.
Wherein a three-layer architecture of the system is built by adopting a typical C/S mode.
Project creation information from a user's secondary development is acquired through the application layer, and the multilevel scheduling nodes of the control layer are deployed according to the project creation information.
In the running process of the project, the control layer performs load-balanced batch task scheduling on the target layer through a multi-stage scheduling node, and the target layer executes a corresponding task program according to the batch task scheduling of the control layer.
The application layer is a client, the control layer is a server, and the target layer is a task program deployed on the ETL server.
The control layer is of a multi-level pyramid structure and is composed of various different types of nodes, the control layer comprises EM nodes, Server nodes and Agent nodes, and the Agent nodes comprise MAGent nodes and SAgent nodes;
the EM node is used for communicating with the application layer, controlling the access authority of the application layer and managing and controlling the effective operation of all nodes;
the Server node is used for respectively communicating with the EM node and the Agent node and finishing scheduling control of the Agent node;
the Agent node is used for communicating with the target layer in a master-slave Agent cascade mode, carrying out load balancing deployment according to the resource use state of the ETL server of the target layer, and distributing tasks to the relatively idle ETL server to execute a task program.
The project creating information comprises project names, nodes in the project operation flow and connection relations among the nodes.
The application layer comprises an Admin module, a Designer module and a Monitor module;
the Admin module is used for managing and setting project names;
the Designer module is used for setting each node in the project operation flow and the connection relation among the nodes;
the Monitor module is used for operating the project and monitoring the operation flow of the project.
Each node is composed of a plurality of component processes with different functions; the nodes communicate with one another through sockets, and the component processes communicate with one another through message queues.
Wherein the component processes include FDC process, DRR process, DAR process, STR process, KIM process, NLS process, SPS process, CPG process, UCD process, EMR process, JMM process, DSY process, and FIM process.
In a second aspect, the present invention further provides an automatic implementation method of batch schedulable data, which is applicable to the system of batch schedulable data according to any one of claims 1 to 7, and is characterized by comprising the following steps:
building a three-layer architecture of the system by adopting a typical C/S mode, wherein the three-layer architecture comprises an application layer, a control layer and a target layer;
acquiring project creation information of secondary development of a user through the application layer, and deploying the multilevel scheduling nodes of the control layer according to the project creation information;
in the running process of a project, the control layer performs load-balanced batch task scheduling on the target layer through a multi-stage scheduling node, and the target layer executes a corresponding task program according to the batch task scheduling of the control layer.
The control layer is of a multi-level pyramid structure and is composed of various different types of nodes, the control layer comprises EM nodes, Server nodes and Agent nodes, and the Agent nodes comprise MAGent nodes and SAgent nodes;
the EM node is used for communicating with the application layer, controlling the access authority of the application layer and managing and controlling the effective operation of all nodes;
the Server node is used for respectively communicating with the EM node and the Agent node and finishing scheduling control of the Agent node;
the Agent node is used for communicating with the target layer in a master-slave Agent cascade mode, carrying out load balancing deployment according to the resource use state of the ETL server of the target layer, and distributing tasks to the relatively idle ETL server to execute a task program.
Each node consists of a plurality of component processes with different functions, communication is completed between the nodes through a Socket, and communication is completed between the component processes through a message queue mode;
the component processes include FDC process, DRR process, DAR process, STR process, KIM process, NLS process, SPS process, CPG process, UCD process, EMR process, JMM process, DSY process, and FIM process.
The embodiments of the invention can not only schedule data in batches but also allow manual intervention through settings; the load is balanced during scheduling and the scheduling control strategy is complete.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a block diagram of a system for batch scheduling data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the three-layer architecture of the system in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart of an automated implementation method for batch scheduling of data according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Embodiment one:
the embodiment provides a system capable of scheduling data in batches, as shown in fig. 1, including:
a framework building unit for building a three-layer framework of the system;
a project creating unit for acquiring project creation information from a user's secondary development and deploying the multilevel scheduling nodes based on the project creation information;
and an operation scheduling unit for performing load-balanced scheduling of batch tasks through the multilevel scheduling nodes.
The three-layer architecture of the system constructed in this embodiment is shown in FIG. 2, where the application layer is the client, the control layer is the server, and the target layer comprises the various task programs deployed on the ETL servers. Patent document 201520554128.2 discloses a big data processing platform network architecture that includes a core layer switch, an application virtualization server, a database cluster, a storage array, a backup server and at least one switch; the application virtualization server, the database cluster, the storage array and the backup server are each connected to the core layer switch, the storage array is connected to the switch, and the switch is connected to the application virtualization server and the database cluster. That scheme provides the hardware environment required for processing big data, and it is open and extensible. At present, large amounts of data are stored mainly in traditional SQL databases, which differ greatly from the NoSQL databases used by big data technology. At the same time, because of the variety of the data, the data must be imported into the big data platform's own storage system before the platform can process it, and this import generally requires ETL (data warehouse technology) processing first, completing the extraction, cleaning and loading of the various kinds of data.
ETL, an abbreviation of the English Extract-Transform-Load, describes the process of extracting (Extract), transforming (Transform) and loading (Load) data from a source end to a destination end. A traditional ETL tool places a dedicated transformation engine between the data source and the target data warehouse, and all transformation programs run on this dedicated engine.
From a functional point of view, the application layer of the invention is mainly divided into Admin, Designer and Monitor. The control layer has a multi-level pyramid structure: the top layer consists of service control nodes that handle the various kinds of scheduling service control and provide operation and application services to the client, while the agent layer handles the control interaction with the servers of the target layer. In addition, the agent layer can control the scheduling of servers deployed in a cluster and achieve load balancing through a master-slave agent cascade. The target layer comprises the objects controlled by the whole product, such as the ETL servers and job workstations.
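For illustration only, a control-layer topology of the kind described above could be written down as a small nested configuration. Everything in this sketch is an assumption introduced for readability: the field names (em, servers, magent, sagents, etl_hosts), the host names and the ports are not defined by the invention.

```python
# Hypothetical sketch of a control-layer topology: one EM node at the top,
# Server nodes below it, and master/slave agent cascades that front the
# ETL servers of the target layer. All names are illustrative only.
control_layer_topology = {
    "em": {"host": "em-node-01", "port": 9000},
    "servers": [
        {
            "host": "server-node-01",
            "port": 9100,
            "magent": {
                "host": "magent-01",
                "port": 9200,
                # The slave agents form the execution domain (cluster)
                # that the master agent load-balances over.
                "sagents": [
                    {"host": "sagent-01", "etl_hosts": ["etl-01", "etl-02"]},
                    {"host": "sagent-02", "etl_hosts": ["etl-03", "etl-04"]},
                ],
            },
        }
    ],
}
```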
In this embodiment, after the basic three-layer architecture has been built, a plurality of projects can be created through the application layer. After the projects are created, the control layer performs load-balanced batch task scheduling on the target layer according to the requirements of the running tasks while the projects run, and the target layer executes the corresponding task programs according to this scheduling.
The application layer in the embodiment comprises an Admin module, a Designer module, a Monitor module and the like;
the Admin module is used for managing and setting project names;
the Designer module is used for setting each node in the project operation flow and the connection relation among the nodes;
the Monitor module is used for operating the project and monitoring the operation flow of the project.
The project creation information described in this embodiment includes the project name, the nodes in the project operation flow, and the connection relations between the nodes. A user creates a specific project operation flow through Admin and Designer; after creation, the operation flow can be simulated and monitored through the Monitor.
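As a purely illustrative sketch, the project creation information (project name, nodes of the operation flow, and connections between nodes) could be held in a structure like the following; the class and field names are assumptions for this example, not an API defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class ProjectDefinition:
    """Hypothetical container for project creation information: a project
    name, the nodes of the operation flow, and the directed connections
    (dependencies) between those nodes."""
    name: str
    nodes: list = field(default_factory=list)        # e.g. ["extract", "clean", "load"]
    connections: list = field(default_factory=list)  # e.g. [("extract", "clean")]

    def downstream_of(self, node: str) -> list:
        """Nodes that may only start after `node` has finished."""
        return [dst for src, dst in self.connections if src == node]

# Example: a three-step ETL operation flow created through Admin/Designer.
project = ProjectDefinition(
    name="daily_sales_etl",
    nodes=["extract", "clean", "load"],
    connections=[("extract", "clean"), ("clean", "load")],
)
print(project.downstream_of("extract"))  # ['clean']
```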
The control layer of this embodiment is a pyramid-shaped system formed by several different types of nodes: it includes the EM node (i.e., the core node), the Server nodes (i.e., the control nodes), and the Agent nodes (i.e., the proxy nodes), and the Agent nodes include the MAGent node (i.e., the master agent node) and the SAgent node (i.e., the slave agent node). These different types of nodes have different roles and functions.
The EM node is used for communicating with the application layer, controlling the access authority of the application layer and managing and controlling the effective operation of all nodes;
the Server node is used for respectively communicating with the EM node and the Agent node and finishing scheduling control of the Agent node;
the Agent node is used for communicating with the target layer in a master-slave Agent cascade mode, carrying out load balancing deployment according to the resource use state of the ETL server of the target layer, and distributing tasks to the relatively idle ETL server to execute a task program.
In this embodiment, the nodes communicate with one another through sockets. In actual operation, each transaction creates a connection, and this applies not only to communication between the client and the core node but to communication between all nodes. The core nodes are peers: each node can initiate a service request to any other node, so every node is both a client and a server. The control layer of this embodiment is a multilayer logical system expressed by deploying different nodes in different logical layers; this multilayer structure is not fixed, and the user can deploy the control layer flexibly according to the scale and requirements of the project, so the whole system can be as simple or as complex as needed.
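A minimal sketch of that peer-to-peer socket relationship is given below, assuming one JSON message per connection as the framing; the class name, the message shape and the echo "service" are placeholders, not the patent's protocol.

```python
import json
import socket
import threading

class Node:
    """Sketch of peer-to-peer node communication over sockets: every node
    listens for requests (acting as a server) and can also open a
    connection to any other node (acting as a client)."""

    def __init__(self, host: str, port: int):
        self.host, self.port = host, port

    def serve_forever(self):
        with socket.create_server((self.host, self.port)) as srv:
            while True:
                conn, _ = srv.accept()
                threading.Thread(target=self._handle, args=(conn,), daemon=True).start()

    def _handle(self, conn):
        with conn:
            request = json.loads(conn.recv(65536).decode())
            reply = {"status": "ok", "echo": request}  # placeholder service logic
            conn.sendall(json.dumps(reply).encode())

    def request(self, peer_host: str, peer_port: int, payload: dict) -> dict:
        with socket.create_connection((peer_host, peer_port)) as conn:
            conn.sendall(json.dumps(payload).encode())
            return json.loads(conn.recv(65536).decode())
```

In this sketch a node would run serve_forever() on a background thread while calling request() toward its peers, which mirrors the description above that every node is simultaneously a client and a server.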
Each node of the embodiment is composed of a plurality of component processes with different functions, and the component processes communicate with one another through message queues. The component processes include an FDC (Flow Dispatch Core) process, a DRR (Dispatch Request Router) process, a DAR (Dispatch Answer Router) process, an STR (Send Message To Remote) process, a KIM (Kernel Integrated Manager) process, an NLS (Net Listen) process, an SPS (Search Plugin State) process, a CPG (Call Plugin) process, a UCD (User Command Deal) process, an EMR (Kernel Event Manager And Release) process, a JMM (Job Mutex Manager) process, a DSY (Data Synchronous) process, and an FIM (Flow Instance Manager) process.
In this embodiment, different component processes have different functions, and in practical applications the user can select the required component processes according to the needs of the project. In order to realize both synchronous and asynchronous communication among the component processes, a request queue and a response queue are logically allocated to each component process on top of the physical message queue.
Request queue: the queue in which a component process receives request messages from other component processes; here the current process is the server that provides the service.
Response queue: the queue in which a component process receives response messages from the service processes it has called; here the current process is the client that requests the service.
Because each process has both a request queue and a response queue, each process can both provide and request services: when providing a service the component acts as a server, and when requesting a service it acts as a client. This mirrors the communication mechanism between core nodes, which is peer-to-peer; communication between the components inside a node is likewise peer-to-peer.
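The following sketch illustrates that per-component request/response pattern, using in-process queue.Queue objects as a stand-in for the physical message queue; the class, the message fields and the FDC-to-CPG example are assumptions made for the illustration.

```python
import queue

class ComponentProcess:
    """Sketch of the per-component messaging model: each component owns a
    request queue (where it acts as the server) and a response queue
    (replies to requests it issued as a client)."""

    def __init__(self, name: str):
        self.name = name
        self.request_queue = queue.Queue()   # requests from other components
        self.response_queue = queue.Queue()  # responses to our own requests

    def send_request(self, other: "ComponentProcess", body: dict):
        # Asynchronous: drop the request into the peer's request queue.
        other.request_queue.put({"from": self, "body": body})

    def serve_one(self):
        # Act as the server: take one request, do the work, reply.
        msg = self.request_queue.get()
        msg["from"].response_queue.put({"from": self.name, "result": f"handled {msg['body']}"})

# e.g. an FDC process asking a CPG process to run a plug-in:
fdc, cpg = ComponentProcess("FDC"), ComponentProcess("CPG")
fdc.send_request(cpg, {"op": "call_plugin", "plugin": "load_step"})
cpg.serve_one()
print(fdc.response_queue.get())
```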
In this embodiment, a load balancing mechanism is adopted when the ETL servers are scheduled; load-balanced deployment makes effective use of physical resources and improves ETL processing efficiency. It is realized mainly through the agent cascade: load balancing operates on a cluster, i.e. within the execution domain formed by a cascade of executing agents. Within a cluster, the task deployment on every ETL server must be identical, and the control layer automatically distributes tasks to the relatively idle ETL hosts, which execute the task programs, according to the resource usage of the ETL servers in the cluster.
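A minimal sketch of that dispatch decision follows. The patent only requires that the relatively idle machine be chosen; the concrete scoring rule (a per-host load counter) and all names here are assumptions for the example.

```python
def pick_idle_host(hosts: dict) -> str:
    """Return the relatively idle ETL host. `hosts` maps host name to its
    current load (e.g. a running-task count or a resource score reported
    by its agent); the scoring rule is an assumption of this sketch."""
    return min(hosts, key=hosts.get)

def dispatch(task: str, hosts: dict) -> str:
    host = pick_idle_host(hosts)
    hosts[host] += 1  # account for the newly assigned task
    return f"task '{task}' dispatched to {host}"

# The same task programs are deployed on every host of the execution
# domain, so any host can run the task; the least-loaded one is chosen.
cluster = {"etl-01": 3, "etl-02": 1, "etl-03": 2}
print(dispatch("clean_step", cluster))  # task 'clean_step' dispatched to etl-02
```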
In conclusion, the system can not only schedule data in batches but also allow manual intervention through settings; the load is balanced during scheduling and the scheduling control strategy is complete.
Embodiment two:
the embodiment provides an automatic implementation method for batch scheduling data, which is suitable for the system for batch scheduling data described in the first embodiment, and includes the following steps:
S1, building a three-layer architecture of the system by adopting a typical C/S mode, wherein the three-layer architecture comprises an application layer, a control layer and a target layer;
S2, acquiring project creation information from a user's secondary development through the application layer, and deploying the multilevel scheduling nodes of the control layer according to the project creation information;
S3, in the running process of the project, the control layer performs load-balanced batch task scheduling on the target layer through the multi-stage scheduling nodes, and the target layer executes the corresponding task program according to the batch task scheduling of the control layer.
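For illustration, the three steps above could be exercised end to end roughly as follows; every function and field name in this sketch is a placeholder introduced here, not an interface defined by the patent.

```python
def build_architecture() -> dict:
    """S1: describe the three layers of the typical C/S deployment."""
    return {
        "application": "client",
        "control": ["EM", "Server", "MAGent", "SAgent"],
        "target": ["etl-01", "etl-02"],  # hosts carrying the task programs
    }

def deploy_scheduling_nodes(project_info: dict, architecture: dict) -> list:
    """S2: deploy the multilevel scheduling nodes from the project creation info."""
    return [f"{level} node configured for project '{project_info['name']}'"
            for level in architecture["control"]]

def run_project(project_info: dict, architecture: dict) -> None:
    """S3: load-balanced scheduling - send each flow node to the idler ETL host."""
    load = {host: 0 for host in architecture["target"]}
    for step in project_info["nodes"]:
        host = min(load, key=load.get)  # relatively idle host
        load[host] += 1
        print(f"{step} -> {host}")

info = {"name": "daily_sales_etl", "nodes": ["extract", "clean", "load"]}
arch = build_architecture()
deploy_scheduling_nodes(info, arch)
run_project(info, arch)
```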
The three-layer architecture of the system constructed in this embodiment is shown in FIG. 2, where the application layer is the client, the control layer is the server, and the target layer comprises the various task programs deployed on the ETL servers.
From a functional point of view, the application layer is mainly divided into Admin, Designer and Monitor. The control layer has a multi-level pyramid structure: the top layer consists of service control nodes that handle the various kinds of scheduling service control and provide operation and application services to the client, while the agent layer handles the control interaction with the servers of the target layer. In addition, the agent layer can control the scheduling of servers deployed in a cluster and achieve load balancing through a master-slave agent cascade. The target layer comprises the objects controlled by the whole product, such as the ETL servers and job workstations.
In this embodiment, after the basic three-layer architecture has been built, a plurality of projects can be created through the application layer. After the projects are created, the control layer performs load-balanced batch task scheduling on the target layer according to the requirements of the running tasks while the projects run, and the target layer executes the corresponding task programs according to this scheduling.
The application layer in the embodiment comprises an Admin module, a Designer module, a Monitor module and the like;
the Admin module is used for managing and setting project names;
the Designer module is used for setting each node in the project operation flow and the connection relation among the nodes;
the Monitor module is used for operating the project and monitoring the operation flow of the project.
The project creation information described in this embodiment includes the project name, the nodes in the project operation flow, and the connection relations between the nodes. A user creates a specific project operation flow through Admin and Designer; after creation, the operation flow can be simulated and monitored through the Monitor.
The control layer of this embodiment is a pyramid-shaped system formed by several different types of nodes: it includes the EM node (i.e., the core node), the Server nodes (i.e., the control nodes), and the Agent nodes (i.e., the proxy nodes), and the Agent nodes include the MAGent node (i.e., the master agent node) and the SAgent node (i.e., the slave agent node). These different types of nodes have different roles and functions.
The EM node is used for communicating with the application layer, controlling the access authority of the application layer and managing and controlling the effective operation of all nodes;
the Server node is used for respectively communicating with the EM node and the Agent node and finishing scheduling control of the Agent node;
the Agent node is used for communicating with the target layer in a master-slave Agent cascade mode, carrying out load balancing deployment according to the resource use state of the ETL server of the target layer, and distributing tasks to the relatively idle ETL server to execute a task program.
In this embodiment, the nodes communicate with one another through sockets. In actual operation, each transaction creates a connection, and this applies not only to communication between the client and the core node but to communication between all nodes. The core nodes are peers: each node can initiate a service request to any other node, so every node is both a client and a server. The control layer of this embodiment is a multilayer logical system expressed by deploying different nodes in different logical layers; this multilayer structure is not fixed, and the user can deploy the control layer flexibly according to the scale and requirements of the project, so the whole system can be as simple or as complex as needed.
Each node of the embodiment is composed of a plurality of component processes with different functions, and the component processes communicate with one another through message queues. The component processes include an FDC (Flow Dispatch Core) process, a DRR (Dispatch Request Router) process, a DAR (Dispatch Answer Router) process, an STR (Send Message To Remote) process, a KIM (Kernel Integrated Manager) process, an NLS (Net Listen) process, an SPS (Search Plugin State) process, a CPG (Call Plugin) process, a UCD (User Command Deal) process, an EMR (Kernel Event Manager And Release) process, a JMM (Job Mutex Manager) process, a DSY (Data Synchronous) process, and an FIM (Flow Instance Manager) process.
In this embodiment, different component processes have different functions, and in practical applications the user can select the required component processes according to the needs of the project. In order to realize both synchronous and asynchronous communication among the component processes, a request queue and a response queue are logically allocated to each component process on top of the physical message queue.
Request queue: the queue in which a component process receives request messages from other component processes; here the current process is the server that provides the service.
Response queue: the queue in which a component process receives response messages from the service processes it has called; here the current process is the client that requests the service.
Because each process has both a request queue and a response queue, each process can both provide and request services: when providing a service the component acts as a server, and when requesting a service it acts as a client. This mirrors the communication mechanism between core nodes, which is peer-to-peer; communication between the components inside a node is likewise peer-to-peer.
In this embodiment, a load balancing mechanism is adopted when the ETL servers are scheduled; load-balanced deployment makes effective use of physical resources and improves ETL processing efficiency. It is realized mainly through the agent cascade: load balancing operates on a cluster, i.e. within the execution domain formed by a cascade of executing agents. Within a cluster, the task deployment on every ETL server must be identical, and the control layer automatically distributes tasks to the relatively idle ETL hosts, which execute the task programs, according to the resource usage of the ETL servers in the cluster.
In conclusion, the method can not only schedule data in batches but also allow manual intervention through settings; the load is balanced during scheduling and the scheduling control strategy is complete.
Those of ordinary skill in the art will appreciate that the systems and method steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the division of the above steps is only one logic function division, and there may be another division manner in actual implementation, for example, multiple steps may be combined into one step, and one step or multiple steps may also be split into multiple steps. And part or all of the steps can be selected according to actual needs to achieve the aim of the scheme of the embodiment of the invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (7)

1. A system for batch scheduling data, comprising: a framework building unit for building a three-layer framework of the system; the project creating unit is used for acquiring project creating information of secondary development of a user and deploying the multilevel scheduling nodes based on the project creating information; the operation scheduling unit is used for scheduling the batch tasks with balanced load through the multistage scheduling nodes; the three-layer architecture comprises an application layer, a control layer and a target layer; the control layer adopts a multi-level pyramid structure and is composed of various nodes of different types, the control layer comprises an EM node, a Server node and an Agent node, and the Agent node comprises an MAGent node and a SAgent node; the EM node is used for communicating with the application layer, controlling the access authority of the application layer and managing and controlling the effective operation of all nodes; the Server node is used for respectively communicating with the EM node and the Agent node and finishing scheduling control of the Agent node; the Agent node is used for communicating with a target layer in a master-slave Agent cascade mode, carrying out load balancing deployment according to the resource use state of the ETL server of the target layer and distributing tasks to the relatively idle ETL server to execute a task program; the MAGent node is a master agent node, and the SAgent node is a slave agent node.
2. The system capable of scheduling data in batches according to claim 1, wherein a three-layer architecture of the system is built by adopting a typical C/S mode.
3. The system capable of scheduling data in batches according to claim 2, wherein project creation information of secondary development of a user is acquired through the application layer, and the multi-level scheduling node of the control layer is deployed according to the project creation information.
4. The system of claim 2, wherein during the operation of the project, the control layer performs load-balanced batch task scheduling on the target layer through a multi-stage scheduling node, and the target layer executes a corresponding task program according to the batch task scheduling of the control layer.
5. The system of claim 2, wherein the application layer is a client, the control layer is a server, and the target layer is a task program deployed on the ETL server.
6. An automatic implementation method of batch schedulable data, which is applicable to the system of batch schedulable data of any one of claims 1-5, characterized by comprising the following steps: building a three-layer architecture of the system by adopting a typical C/S mode, wherein the three-layer architecture comprises an application layer, a control layer and a target layer; acquiring project creation information of secondary development of a user through the application layer, and deploying the multilevel scheduling nodes of the control layer according to the project creation information; in the running process of a project, the control layer performs load-balanced batch task scheduling on the target layer through a multi-stage scheduling node, and the target layer executes a corresponding task program according to the batch task scheduling of the control layer; the control layer adopts a multi-level pyramid structure and is composed of various nodes of different types, the control layer comprises an EM node, a Server node and an Agent node, and the Agent node comprises an MAGent node and a SAgent node; the EM node is used for communicating with the application layer, controlling the access authority of the application layer and managing and controlling the effective operation of all nodes; the Server node is used for respectively communicating with the EM node and the Agent node and finishing scheduling control of the Agent node; the Agent node is used for communicating with a target layer in a master-slave Agent cascade mode, carrying out load balancing deployment according to the resource use state of the ETL server of the target layer and distributing tasks to the relatively idle ETL server to execute a task program; the MAGent node is a master agent node, and the SAgent node is a slave agent node.
7. The method for realizing automation of data batch scheduling according to claim 6, wherein each node is composed of a plurality of component processes with different functions, the nodes complete communication through Socket, and the component processes complete communication through a message queue.
CN201910399131.4A 2019-05-14 2019-05-14 System and method capable of scheduling data in batches Active CN110134533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910399131.4A CN110134533B (en) 2019-05-14 2019-05-14 System and method capable of scheduling data in batches

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910399131.4A CN110134533B (en) 2019-05-14 2019-05-14 System and method capable of scheduling data in batches

Publications (2)

Publication Number Publication Date
CN110134533A CN110134533A (en) 2019-08-16
CN110134533B true CN110134533B (en) 2020-04-28

Family

ID=67573989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910399131.4A Active CN110134533B (en) 2019-05-14 2019-05-14 System and method capable of scheduling data in batches

Country Status (1)

Country Link
CN (1) CN110134533B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761046A (en) * 2021-09-13 2021-12-07 中远海运科技股份有限公司 Workflow ETL-based processing method and system
CN114553956B (en) * 2022-01-04 2024-01-09 北京国电通网络技术有限公司 Data transmission method and system based on unified extensible firmware protocol (UEP) middleware

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101951411A (en) * 2010-10-13 2011-01-19 戴元顺 Cloud scheduling system and method and multistage cloud scheduling system
CN101957780A (en) * 2010-08-17 2011-01-26 中国电子科技集团公司第二十八研究所 Resource state information-based grid task scheduling processor and grid task scheduling processing method
CN104239144A (en) * 2014-09-22 2014-12-24 珠海许继芝电网自动化有限公司 Multilevel distributed task processing system
CN109254846A (en) * 2018-08-01 2019-01-22 国电南瑞科技股份有限公司 The dynamic dispatching method and system of CPU and GPU cooperated computing based on two-level scheduler
CN109743390A (en) * 2019-01-04 2019-05-10 深圳壹账通智能科技有限公司 Method for scheduling task, device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8447872B2 (en) * 2006-11-01 2013-05-21 Intel Corporation Load balancing in a storage system
CN105703940B (en) * 2015-12-10 2021-08-20 中国电力科学研究院有限公司 Monitoring system and monitoring method for multi-level scheduling distributed parallel computation
US10275206B2 (en) * 2017-01-26 2019-04-30 Bandlab Plug-in load balancing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957780A (en) * 2010-08-17 2011-01-26 中国电子科技集团公司第二十八研究所 Resource state information-based grid task scheduling processor and grid task scheduling processing method
CN101951411A (en) * 2010-10-13 2011-01-19 戴元顺 Cloud scheduling system and method and multistage cloud scheduling system
CN104239144A (en) * 2014-09-22 2014-12-24 珠海许继芝电网自动化有限公司 Multilevel distributed task processing system
CN109254846A (en) * 2018-08-01 2019-01-22 国电南瑞科技股份有限公司 The dynamic dispatching method and system of CPU and GPU cooperated computing based on two-level scheduler
CN109743390A (en) * 2019-01-04 2019-05-10 深圳壹账通智能科技有限公司 Method for scheduling task, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Distributed computing management platform for resource sharing adapted to multi-level dispatch security and stability analysis (适应多级调度安全稳定分析资源共享的分布式计算管理平台); Fang Yongjie (方勇杰); Automation of Electric Power Systems (《电力系统自动化》); 2016-12-10; full text *

Also Published As

Publication number Publication date
CN110134533A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
Tsaregorodtsev et al. DIRAC: a community grid solution
US7760743B2 (en) Effective high availability cluster management and effective state propagation for failure recovery in high availability clusters
CN103414712B (en) A kind of distributed virtual desktop management system and method
US20170134526A1 (en) Seamless cluster servicing
WO2021147288A1 (en) Container cluster management method, device and system
CN106033373A (en) A method and a system for scheduling virtual machine resources in a cloud computing platform
CN101694709A (en) Service-oriented distributed work flow management system
CN113569987A (en) Model training method and device
CN112667362B (en) Method and system for deploying Kubernetes virtual machine cluster on Kubernetes
CN110838939B (en) Scheduling method based on lightweight container and edge Internet of things management platform
CN110134533B (en) System and method capable of scheduling data in batches
JP2019121240A (en) Workflow scheduling system, workflow scheduling method and electronic apparatus
CN112882828B (en) Method for managing and scheduling a processor in a processor-based SLURM operation scheduling system
CN110661842A (en) Resource scheduling management method, electronic equipment and storage medium
Kijsipongse et al. A hybrid GPU cluster and volunteer computing platform for scalable deep learning
CN110569113A (en) Method and system for scheduling distributed tasks and computer readable storage medium
CN114816694A (en) Multi-process cooperative RPA task scheduling method and device
CN109684028A (en) A kind of method, device and equipment that operating system is separated with user data
CN104484228A (en) Distributed parallel task processing system based on Intelli-DSC (Intelligence-Data Service Center)
Mahato et al. Dynamic and adaptive load balancing in transaction oriented grid service
CN111240824A (en) CPU resource scheduling method and electronic equipment
CN109450913A (en) A kind of multinode registration dispatching method based on strategy
CN109634749B (en) Distributed unified scheduling method and device
CN102214094A (en) Executing operations via asynchronous programming model
US20100122254A1 (en) Batch and application scheduler interface layer in a multiprocessor computing environment

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant