CN118012954A - Big data platform dispatching system - Google Patents

Big data platform dispatching system

Info

Publication number
CN118012954A
CN118012954A (application CN202410171230.8A)
Authority
CN
China
Prior art keywords
module
model
oracle
hive
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410171230.8A
Other languages
Chinese (zh)
Inventor
张朝阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tenth Research Institute Of Telecommunications Technology Co ltd
Original Assignee
Tenth Research Institute Of Telecommunications Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tenth Research Institute Of Telecommunications Technology Co ltd filed Critical Tenth Research Institute Of Telecommunications Technology Co ltd
Priority to CN202410171230.8A
Publication of CN118012954A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/275 Synchronous replication
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F 16/284 Relational databases
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a big data platform dispatching system comprising an Oracle data storage module, an Oracle-to-Hive synchronization module, a queue selection module, a dispatching main module, a model template storage processing module, a Hive data storage module and a Hive-to-Oracle synchronization module. The invention implements a dynamic model processing flow through "Linux shell + database" programming, and achieves efficient data transfer by combining database tools with Hive features.

Description

Big data platform dispatching system
Technical Field
The invention belongs to the technical field of big data, and particularly relates to a big data platform scheduling system.
Background
Workflow scheduling tools: the currently popular tools are Oozie (used on the CDH platform), Azkaban (from LinkedIn), Airflow (developed in Python) and DolphinScheduler (developed in China), of which Oozie and Azkaban are the most widely used. Compared with Azkaban, Oozie is a heavyweight task scheduling system: its functions are comprehensive, but its configuration is more complex and somewhat bulky. Azkaban is simple, provides only the core functions, is easy to configure, and is entirely sufficient.
Azkaban is a batch workflow task scheduler introduced by LinkedIn, used mainly to run a set of work units in a specific order within a workflow. A workflow is composed of multiple work units with dependency relationships among them. Here "workflow" mainly refers to big-data Hadoop and Hive workflows. For example: extract information from a user's traditional databases and logs, place it in a Hive data warehouse, then clean, summarize and compute it, and finally send it out for use by BI. That is a typical workflow.
If each such workflow were submitted manually, it would take a long time and require constant observation, which is cumbersome; moreover, these tasks must be performed every day. There are two ways to solve this problem: one is to write shell scripts and invoke them with crontab timing; the other is to use a workflow scheduling tool such as Azkaban. Whichever scheduling tool is used, there are two steps: first, describe the workflow; second, set up timed execution. The tools differ only in how they are configured.
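For illustration only, a minimal sketch of the crontab approach; the script path, log path and step names are placeholders, not part of the invention:

    # crontab entry: trigger the workflow script at 01:30 every day
    30 1 * * * /opt/etl/run_workflow.sh >> /var/log/etl/workflow.log 2>&1

    # run_workflow.sh would then "describe the workflow" as ordered shell steps,
    # e.g. extract from Oracle -> load into Hive -> run Hive jobs -> export results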
Data synchronization tools: two are in common use today, one being Sqoop and the other DataX.
Sqoop: the application scene is single, is suitable for data transmission of relational databases (Oracle, mysql and the like) and hdfs (hive, hbase and the like), and is simple to operate.
Datax: the method is a Alibaba open source data synchronization tool, can be used for data synchronization among a relational database, nosql and big data in order to solve the problem of heterogeneous data source synchronization, and is more suitable for data synchronization of configuration files in the form of etl, python scheduling json with a plurality of units of heterogeneous data sources.
YARN resource manager: besides the two parts above, queue resources must be considered when work units execute. Resources are the cluster's memory, disk, CPU and so on. The resource manager YARN is one of the three major components of big data (the other two being MapReduce for computation and HDFS for storage). Its most commonly used scheduler is the capacity scheduler, which roughly means that queues can be defined and each queue given a weight and a maximum weight: the former allocates resources to each queue in proportion, while the latter limits the maximum resources a queue may preempt.
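A minimal capacity-scheduler.xml sketch of the weight / maximum-weight idea; queue names and percentages are illustrative only:

    <configuration>
      <property>  <!-- the list of leaf queues under root -->
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>q1,q2</value>
      </property>
      <property>  <!-- weight: q1 gets 30% of cluster resources -->
        <name>yarn.scheduler.capacity.root.q1.capacity</name>
        <value>30</value>
      </property>
      <property>  <!-- maximum weight: q1 may grow to at most 60% -->
        <name>yarn.scheduler.capacity.root.q1.maximum-capacity</name>
        <value>60</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.root.q2.capacity</name>
        <value>70</value>
      </property>
    </configuration>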
The workflow requirements in this project are as follows:
1: Complete the data transfer from Oracle to Hive; denote this work unit a.
2.1: Each day, select p YARN queues according to how busy the queue resources are, and assign them to p models to run in parallel.
Each model's run result is transferred back to Oracle. Denote the work units of the p models b1, b2, …, bp.
2.2: After m (m <= p) of the p models have finished, select m queues again (if fewer than m models remain at the end, take however many remain) and assign them to the m models waiting to run: c1, c2, …, cm.
Repeat 2.2 until all models due for processing that day (daily, weekly, etc.) have been handled.
Each of the work units above is called a job.
Disadvantages of Azkaban:
An Azkaban workflow is defined by a YAML file such as the one below.
If there are 10 tasks to process, plus the initial "Oracle to Hive synchronization", there are 11 tasks in total. With parallelism p = 3, the requirements above could be defined by the following configuration file:
job0:
  type: hive
  command: "Oracle to Hive synchronization"
  execution.properties:
    execution.mode: serial         # serial execution mode
    execution.parallel.max: 1      # run only one task at a time
job1:
  type: hive
  command: "hive -f script1.sql <series of parameters>"
  execution.properties:
    execution.mode: parallel       # parallel execution mode
    execution.parallel.max: 3      # no more than 3 tasks run at the same time
job2:
  type: hive
  command: "hive -f script2.sql <series of parameters>"
  execution.properties:
    # … and so on for job3 through job10
There are three problems with the above. First, the application platform's tasks are subscribed to or stopped at any time and change continuously and dynamically. Second, seventy or eighty to several hundred models must be processed every day. Third, each model run takes roughly 10 parameters. Static configuration files therefore impose severe limitations, and manual configuration is impractical.
There are two workarounds. The first is to dynamically generate the configuration file every day. The second is to use dynamic workflow scheduling: in general, write task scripts or programs, create an Azkaban workflow, add the tasks to the workflow as independent work nodes, and submit the workflow to Azkaban for scheduling and parallel execution. The first is inflexible and not real-time, and something of a detour. The second requires a very skilled command of Azkaban and has a steep learning curve.
Moreover, the parallelism cannot be adjusted dynamically.
In addition, Azkaban has no queue management system of its own; it merely borrows some queue concepts to manage the execution order of tasks, and lacks the dedicated resource-management and queue-management mechanisms of the resource manager YARN described above. Typically, an application specifies only one YARN queue for a whole set of tasks.
In summary, Azkaban has three disadvantages. First, it is large and rigid: the project only needs its workflow capability, and the remaining functions are superfluous. Second, static configuration files and static parallelism parameters cannot adapt to tasks that change at any time, so flexibility is lacking. Third, even with dynamic workflow scheduling, fairly advanced programming is required yet no professional resource-management or queue-management functions are gained; from a technical-economics perspective, the input does not match the output.
Disadvantages of the data synchronization tools:
Sqoop is a handy tool, but in practice its reliance on MapReduce costs it a great deal of efficiency; in addition, its logs do not localize problems well.
DataX could also be used, but the application scenario does not fit: it suits ETL with many heterogeneous database units, it is rather heavyweight, and it requires hand-written JSON configuration files, which is inconvenient for anyone unfamiliar with the format.
The two tools share a common disadvantage: column-mapping configuration files cannot be generated automatically, which makes them laborious to produce.
The root causes of these shortcomings are: first, generality is bought at the cost of efficiency (Sqoop); second, the tool does not match this project's scenario (DataX).
Problems with the way YARN queues are used:
YARN queue usage runs through the entire workflow: queues are selected both during data synchronization and during model processing. Common practice is to statically assign one queue to data synchronization or to model processing. This produces uneven busy/idle load across queues and adds instability to the platform's resource usage.
The root of the problem with this usage pattern: the habitual way of thinking has not been broken.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a big data platform dispatching system comprising an Oracle data storage module, an Oracle-to-Hive synchronization module, a queue selection module, a dispatching main module, a model template storage processing module, a Hive data storage module and a Hive-to-Oracle synchronization module. The invention implements a dynamic model processing flow through "Linux shell + database" programming, and achieves efficient data transfer by combining database tools with Hive features.
The technical solution adopted to solve the technical problem is as follows:
A big data platform dispatching system comprises an Oracle data storage module, an Oracle-to-Hive synchronization module, a queue selection module, a dispatching main module, a model template storage processing module, a Hive data storage module and a Hive-to-Oracle synchronization module.
The Oracle data storage module includes 5 types of tables:
(1) Model template configuration table a: contains the model id, the model's Chinese name, and run-period configuration information;
(2) User model information table b: contains user id, model id and version-number information; table b is generated on the basis of the model template configuration table a: specifically, when a user subscribes to a model, a version is generated from the timestamp at that moment and finally inserted into the table;
(3) Several user model parameter tables c_i: contain group-information and personnel-information parameters;
(4) User model task log table d: table d consists of two parts: a task part, which adds a processing date to the contents of the user model information table b, and a log part, which holds the processing start time, end time and status; table d is generated automatically from the user model information table b every day (a hedged SQL sketch follows below);
(5) Several user model processing result tables e_i.
What is stored in the model template configuration table a is the model id of the model template storage processing module, i.e. the model's English name; the template content itself exists as SQL in the model template storage processing module; table a also holds the model's Chinese name and run-period configuration information.
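A minimal SQL sketch of the daily generation of table d from table b; the table and column names are hypothetical stand-ins derived from the description above:

    sqlplus -s etl/secret@ORCL <<'SQL'
    -- copy today's task part from table b into table d; the log part
    -- (start time, end time, status) is filled in later during processing
    INSERT INTO user_model_task_log_d (user_id, model_id, version_no, process_date)
    SELECT user_id, model_id, version_no, TRUNC(SYSDATE)
    FROM   user_model_info_b;
    COMMIT;
    SQL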
The Oracle-to-Hive synchronization module: copies the user model parameter tables c_i from Oracle to Hive, ultimately providing them to the model template storage processing module for processing.
The queue selection module: selects the several most idle queue resources and provides them, respectively, to the model template storage processing module instances running in parallel.
The model template storage processing module: stores all model template SQL. Using the model id, user id, version number and date parameters passed from the task table, it runs the corresponding model template SQL against the data resources of the Hive data warehouse in combination with the parameter tables, producing a user model processing result table e_i; it then calls the Hive-to-Oracle synchronization module to copy the data to Oracle. When processing starts, runs and ends, it records the processing start time, end time and status in the task table.
The Hive data storage module: the Hive data warehouse, which stores the user model parameter tables c_i synchronized from the Oracle database as well as the user model processing result tables e_i generated by the model template storage processing module.
The Hive-to-Oracle synchronization module: reads a user model processing result table e_i and copies e_i to Oracle.
The dispatch master module: the core of the whole platform dispatching system. First, it calls the Oracle-to-Hive module to synchronize data; second, it reads tasks from the user model task log table d of the Oracle data storage module; third, it takes idle queues from the queue selection module; fourth, it runs several models in parallel through the model template storage processing module.
The processing procedure of the big data platform dispatching system is as follows:
Step 1: a Linux scheduled task invokes the dispatch master module at a preset time every day.
Step 2: the dispatch master module calls the Oracle-to-Hive synchronization module, which reads the user model parameter tables c_i from the Oracle data storage module and copies all c_i to the Hive data storage module.
Step 3: the dispatch master module reads the user model task log table d of the Oracle data storage module and, according to the parallelism, takes several tasks each consisting of user id, model id, version number and date; the queue selection module then selects the several most idle queue resources and assigns one to each task, and the model template storage processing module is called in parallel to process the tasks. When the model template storage processing module finishes processing, it calls the Hive-to-Oracle synchronization module to copy the data of the corresponding user model processing result table e_i to Oracle, and updates the log part of the user model task log table d.
Step 4: at regular intervals, the dispatch master module checks whether any tasks have finished; if so, it re-reads the user model task log table d of the Oracle data storage module and promptly tops up with new tasks, the queue selection module again selects the most idle queue resources and assigns them to those tasks, and the model template storage processing module is called in parallel to process them. This cycle repeats until all tasks in the user model task log table d have been processed.
Preferably, the status takes one of three values: processing, success, or failure.
The beneficial effects of the invention are as follows:
1) Dynamics of the model processing flow.
The invention stores tasks (user models) in a database table; based on the user model information updated in real time in that table, it comprehensively considers the big data platform's resource usage, processes user models in parallel, and as soon as one model run finishes selects the next.
2) Selection of the several most idle queues.
The invention selects the several most idle queues and slots work into them opportunistically; it neither designates any particular queue nor puts pressure on any particular queue.
3) Efficient data transfer.
The conventional view is that Sqoop is the better choice for data transfer between Hive and a database, and DataX can also be used; but practice on Oracle shows that combining the tools sqlplus and sqlldr with Hive's own features is more efficient.
4) Logs that make problem localization easy.
For normal operation of the system and subsequent iteration, logs are indispensable. The system keeps a general model run log, a stored-procedure execution log, a master-control log produced by the shell scripts, model run logs, model processing logs, logs produced by data transfers, and so on. The logs are richer and better classified than those of the Azkaban tool, so unpredictable faults are localized more rapidly.
5) "Linux shell + database" programming can, more flexibly and on demand, dynamically select YARN queues, execute user model scripts in parallel, and complete synchronization between the database and Hive by combining database tools with Hive features. The whole flow can be controlled end to end, faults are easy to localize in the logs, and the system is easier to maintain, smooth and elegant.
Drawings
FIG. 1 is a flow diagram of the application platform and big data platform scheduling system architecture of the present invention.
FIG. 2 is a schematic diagram of Oracle to Hive synchronization of the present invention.
FIG. 3 is a schematic diagram of Hive to Oracle synchronization of the present invention.
Fig. 4 is a flow chart of the dispatch master module of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
The main technical problem addressed by the invention is this: users subscribe to various model versions through the Web pages of an application platform's subscription and display front end, and browse the run results promptly. Since user model versions are updated and iterated continuously over time, the workflow composed of each day's tasks changes dynamically, a situation the popular workflow tools were not designed to accommodate. The technology belongs to big data platform dispatching systems and adopts "Linux shell script + database" programming.
Examples:
As shown in fig. 1.
1) Select the several most idle queues according to the occupancy of YARN resources and assign them to several model-processing runs (queue selection module).
Because big data platform processing tasks are unpredictable, the queue resources of the resource manager YARN change dynamically. The prior practice of designating a fixed queue for model processing in advance leaves some urgent tasks waiting for long periods.
TABLE 1
As shown in Table 1, a total of 7 queues are configured, with capacities of 15%, 10%, 5%, 30%, 8% and 2%, adding up to 75%; that is, 75% of the whole platform's resources can be used. Multiplying each capacity by its "user_limit_factor" yields "priority usage capacities" of 30%, 20%, 15%, 75%, 8% and 2% respectively; that is, each queue may use at most that much resource.
The current usage ratios are 1.3%, 0.5%, 3.2%, 95.5%, 123.9%, 100.1% and 100.4% respectively.
When current usage <= 100%: primary remaining capacity = capacity - current usage; when current usage > 100%: primary remaining capacity = 0.
When current usage > 100%: secondary remaining capacity = (user limit factor - current usage / 100) x capacity; when current usage <= 100%: secondary remaining capacity = 0.
Queues whose "primary remaining capacity" and "secondary remaining capacity" are both 0 are removed; the remaining queues are then chosen in order of "primary remaining capacity" from largest to smallest, then "secondary remaining capacity" from largest to smallest. Here, only "queue 1" through "queue 5" would be selected.
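A minimal shell sketch of this selection, not the patent's own script: it assumes the ResourceManager scheduler REST endpoint is reachable, that the leaf-queue fields are named queueName, capacity, usedCapacity and userLimitFactor, and that jq and awk are available; the host name is a placeholder:

    #!/bin/bash
    # Rank YARN queues by primary, then secondary remaining capacity.
    RM="${RM:-http://resourcemanager:8088}"
    P="${1:-3}"                             # how many queues to pick

    curl -s "$RM/ws/v1/cluster/scheduler" |
    jq -r '.scheduler.schedulerInfo.queues.queue[]
           | [.queueName, .capacity, .usedCapacity, (.userLimitFactor // 1)] | @tsv' |
    awk -F'\t' '{
      cap = $2; used = $3; ulf = $4
      primary   = (used <= 100) ? cap - used : 0             # formula from the text
      secondary = (used >  100) ? (ulf - used/100) * cap : 0 # formula from the text
      if (secondary < 0) secondary = 0
      if (primary > 0 || secondary > 0)     # drop queues with both capacities at 0
        printf "%s\t%.2f\t%.2f\n", $1, primary, secondary
    }' |
    sort -t$'\t' -k2,2nr -k3,3nr | head -n "$P" | cut -f1    # names of the winners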
The invention selects along these two dimensions, much like the service windows of a bank: it slots work into whichever window comes free, designates no particular queue, puts pressure on no particular queue, and makes use of the leftover scraps of resource.
2) Dynamics of the model processing flow (dispatch master module).
The scheduling workflow of this system is: p models are processed in parallel; after m of them finish, m queues are selected again by the method of 1) to process m unprocessed models. The analogy is a swimming pool: one main lane handles the processing common to all models (such as the Oracle-to-Hive data synchronization performed before any model runs), and there are p other lanes; whenever a swimmer finishes, the electronic coach lets the next swimmer take that lane.
The method adopted by the dispatch master module differs from Azkaban. The flow is: first, call the Oracle-to-Hive synchronization module daily to synchronize the parameter tables; second, read several tasks from the user model task log table d via shell scripts while calling the queue selection module to obtain several idle queues; third, call the model template storage processing module in parallel, passing in the parameters; finally, synchronize the processing result e_i back to Oracle.
Clearly, the dispatch master module replaces the Azkaban workflow configuration file with a database table; updates to table d affect the processing flow in real time, so model processing is dynamic.
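A minimal shell sketch of the master loop, with assumed helper functions (sync_oracle_to_hive, pick_idle_queues, run_model) and placeholder credentials; it is also simplified in that it refills a whole batch at a time, whereas the patent refills lanes as soon as m of p models finish:

    #!/bin/bash
    P=7                                  # parallelism, i.e. number of lanes

    sync_oracle_to_hive                  # daily parameter-table sync (assumed helper)

    while :; do
      # take up to $P pending tasks (user id, model id, version, date) from table d
      tasks=$(sqlplus -s etl/secret@ORCL <<'SQL'
    SET PAGESIZE 0 FEEDBACK OFF HEADING OFF
    SELECT user_id||','||model_id||','||version_no||','||process_date
    FROM   user_model_task_log_d
    WHERE  start_time IS NULL AND ROWNUM <= 7;
    SQL
    )
      [ -z "$tasks" ] && break           # table d exhausted: all models processed

      mapfile -t queues < <(pick_idle_queues "$P")  # most idle queues (assumed helper)
      i=0
      while IFS=',' read -r uid mid ver dt; do
        run_model "$uid" "$mid" "$ver" "$dt" "${queues[$i]}" &  # one lane per task
        i=$((i+1))
      done <<<"$tasks"
      wait                               # let this batch finish, then fetch more
    done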
Table 2 is a schematic of one run. From bottom to top it shows one main lane and 7 sub-lanes: lane 0, lane 1, …, lane 6 (lanes 1 through 5 are omitted for reasons of space).
TABLE 2
From bottom to top, sequence numbers 1 to 3 are the master program's start-up operations and the Oracle-to-Hive data synchronization; numbers 4 to 15 are "lane 0" through "lane 6", seven parallel scheduled runs for models M01, M11, M21, M31, M41, M51 and M61 respectively. Each model's run comprises: model start timing, model processing, Hive-to-Oracle data synchronization, and model end timing; every step carries a start time, an end time, a duration, and a success/failure flag.
In the "main lane" rows such as numbers 21 and 31, the electronic referee determines which lanes are free and can be assigned to new model runs.
3) Data transfer between the Hive data warehouse and the Oracle database.
Oracle-to-Hive data transfer (other databases are similar) uses sqlplus + Hive "load data".
Hive-to-Oracle data transfer uses hive -f + sqlldr, which greatly improves efficiency.
These two parts are illustrated by the flow charts of figures 2 and 3 respectively.
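Hedged sketches of the two directions; logins, paths, delimiters and table names are placeholders, the sqlldr control file is assumed to exist, and hive -e stands in for the hive -f script files for brevity:

    # Oracle -> Hive: unload with sqlplus, then load with Hive "load data"
    sqlplus -s etl/secret@ORCL <<'SQL' > /tmp/params_c1.csv
    SET PAGESIZE 0 FEEDBACK OFF HEADING OFF
    SELECT user_id||','||group_id||','||person_id FROM user_model_params_c1;
    SQL
    hive -e "LOAD DATA LOCAL INPATH '/tmp/params_c1.csv'
             OVERWRITE INTO TABLE ods.user_model_params_c1"

    # Hive -> Oracle: materialise the result with hive, then load with sqlldr
    hive -e "INSERT OVERWRITE LOCAL DIRECTORY '/tmp/result_e1'
             ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
             SELECT * FROM dw.user_model_result_e1"
    sqlldr etl/secret@ORCL control=/opt/etl/result_e1.ctl \
           data=/tmp/result_e1/000000_0 log=/var/log/etl/result_e1.log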
Points 1) to 3) above ensure that overall task scheduling is efficient.
4) Complete logs that make problems easy to localize.
First, the run log in the database is checked; it distinguishes several states: success, model processing failure, result count below threshold, dependent data source missing, running, and pending.
The logs fall into three categories:
Master-control log: Oracle-to-Hive data synchronization, queue allocation, parallel processing, and so on.
Model log: the Hive processing of a model and the Hive-to-Oracle data transfer.
Model processing log: the Hive processing log itself.
Whether a model processing run failed, its result count fell below the threshold, or it went through successfully, the model run log and the model processing log allow the cause to be localized.
Data-transfer issues can likewise be conveniently localized through the logs.
5) The above ensures that the scheduling system is convenient to use.
The invention is therefore a dynamic, efficient and convenient platform scheduling system.
The flow chart of the dispatch master module is shown in fig. 4.
In fig. 4, "$actual_run_num = the number of models actually fetched" means: normally $actual_run_num = p or $actual_run_num = m, but toward the end fewer than m models may remain, so it is however many can actually be fetched.
The key points of the invention, and the points to be protected, are as follows:
1) Dynamics of the model processing flow.
The invention implements a dynamic model processing flow through "Linux shell + database" programming.
Azkaban is currently not dynamic, or achieves dynamism through configuration files only with difficulty.
Although Azkaban does allow a work unit that fails at one step of a workflow to be re-executed at that step, the invention achieves this through a separate "failure-dependency module".
2) Selection of the several most idle queues.
The invention combines the primary and secondary usage dimensions of the YARN queues to select suitable queues.
The prior art designates a fixed queue for each type of model.
3) Efficient data synchronization.
The invention achieves efficient transfer by combining database tools with Hive features.
The prior art completes data transfer with Sqoop or DataX; of the two, Sqoop is the more convenient to use, while DataX is bulky and complex to configure.
4) Logs that make problem localization easy.
The logs are categorized and stored by category; compared with Azkaban, problems are easier to localize.

Claims (2)

1. A big data platform dispatching system, characterized by comprising an Oracle data storage module, an Oracle-to-Hive synchronization module, a queue selection module, a dispatching main module, a model template storage processing module, a Hive data storage module and a Hive-to-Oracle synchronization module;
the Oracle data storage module includes 5 types of tables:
(1) a model template configuration table a, containing the model id, the model's Chinese name, and run-period configuration information;
(2) a user model information table b, containing user id, model id and version-number information; table b is generated on the basis of the model template configuration table a: when a user subscribes to a model, a version is generated from the timestamp at that moment and finally inserted into the table;
(3) several user model parameter tables c_i, containing group-information and personnel-information parameters;
(4) a user model task log table d, consisting of two parts: a task part, which adds a processing date to the contents of the user model information table b, and a log part, which holds the processing start time, end time and status; table d is generated automatically from the user model information table b every day;
(5) several user model processing result tables e_i;
what is stored in the model template configuration table a is the model id of the model template storage processing module, i.e. the model's English name; the template content itself exists as SQL in the model template storage processing module; table a also holds the model's Chinese name and run-period configuration information;
the Oracle-to-Hive synchronization module copies the user model parameter tables c_i from Oracle to Hive, ultimately providing them to the model template storage processing module for processing;
the queue selection module selects the several most idle queue resources and provides them, respectively, to the model template storage processing module instances running in parallel;
the model template storage processing module stores all model template SQL; using the model id, user id, version number and date parameters passed from the task table, it runs the corresponding model template SQL against the data resources of the Hive data warehouse in combination with the parameter tables, producing a user model processing result table e_i, then calls the Hive-to-Oracle synchronization module to copy the data to Oracle; when processing starts, runs and ends, it records the processing start time, end time and status in the task table;
the Hive data storage module is the Hive data warehouse, storing the user model parameter tables c_i synchronized from the Oracle database as well as the user model processing result tables e_i generated by the model template storage processing module;
the Hive-to-Oracle synchronization module reads a user model processing result table e_i and copies e_i to Oracle;
the dispatch master module is the core of the whole platform dispatching system: first, it calls the Oracle-to-Hive module to synchronize data; second, it reads tasks from the user model task log table d of the Oracle data storage module; third, it takes idle queues from the queue selection module; fourth, it runs several models in parallel through the model template storage processing module;
the processing procedure of the big data platform dispatching system is as follows:
step 1: a Linux scheduled task invokes the dispatch master module at a preset time every day;
step 2: the dispatch master module calls the Oracle-to-Hive synchronization module, which reads the user model parameter tables c_i from the Oracle data storage module and copies all c_i to the Hive data storage module;
step 3: the dispatch master module reads the user model task log table d of the Oracle data storage module and, according to the parallelism, takes several tasks each consisting of user id, model id, version number and date; the queue selection module then selects the several most idle queue resources and assigns one to each task, and the model template storage processing module is called in parallel to process the tasks; when the model template storage processing module finishes processing, it calls the Hive-to-Oracle synchronization module to copy the data of the corresponding user model processing result table e_i to Oracle and updates the log part of the user model task log table d;
step 4: at regular intervals, the dispatch master module checks whether any tasks have finished; if so, it re-reads the user model task log table d of the Oracle data storage module and promptly tops up with new tasks, the queue selection module again selects the most idle queue resources and assigns them to those tasks, and the model template storage processing module is called in parallel to process them; this cycle repeats until all tasks in the user model task log table d have been processed.
2. The big data platform dispatching system of claim 1, wherein the status takes one of three values: processing, success, and failure.
CN202410171230.8A 2024-02-06 2024-02-06 Big data platform dispatching system Pending CN118012954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410171230.8A CN118012954A (en) 2024-02-06 2024-02-06 Big data platform dispatching system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410171230.8A CN118012954A (en) 2024-02-06 2024-02-06 Big data platform dispatching system

Publications (1)

Publication Number Publication Date
CN118012954A true CN118012954A (en) 2024-05-10

Family

ID=90942544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410171230.8A Pending CN118012954A (en) 2024-02-06 2024-02-06 Big data platform dispatching system

Country Status (1)

Country Link
CN (1) CN118012954A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination