Disclosure of Invention
In order to solve the defects in the prior art, the invention aims to provide a modular ETL task processing system and an ETL task processing method for a data governance platform.
The modularized ETL task processing system provided by the invention is a task processing system integrating data extraction, conversion, verification, loading, monitoring and authority control, various components are flexibly combined through visual component arrangement to complete complex data processing logic, the system can be used by users with different technical levels, a primary user can use the system only by configuring the system according to default parameters provided by the system, and a senior user can complete complex business logic processing through programming components such as hundreds of built-in functions, JAVASCRIPT, SQL and the like provided by the system. The system operation process is visualized, the operation process of each component is monitored in real time in detail, the operation state, the overall progress and the input/output record number of each component can be seen through a unified monitoring interface, and meanwhile, the historical task execution condition can be traced, replayed and checked for errors.
The invention provides a modular ETL (extract transform load) task processing system for a data management platform, which comprises a data source management module, a configuration management module, a task monitoring module and a data analysis module;
the data source management module provides support for data source configuration required by task operation, and comprises: the method comprises the following steps of configuring a database and FTP (file transfer protocol), wherein the configuration of the database comprises the configuration of the type of the database, the configuration of a connection mode, the configuration of a login account and the configuration of a connection pool, and the configuration of the FTP comprises the configuration of a protocol, the configuration of the login account and the configuration of data transmission codes;
the configuration management module provides support for global configuration items required by task running, and the support comprises the following steps: dictionary resource configuration, mail server configuration and verification rule configuration, wherein the mail server configuration comprises the following steps: SMTP (simple mail transfer protocol) server configuration, SMTP port configuration, SMTP security protocol configuration and login account configuration, wherein the verification rule configuration comprises the following steps: data type configuration, field length configuration, dictionary value configuration, data range configuration and regular expression configuration;
the task management module is a system main module and is used for performing task management configuration, including data migration management, online drawing management, task copy management and locking/unlocking management;
the task monitoring module is a monitoring interface in task running and is responsible for monitoring task execution progress and providing an error early warning interface, and the task monitoring comprises task execution log playback, each component operation log playback, data sampling, visual task progress monitoring and error data extraction;
the data analysis module provides incidence relation analysis for the task related fields and provides a visual graphic analysis interface for the data after the task is executed, and the data analysis comprises the following steps: analyzing the association relation of fields and tasks, analyzing the association relation of fields and data sources, and executing log analysis.
The invention also provides an ETL task processing method realized by utilizing the ETL task processing system, which comprises the following steps:
step one, selecting ETL task components required by task execution according to task requirements and initializing;
selecting data required by the pull task from a data source;
step three, calculating the number of required threads, distributing the initialized ETL task components to each thread and starting each thread, distributing the data in the step two to each thread according to the requirement and processing by utilizing the ETL components in the threads;
and step four, storing the processed data in a database for use in subsequent links, or updating the task state and repeating the operation.
The task flow is designed by a graphical user-defined flow, a visual attribute editing mode and batch attribute import/export; the process design area supports operations such as copying, cutting and pasting, returning, advancing and the like; data stream transfer, path branching and the like among the components are controlled in a graphical dragging mode, and data stream distribution, copying operation and the like are supported among different components; and complex business logic can be completed among the components through free combination.
In the first step, the ETL task component comprises five categories of an input class component, a conversion class component, an output class component, a flow class component and an application class component;
the input type component is responsible for reading a data source and generating input streams, and comprises a table input component, an Excel input component, a TXT input component, a fixed-length text input component, an XML input component, a constant input component, a read file list component and a result set input component;
the conversion component is responsible for various conversions of data streams, and completes the processing of service logic through the permutation and combination of different components, including a table output component, a table insertion/update component, an Excel output component, a TXT output component, a fixed-length text output component, an XML output component, a result set output component and a variable setting component;
the output assembly is responsible for storing the input stream to a data source and comprises a connecting assembly, an aggregation assembly, a sorting assembly, a cleaning assembly, a duplicate removal assembly, a row-to-column assembly, a column-to-row assembly, a formula assembly, a difference comparison assembly and an input stream merging assembly;
the flow class component is responsible for condition judgment, state conversion and path selection operation; the system comprises a checking component, a Boolean filtering component, an enumeration filtering component, a condition suspension component, a step waiting component and a data delay component;
the application component is responsible for processing various types of single service data and does not relate to data stream input, and comprises an SQL execution component, a storage process component, a SHELL component, an SSH component, an HTTP component, an Email sending component, an FTP downloading component, an FTP uploading component, a file decompressing component, a file compressing component, a file changing and encoding component, a time delay component, a file checking component, a file creating component, a file deleting component, a file transferring component and a JavaScript component.
In the second step, the data source comprises a database, JSON, TXT, RESTAPI, XML, CSV, Excel and a fixed-length text; the data source in the invention can be widely used in various industries and has no specific source requirement limitation.
In the third step, the thread can flexibly select the number of threads, the execution sequence of the threads and the execution strategy according to the service requirement;
the ETL task system calculates different thread branches according to the arrangement condition of each component in the task, then stores the component to which each thread branch belongs into different thread parameters, namely a HashMap structure, and respectively creates different threads according to the thread parameters after the system is started to initialize the components and run the corresponding task.
The processing method also comprises a task monitoring link, wherein the operation process is monitored through the flow of the whole processing method; the task monitoring link comprises task execution log playback, component operation log playback, visual task progress monitoring, data sampling and error data extraction; during the task running period, the current task execution progress, the running condition of each thread and the execution condition of the input and output data stream of each component can be checked through a visual interface; and during the operation process, each component can perform data sampling in real time and store the data for verifying the data accuracy of each component.
The task execution log playback means that a detailed log of the task during the whole task running period can be viewed in detail after the task execution is finished, and the log content comprises time, state and prompt information, as shown in fig. 9.
The playback of the running log of each component refers to clicking each component after the execution of the task is finished to check a detailed log of the component in the whole running period of the task, wherein the log content comprises time, state and prompt information, and is shown in fig. 7.
The visual task progress monitoring refers to checking the running states, the progress and the error information of all tasks through a task monitoring interface, as shown in fig. 11, and mastering the running state of the whole task system through the monitoring interface.
The data sampling refers to the data sampling of each component, and whether the parameters set during the operation of the component are effective or not is checked and debugged, as shown in fig. 7.
During the task running process, if some data do not accord with the component parameter rules, the error data are extracted into a log table for the monitoring interface to view in a unified way, as shown in fig. 12.
The API, the operation buttons and the list data related in the running process of the task system are controlled by the authority subsystem, and only a user with authority can operate the corresponding buttons, the API and the data. The system can automatically configure system modules and operation buttons according to the authority owned by the user, and the modules and operation buttons without the authority can be automatically removed.
The beneficial effects of the invention include: the ETL task processing system provided by the invention adopts a graphical, componentized and flow low-code development mode to reduce the use threshold of a user, and the operations of conversion, aggregation, filtration, cleaning and the like of business data are more transparent, flexible and convenient through a self-defined flow. Meanwhile, the system provides what you see is what you get interactive experience, so that the user is liberated from complicated system configuration, and can be more focused on the design of business top level design and data logic processing.
Detailed Description
The present invention will be described in further detail with reference to the following specific examples and the accompanying drawings. The procedures, conditions, experimental methods and the like for carrying out the present invention are general knowledge and common general knowledge in the art except for the contents specifically mentioned below, and the present invention is not particularly limited.
The invention provides a modular ETL task processing system for a data governance platform and an ETL processing method realized by using the system.
The processing method comprises the steps of selecting ETL task components required by task execution according to task requirements and initializing the ETL task components; selecting data required by a pull task from a data source; calculating the number of required threads, distributing the initialized ETL task components to each thread and starting each thread, distributing the data to each thread according to the requirement, and processing by utilizing the ETL components in the threads; and storing the processed data in a database for subsequent links to use, or updating the task state, and repeating the operation to complete the flow of the whole processing method.
In the present invention, the start of multithreading when the task starts may be, for example, as shown in fig. 4, start 2 threads to respectively complete the tasks of TXT input and table input, compare the input data contents, after each thread starts, wait for each thread to complete the task or continue to enter the next component, and the process will cycle until the task ends.
In the specific full-flow operation process, as illustrated in fig. 5, the system starts two threads: thread 1 is responsible for loading the TXT text, thread 2 is responsible for loading data from the data sheet, then two threads finish the task and hand over the data to the main thread respectively and destroy automatically after keeping, at this moment the main thread can start thread 3 here and finish the difference contrast, then continue to run the routing module, the routing module will start 3 threads after judging according to the attribute that the module has set up and combining the current data afterwards, these 3 threads finish newly adding, deleting, relevant business logic modified respectively, upgrade the whole task state after 3 threads finish all executing, destroy 3 threads finally, until the task runs and finishes.
The ETL task processing system stores task parameters, component parameters, a component arrangement diagram and a thread sequence diagram into a database in a JSON format at a task starting stage, generates an independent UUID for the database, and updates the component parameters, sample data and a task execution state according to the UUID in the system operation process, so that a task execution log playback function can be realized, for example, as shown in FIG. 7; in fig. 7, four components, namely, a formula component, an enumeration filter component, a null operation component, and a table output component, are used, where a line of data is generated using the [ formula ] component, including a, integer: 1, b, date and time type: 2021-12-0300:00:00, c, string: the three fields of '1' are filtered out by a [ enumeration filtering ] component, the value of a being 1 is sent to a [ idle operation ] component, and finally the value is stored in a database by a [ table output ] component, in the process, each component can check the flowing sampling data (the first 10 pieces) and the number of input/output records, and if the component makes an error in the processing process, the component can be printed in a delivered log in real time for an operator to check.
According to the design in the task execution playback, the ETL task system restores JSON data such as task parameters, component arrangement diagrams, thread sequence diagrams and the like into a system design interface after acquiring the JSON data according to UUIDs, sets different state colors according to the actual operation result of each component, and points each component to display the detailed operation data of the component, wherein the method comprises the following steps: inputting the number of records, outputting the number of records, status log and other data. The system divides the task running state into: the 4 states of waiting to be executed, in progress, success and failure are displayed, and real-time statistics are shown in a form of a visual chart, as shown in fig. 8.
The log execution part of each component can print various logs in the operation stage to assist a user to complete task design, the logs can be formatted and then stored in a database so as to be used in visual task progress monitoring and task execution log playback, a system provides log options of four levels of ERROR, WARN, INFO and DEBUG which respectively correspond to logs with different detailed degrees, and the user can select different log levels according to service types, such as: the task with the application class component as the main component may select a DEBUG type, the task with the input and output class components as the main component may select an ERROR type, and the system may print log information of different levels according to the set levels in the running stage, for example, as shown in fig. 9.
The present invention is described below with a specific service example, as shown in fig. 10.
Step 1: and importing the txt file into the memory from the disk.
Step 2: and importing the customer in the database into the memory.
And 3, step 3: comparing the data generated in the step 1 and the step 2, and dividing the comparison result into: adding, modifying and deleting 3 types.
And 4, step 4: and storing the comparison result into a database diff _ result table for subsequent processing.
Detailed process description: after a task is started, 2 threads are started firstly, the two threads are responsible for importing txt files into a memory from a disk and importing a customer table in a database into the memory, after the txt files and the customer table are imported, data in the memory are contrasted according to rules set in a difference contrast assembly, and the contrast result is divided into: adding, modifying and deleting 3 categories, wherein the classification field is represented by Result, after the classification is finished, distributing different classification results to different path branches according to a rule set by a component (actually, an enumeration filtering component) according to a state rule path, the adding, modifying and deleting components represent different path branches, the 3 branches are used for storing the classified data in a database diff _ Result table for subsequent processing by a table output component (storing newly added data, storing modified data and storing deleted data, actually being alias of the table output component), after the table output components on the 3 paths are finished, the main thread is used for automatically destroying the updated task state successfully and returning the state to a front end interface, so that the whole task execution is finished, and when an error is encountered during the task execution, the main thread sends a stop signal to each task, and after the task thread is finished, the main thread updates the task state to failure and returns the state to the front-end interface.
The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept, which is set forth in the following claims.