CN112966039A

CN112966039A - Front-end and back-end separation execution method based on ETL engine

Info

Publication number: CN112966039A
Application number: CN202110293087.6A
Authority: CN
Inventors: 程永新; 宋辉; 郭振宇
Original assignee: Shanghai New Century Network Co ltd
Current assignee: Shanghai New Century Network Co ltd
Priority date: 2021-03-18
Filing date: 2021-03-18
Publication date: 2021-06-15
Anticipated expiration: 2041-03-18
Also published as: CN112966039B

Abstract

The invention discloses a front-end and back-end separation execution method based on an ETL engine, which comprises the following steps: S1) separating the front-end UI layer of Kettle from the back-end core layer, engine layer and resource library, and integrating the separated core layer and engine layer into a WEB container; S2) making the separated resource library into an independent module for centralized management; S3) classifying the core layer functions and packaging them into independent services that provide interfaces to the outside, and introducing ZooKeeper and Dubbox to realize distributed ETL services; S4) using mxGraph as the UI layer to realize web-based drag-and-drop development, changing the C/S architecture of Kettle into a B/S architecture. The method provided by the invention is convenient to install and use, makes web-based drag-and-drop development easy to realize, and offers stronger extensibility and applicability.

Description

Front-end and back-end separation execution method based on ETL engine

Technical Field

The invention relates to a data warehouse technology, in particular to a front-end and back-end separated execution method based on an ETL engine.

Background

Information is an important resource of modern enterprises and the basis of their scientific management and decision analysis. Currently, most enterprises spend a great deal of money and time building online transaction processing (OLTP) business systems and office automation systems to record the various data related to their transactions.

According to statistics, data volume multiplies every 2-3 years. These data contain huge commercial value, yet the data that enterprises actually pay attention to usually account for only about 2%-4% of the total. As a result, enterprises still do not make full use of their existing data resources; they waste time and money and lose the best opportunity to make critical business decisions. How to convert data into information and knowledge through technical means has therefore become a major bottleneck in improving the core competitiveness of enterprises, and ETL is one of the main technical approaches.

ETL is an acronym for three words: "Extract", "Transform" and "Load".

"Extract": reading data from the various original business systems, which is the prerequisite for all subsequent work.

"Transform": converting the extracted data according to pre-designed rules, so that originally heterogeneous data formats are unified.

"Load": importing the transformed data into the data warehouse, either incrementally or in full, as planned.

ETL is the core and soul of BI/DW (business intelligence/data warehouse); it integrates data according to unified rules and improves the value of the data. ETL is responsible for moving data from the data source to the target data warehouse and is an important step in implementing a data warehouse. However, traditional ETL tools are desktop applications that adopt the C/S architecture. Compared with a traditional WEB application, the prior art has the following defects:

1) Inconvenient to install and use: the traditional ETL tool must be installed and can only be started on a local computer, so it cannot be put into production in a network-isolated environment.

2) The UI layer cannot be made into a web version: the workflow of a traditional ETL tool is designed by graphical drag and drop, and because ETL workflow definitions and parameters are very complex, graphical drag and drop is very difficult to implement in a web page, and a web version cannot meet the requirements of all drag-and-drop workflow definitions.

3) The traditional ETL tool is a stand-alone version and can be operated by only one person. Its UI is rough and complicated, and its usability is extremely poor.

4) Traditional vertical architecture with tight coupling, which is difficult to extend: the workflow designer and the execution engine of the traditional ETL tool are integrated, so the code is difficult to extend.

5) Multi-person team cooperation is difficult to support: because of the C/S architecture design of the ETL tool, two users cannot operate on and design the same ETL workflow at the same time. The source code controls access with a global lock: when one user operates on a transformation or job, the related logical table is locked, and when other users access the same resource library from their local machines they contend for the same table, so lock waits or deadlocks occur and multiple users cannot work normally at the same time.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a front-end and back-end separation execution method based on an ETL engine that is convenient to install and use, makes web-based drag-and-drop development easy to realize, and has stronger extensibility and applicability.

The technical scheme adopted by the invention to solve this problem is a front-end and back-end separation execution method based on an ETL engine, which comprises the following steps: S1) separating the front-end UI layer of Kettle from the back-end core layer, engine layer and resource library, and integrating the separated core layer and engine layer into a WEB container; S2) making the separated resource library into an independent module for centralized management; S3) classifying the core layer functions and packaging them into independent services that provide interfaces to the outside, and introducing ZooKeeper and Dubbox to realize distributed ETL services; S4) using mxGraph as the UI layer to realize web-based drag-and-drop development, changing the C/S architecture of Kettle into a B/S architecture.

Further, the core layer of Kettle extracted in step S1 includes the functions of a general factory, general tools, program lifecycle monitoring, exception handling, log factory management, connection protocols, a plug-in factory and data source management; the extracted engine layer of Kettle includes plug-in registration management, execution log management, a Job execution engine and a transformation execution engine.

Further, step S1 imports the core jars of the core layer and the engine layer into the springboot2 framework through Maven's dependency management, so that the springboot2 microservice can directly use the core jar functions of the core layer and the engine layer; step S3 classifies and abstracts the core functions, including cluster management, Job management, transformation management, data source management and directory management, writes them into the Service layer as services, publishes the services to the ZooKeeper registration center, and implements distributed services through Dubbo's RPC protocol; the core layer and engine layer functions of the ETL are then called by introducing the interface classes of the distributed services and the Dubbo consumer configuration file.

Further, the resource library separated in step S2 is divided into a file resource library and a database resource library, the file resource library stores all the nodes of the ETL as an XML file, the XML file stores the Job parameter, the transformation parameter, the connection parameter, the log parameter, the basic parameter, the cluster parameter, the sharing parameter and the execution parameter, and different parameters are distinguished by different tags; the database resource library adopts a relational database, and splits and stores various parameters into different tables.

Further, the front-end UI layer in step S4 is implemented by using a combination of vue and mxGraph, a plurality of graph nodes in the ETL flow are connected by straight lines to form a pipeline, each graph node represents a data processing node, and different nodes implement different data processing logics.

Further, the front-end UI layer in step S2 further extends and implements resource library directory management, directory authorization, ETL real-time monitoring, log monitoring, and historical log management functions on the basis of basic functions, and the UI layer configures a service interface for invoking ETL reading/adding/updating/deleting/executing/suspending/terminating/monitoring/history operations provided by the execution engine layer and acquiring ETL data.

Further, the interaction process between the front-end UI layer and the back-end core layer, engine layer and resource library is as follows: a) after configuration is completed in the workflow designer of the front-end UI layer, the ETL workflow definition information is generated as XML; b) the front-end UI layer transmits the XML to the back-end engine layer through a Dubbo interface; c) the back-end engine layer calls the API of the core layer to check the correctness of the XML, stores the XML in the independent resource library and returns the storage state to the UI layer; d) the UI layer initiates an ETL execution task through a Dubbo interface; e) the back-end engine layer receives the instruction and starts the ETL work; f) the UI layer monitors the working process and state of the ETL.

Further, on the basis of the Web application, access is controlled by adding a state mark in place of the global-lock code, concurrent access to the same resource is supported, and a check-in and check-out mechanism is used to detect whether there is a usage conflict on the same resource.

Compared with the prior art, the invention has the following beneficial effects: in the front-end and back-end separation execution method based on the ETL engine, the Kettle front-end UI is separated from the Kettle back-end engine, a new UI layer is introduced, and the C/S architecture of Kettle is changed into a B/S architecture, so that once the Web application is deployed in one place, other machines can access and use it by entering the access address; meanwhile, distributed high availability is realized through ZooKeeper and Dubbox, multi-person team collaboration is supported, ETL data of multiple projects can be managed at the same time, and working efficiency is improved.

Drawings

FIG. 1 is a schematic diagram of the main transformation of ETL-based data;

FIG. 2 is a schematic diagram illustrating a front-end and back-end separation execution flow based on an ETL engine according to the present invention;

FIG. 3 is a diagram showing the relationship between the front end and back end of Kettle after separation, as used in the present invention.

Detailed Description

The invention is further described below with reference to the figures and examples.

Kettle is one of the typical representatives of ETL tools. It is written in pure Java (based on a Java OSGi architecture), can be deployed and run across platforms on Windows, Linux and Unix, and is efficient and stable in data extraction. The main transformations it implements are shown in FIG. 1.
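The extract, transform and load stages that such a tool chains together can be illustrated with a minimal sketch. It is not Kettle code: the function names, the `sales` table and the sample rows are assumptions made for illustration, and an in-memory SQLite database stands in for the target data warehouse.

```python
import sqlite3

def extract(rows):
    """Extract: read raw records from a (stand-in) source system."""
    return list(rows)

def transform(records):
    """Transform: unify heterogeneous formats according to a fixed rule."""
    unified = []
    for rec in records:
        unified.append({
            "customer": rec["customer"].strip().title(),
            "amount": round(float(rec["amount"]), 2),
        })
    return unified

def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (customer, amount) VALUES (:customer, :amount)", records
    )
    conn.commit()

# Hypothetical source rows with inconsistent formatting.
source_rows = [
    {"customer": "  alice ", "amount": "10.50"},
    {"customer": "BOB", "amount": 3},
]
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source_rows)))
result = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(result)
```

Each stage takes the output of the previous one, which is the same chaining that a graphical ETL workflow expresses with connected nodes.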

Kettle has the following main functional characteristics:

1. Support for multiple data sources

Kettle supports a wide variety of data sources, including databases, file systems, Excel, XML, LDAP, SOAP/Web services, CSV files and RSS. The supported databases include mainstream databases such as DB2, Oracle, MySQL, MS SQL Server and Sybase. Kettle encapsulates access to each of these data sources, so developers only need to drag the corresponding component onto the console. Database connections support cluster and database-partition access.

Database connections can be made via JDBC, ODBC and JNDI, and a connection pool is provided, which can greatly improve database access efficiency.

2. Rich core components

The Kettle components fall into two main categories, Job core components and transformation core components; the key components are as follows:

3. Support for multitask concurrency and for extraction and transformation of large data volumes, with high execution efficiency

Kettle supports multitask concurrency; the degree of concurrency can be configured in the interface and set for each component.

4. Mature exception handling flow

Kettle provides rich exception handling functions, including a large number of exception handling components, and can meet the handling requirements of a wide range of exception conditions.

5. Good integration with Java applications

Kettle itself is developed in Java and can be seamlessly integrated with a Java application, which can call Kettle scripts directly. Kettle also provides a set of Java interfaces through which the application can control the execution of Kettle and monitor its execution information and results.

As can be seen from the above, Kettle makes the whole process of moving data from the data source to the target data warehouse stable and efficient. The invention therefore selects Kettle as the ETL tool; the front-end and back-end separation execution method of its engine, as shown in FIG. 2, comprises the following steps:

S1) separating the front-end UI layer of Kettle from the back-end core layer, engine layer and resource library, and integrating the separated core layer and engine layer into a WEB container;

S2) making the separated resource library into an independent module for centralized management;

S3) classifying the core layer functions and packaging them into independent services that provide interfaces to the outside, and introducing ZooKeeper and Dubbox to realize distributed ETL services;

S4) using mxGraph as the UI layer to realize web-based drag-and-drop development, changing the C/S architecture of Kettle into a B/S architecture.

The invention splits the architecture model of the ETL tool: the UI layer is separated from the core layer, the engine layer and the resource library, and the UI layer of the original C/S architecture is discarded. The core layer and the engine layer are integrated into a WEB container (Tomcat is used in the invention), split by function, combined with high-availability and distributed design ideas, and packaged into independent services that provide interfaces to the outside. The data source layer is made into an independent module for centralized management. The UI layer uses mxGraph to realize web-based drag-and-drop development. Separating the front end from the back end, decoupling and refining the modules, and adding high availability and distribution make the system more extensible and more widely applicable. Once the web application with the B/S architecture is deployed in one place, other machines can access and use it by entering the access address. The UI layer is completed with an open-source graphics tool to realize drag-and-drop development. With the B/S architecture, multi-person team cooperation is easy to realize: access is controlled by adding a state mark in place of the global-lock code, and a check-in and check-out detection mechanism effectively avoids conflicts over the same resource when several people operate at the same time and avoids the avalanche effect caused by locking.

The following provides a specific process of splitting and integrating the ETL architecture model according to the present invention:

One, Back-end core layer and engine layer

The core layer and the data source layer are extracted and integrated into a back-end service container (this platform integrates them into the springboot2 microservice framework). Functions and modules are refined, split and packaged into independently released services, with the upper UI layer acting as the consumer. The core layer (core) includes functions such as a general factory, general tools, program lifecycle monitoring, exception handling, log factory management, connection protocols, a plug-in factory and data source management. The engine layer (engine) includes plug-in registration management, execution log management, a Job execution engine, a transformation execution engine and so on.

The core jars of core and engine are first imported into the springboot2 framework through Maven's dependency management; from this point, the springboot2 microservice can directly use the functions of the core and engine jars.

The core functions, including cluster management, Job management, transformation management, data source management and directory management, are then classified, abstracted and written into the Service layer as services. Next, ZooKeeper + Dubbo is introduced: the services are published to the ZooKeeper registration center, and distributed services are realized through Dubbo's RPC (remote procedure call) protocol. With ZooKeeper + Dubbo, the service cluster can be expanded conveniently, and load balancing, service degradation and failover can be added to ensure high concurrency and high availability.

In use, any project that needs to call the core layer and engine layer functions of the ETL only has to introduce the interface classes of the distributed services and the Dubbo consumer configuration file, after which it can call those functions directly.

Encapsulating the underlying core layer of the engine and exposing it as external services makes the B/S architecture and distributed deployment possible.

1) ZooKeeper

ZooKeeper is an open-source coordination service for distributed applications, an open-source implementation of Google's Chubby, and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming services, distributed synchronization and group services. In this technical scheme it serves as the registration center for Dubbo.

2) Dubbo

Dubbo is a distributed service framework open-sourced by Alibaba. Its most notable feature is its layered structure, which allows the layers to be decoupled (or loosely coupled to the greatest extent). From the service model point of view, Dubbo uses a very simple model: a party either provides a service or consumes a service, so two roles can be abstracted, the service Provider and the service Consumer. In addition, Dubbo covers the registry, protocol support, service monitoring and related concerns.

Compared with a traditional interface, a distributed service interface divides services more finely and makes them more independent: the interfaces do not affect one another, and stopping or updating one interface does not affect calls to the others, which realizes the idea of high cohesion and low coupling and achieves high availability. The relationship after separation of the front and back ends is shown in FIG. 3. ConfigServer is the registration center. The Server (back end) integrates the core layer and engine layer of the ETL, packages the APIs of the core functions into services and publishes them to the registration center. The Client is a project with a user-defined UI that has service requirements; it subscribes to the services it needs at the registration center, obtains the interface URL, and then calls the Server according to that URL to obtain the service.
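The register, subscribe and call relationship among the registration center, the Server and the Client can be sketched in a few lines. This is a toy in-memory model, not the ZooKeeper or Dubbo API: the class names, the `JobService` service name and the addresses are assumptions made for illustration, and a dictionary stands in for the network. It also shows round-robin load balancing with failover, two of the capabilities the text attributes to the service cluster.

```python
import itertools

class Registry:
    """Toy stand-in for a registration center: service name -> provider addresses."""
    def __init__(self):
        self._providers = {}

    def register(self, service, address):
        self._providers.setdefault(service, []).append(address)

    def lookup(self, service):
        return list(self._providers.get(service, []))

class Provider:
    """A back-end node exposing one classified core function as a service."""
    def __init__(self, address, registry):
        self.address = address
        self.alive = True
        registry.register("JobService", address)  # publish on start-up

    def run_job(self, job_name):
        if not self.alive:
            raise ConnectionError(f"{self.address} is down")
        return f"{job_name} started on {self.address}"

class Consumer:
    """A UI-side client: looks up a service and calls providers in turn."""
    def __init__(self, registry, nodes):
        self._registry = registry
        self._nodes = nodes  # address -> Provider, stands in for the network
        self._counter = itertools.count()

    def run_job(self, job_name):
        addresses = self._registry.lookup("JobService")
        start = next(self._counter)
        # Round-robin over providers; on failure, try the next one (failover).
        for offset in range(len(addresses)):
            address = addresses[(start + offset) % len(addresses)]
            try:
                return self._nodes[address].run_job(job_name)
            except ConnectionError:
                continue
        raise RuntimeError("no provider available for JobService")

registry = Registry()
nodes = {addr: Provider(addr, registry) for addr in ("10.0.0.1:20880", "10.0.0.2:20880")}
client = Consumer(registry, nodes)
first = client.run_job("daily_load")      # served by the first provider
nodes["10.0.0.2:20880"].alive = False     # simulate a failed node
second = client.run_job("daily_load")     # falls over to the healthy provider
print(first)
print(second)
```

Because the client only depends on the service name and the registry, providers can be added or stopped without changing the calling project.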

Two, Independent resource library

The definition information of the ETL workflow needs to be managed uniformly and independently, and the resource library is divided into two categories.

1. The file resource library stores definitions as files in XML form: Job parameters, transformation parameters, connection parameters, log parameters, basic parameters, cluster parameters, sharing parameters, execution parameters and so on are integrated into one file and distinguished by different tags, such as transformation <caps>, Job <job>, basic parameters <info>, cluster parameters <clusterschemas>, connection parameters <connection> and ordering parameters <order>. Its advantages are that it is simple and easy to read, and because all nodes of the ETL are saved in one file, saving and reading a whole ETL definition is very fast; its disadvantage is that sharing files is troublesome.

2. The database resource library uses a relational database as storage and splits the various parameters in the ETL definition, such as transformation <caps>, Job <job>, basic parameters <info>, cluster parameters <clusterschemas>, connection parameters <connection> and ordering parameters <order>, into different tables. For example, the definition parameters of Job <job> are stored in tables such as JOB, JOB_ATTRIBUTE, JOB_HOP, JOBENTRY_ATTRIBUTE, JOBENTRY_DATABASE and JOBENTRY_TYPE. Its advantage is that it naturally supports distributed storage and data sharing. Its disadvantage is that, because the parameters are spread across different tables, querying and saving are slower than with a file.

The invention combines the advantages of the two kinds of resource library: resources are stored as files in the Hadoop file system to achieve unified storage and high availability, and a unified service interface for reading, adding, updating and deleting ETL files can be developed to provide services to the outside in a uniform way.

The independent resource library provides good support for resource sharing, distribution, team cooperation and so on.
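The trade-off between the two kinds of resource library can be made concrete with a small sketch. The XML document below is a hypothetical, heavily simplified job definition that uses the <info>, <connection> and <order> tags named above; the table names JOB and JOB_HOP come from the text, while JOB_CONNECTION and all column names are assumptions made for illustration. SQLite stands in for the relational database.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified job definition (not a real Kettle file).
JOB_XML = """
<job>
  <info><name>daily_load</name></info>
  <connection><name>warehouse</name><type>MYSQL</type></connection>
  <order><hop><from>START</from><to>load_sales</to></hop></order>
</job>
"""

# File resource library: the whole definition is one document, read in one pass.
root = ET.fromstring(JOB_XML)
job_name = root.findtext("info/name")

# Database resource library: the same parameters are split across several tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE JOB (ID_JOB INTEGER PRIMARY KEY, NAME TEXT);
    CREATE TABLE JOB_HOP (ID_JOB INTEGER, HOP_FROM TEXT, HOP_TO TEXT);
    CREATE TABLE JOB_CONNECTION (ID_JOB INTEGER, NAME TEXT, TYPE TEXT);
""")
job_id = db.execute("INSERT INTO JOB (NAME) VALUES (?)", (job_name,)).lastrowid
for hop in root.iterfind("order/hop"):
    db.execute("INSERT INTO JOB_HOP VALUES (?, ?, ?)",
               (job_id, hop.findtext("from"), hop.findtext("to")))
for conn in root.iterfind("connection"):
    db.execute("INSERT INTO JOB_CONNECTION VALUES (?, ?, ?)",
               (job_id, conn.findtext("name"), conn.findtext("type")))

# Reading the definition back now needs joins instead of a single file read.
rows = db.execute("""
    SELECT j.NAME, h.HOP_FROM, h.HOP_TO, c.NAME
    FROM JOB j
    JOIN JOB_HOP h ON h.ID_JOB = j.ID_JOB
    JOIN JOB_CONNECTION c ON c.ID_JOB = j.ID_JOB
""").fetchall()
print(rows)
```

The file form needs a single parse to recover everything, whereas the table form needs several writes and a join, which is the performance difference the text describes; in exchange, the tables can be shared and queried by many clients.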
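A unified read/add/update/delete interface over file-based definitions could look like the following sketch. The class name and method names are assumptions made for illustration, and a local temporary directory stands in for the Hadoop file system; a real deployment would swap in an HDFS client behind the same interface.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

class EtlFileRepository:
    """Hypothetical unified service interface over a file-based resource library."""
    def __init__(self, root_dir):
        self._root = Path(root_dir)

    def _path(self, name):
        return self._root / f"{name}.xml"

    def add(self, name, xml_text):
        path = self._path(name)
        if path.exists():
            raise FileExistsError(name)
        ET.fromstring(xml_text)  # reject definitions that are not well-formed XML
        path.write_text(xml_text, encoding="utf-8")

    def read(self, name):
        return self._path(name).read_text(encoding="utf-8")

    def update(self, name, xml_text):
        path = self._path(name)
        if not path.exists():
            raise FileNotFoundError(name)
        ET.fromstring(xml_text)
        path.write_text(xml_text, encoding="utf-8")

    def delete(self, name):
        self._path(name).unlink()

with tempfile.TemporaryDirectory() as tmp:
    repo = EtlFileRepository(tmp)
    repo.add("daily_load", "<job><info><name>daily_load</name></info></job>")
    repo.update("daily_load", "<job><info><name>daily_load_v2</name></info></job>")
    stored = repo.read("daily_load")
    repo.delete("daily_load")
    remaining = sorted(p.name for p in Path(tmp).iterdir())
print(stored, remaining)
```

Callers see only the four operations, so the storage back end can change without affecting the UI layer or other services.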

Three, Front-end UI layer: ETL workflow designer

The UI layer of the ETL workflow designer is replaced with vue + mxGraph. Separating the ETL designer from the executor lets the UI layer keep up with advanced graphics technology and makes it easy to extend. The UI layer implements the ETL design function and allows graphical elements to be placed at any position. Multiple graph nodes in an ETL flow can be connected by straight lines to form a working pipeline; each graph node represents a data processing node, and different nodes implement different data processing logic, such as data acquisition, data loading, data encryption, data joining, data splitting and data transformation. Different processing directions can be set for the success and failure of a node, so that ETL processing can branch. On top of the basic functions, the UI is also extended with resource library directory management, directory authorization, real-time monitoring of ETL execution, log monitoring, historical log management and other functions. Once a service interface is configured, the UI layer can call the ETL read, add, update, delete, execute, suspend, stop, monitor, history and other operations provided by the execution engine layer and obtain ETL data, which enables rapid development and quick response to requirements.

mxGraph is a JavaScript diagramming component suitable for Web applications that need to design or edit workflow/BPM flow charts, diagrams, network diagrams and general graphics in Web pages. The mxGraph download package includes the front-end program written in JavaScript as well as several examples of integration with back-end programs (Java, C#, etc.).

Implementing the UI layer with vue + mxGraph ports the user experience of the original client to the browser.
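The pipeline of connected nodes with separate success and failure directions can be modelled as a small graph. The sketch below is not the mxGraph data model or the Kettle engine: the `Node` class, the node names and the sample data are assumptions made for illustration. A node that raises an exception routes execution to its failure successor.

```python
class Node:
    """One graph node: a processing function plus success/failure successors."""
    def __init__(self, name, func, on_success=None, on_failure=None):
        self.name = name
        self.func = func
        self.on_success = on_success
        self.on_failure = on_failure

def run_pipeline(nodes, start, data):
    """Follow the hops from `start`, branching on whether each node raises."""
    trace = []
    current = start
    while current is not None:
        node = nodes[current]
        trace.append(node.name)
        try:
            data = node.func(data)
            current = node.on_success
        except Exception:
            # The node failed: keep the input data and take the failure hop.
            current = node.on_failure
    return data, trace

def parse_amounts(rows):
    return [float(r) for r in rows]

nodes = {
    "acquire": Node("acquire", lambda _: ["1.5", "oops"], on_success="convert"),
    "convert": Node("convert", parse_amounts, on_success="load", on_failure="quarantine"),
    "load": Node("load", lambda rows: {"loaded": rows}),
    "quarantine": Node("quarantine", lambda rows: {"rejected": rows}),
}
result, trace = run_pipeline(nodes, "acquire", None)
print(trace)
print(result)
```

Here the second input row cannot be converted, so the flow is diverted to the hypothetical `quarantine` node instead of `load`.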

Four, ETL integrated process

1) After configuration is completed in the workflow designer of the front-end UI layer, the ETL workflow definition information is generated as XML.

2) The front-end UI layer transmits the XML to the back-end engine layer through the Dubbo interface.

3) The back-end engine layer calls the API of the core layer to check the correctness of the XML, stores the XML in the independent resource library and returns the storage state to the UI layer.

4) The UI layer initiates an ETL execution task through the Dubbo interface.

5) The back-end engine layer receives the instruction and starts the ETL work.

6) The UI layer monitors the working process and state of the ETL.
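The six steps above can be traced end to end in a short sketch. Everything here is an assumption made for illustration: the `Engine` class and its methods are hypothetical, direct method calls stand in for the Dubbo interface, a dictionary stands in for the independent resource library, and the XML tag names are placeholders rather than the real Kettle schema.

```python
import xml.etree.ElementTree as ET

class Engine:
    """Hypothetical back-end engine layer reached through a remote interface."""
    def __init__(self):
        self.repository = {}   # stands in for the independent resource library
        self.status = {}       # workflow name -> execution state

    def save(self, xml_text):
        # Step 3: check the XML, store it, report the storage state.
        try:
            root = ET.fromstring(xml_text)
        except ET.ParseError:
            return "REJECTED"
        name = root.findtext("info/name")
        if not name:
            return "REJECTED"
        self.repository[name] = xml_text
        return "SAVED"

    def execute(self, name):
        # Step 5: start ETL work for a stored definition.
        if name not in self.repository:
            raise KeyError(name)
        self.status[name] = "RUNNING"
        # ... the real transformation work would run here ...
        self.status[name] = "FINISHED"

    def monitor(self, name):
        return self.status.get(name, "NOT_STARTED")

def design_workflow(name, step_names):
    # Step 1: the designer serializes the workflow definition to XML.
    root = ET.Element("transformation")
    ET.SubElement(ET.SubElement(root, "info"), "name").text = name
    for step in step_names:
        ET.SubElement(ET.SubElement(root, "step"), "name").text = step
    return ET.tostring(root, encoding="unicode")

engine = Engine()
xml_text = design_workflow("daily_load", ["read_csv", "write_table"])  # step 1
save_state = engine.save(xml_text)                                     # steps 2-3
bad_state = engine.save("<transformation><info>")                      # malformed XML
engine.execute("daily_load")                                           # steps 4-5
final_state = engine.monitor("daily_load")                             # step 6
print(save_state, bad_state, final_state)
```

Validation happens before storage, so a malformed definition is reported back to the UI layer and never reaches the resource library.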

Five, User-based directory authority control

1) Role and menu authority management: in a B/S architecture system, role and user management based on enterprises or teams is easy to realize. Access to each ETL resource is authorized through roles and the menu authorities assigned to users.

2) User directory authority assignment: for the directories of the resource library, directories are authorized per user, so members who have not been authorized for a resource directory cannot see it. This improves the independence of resource allocation among team members and prevents data conflicts.

3) Data source authority assignment: role authority assignment controls whether a user can create, modify and delete data sources, and newly created data sources can be authorized to team members.

Assigning role authorities in the system makes team cooperation and team management convenient.
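A minimal sketch of how role authorities and per-user directory grants might be checked is given below. The `AccessControl` class, the action strings and the user and directory names are all assumptions made for illustration; the source does not specify a data model.

```python
class AccessControl:
    """Hypothetical role and directory authorization table."""
    def __init__(self):
        self._role_permissions = {}   # role -> set of allowed actions
        self._user_roles = {}         # user -> role
        self._directory_grants = {}   # directory -> set of authorized users

    def define_role(self, role, actions):
        self._role_permissions[role] = set(actions)

    def assign_role(self, user, role):
        self._user_roles[user] = role

    def grant_directory(self, directory, user):
        self._directory_grants.setdefault(directory, set()).add(user)

    def can(self, user, action):
        """Role check, e.g. whether a user may create a data source."""
        role = self._user_roles.get(user)
        return action in self._role_permissions.get(role, set())

    def visible_directories(self, user):
        """Only directories explicitly granted to the user are listed."""
        return sorted(d for d, users in self._directory_grants.items() if user in users)

acl = AccessControl()
acl.define_role("developer", {"datasource:create", "datasource:modify"})
acl.define_role("viewer", set())
acl.assign_role("alice", "developer")
acl.assign_role("bob", "viewer")
acl.grant_directory("/finance", "alice")
acl.grant_directory("/shared", "alice")
acl.grant_directory("/shared", "bob")

print(acl.can("alice", "datasource:create"), acl.can("bob", "datasource:create"))
print(acl.visible_directories("bob"))
```

Unauthorized directories are simply absent from the listing, which matches the requirement that members cannot see directories they have not been granted.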

In summary, the front-end and back-end separation execution method based on the ETL engine provided by the present invention has the following advantages:

1) The Kettle front-end UI and back-end engine are separated, so Kettle changes from an integrated ETL tool into distributed services and can be used as hot-pluggable components, which enhances the availability and portability of Kettle.

2) Distributed, highly available ETL services are realized through ZooKeeper and Dubbox. In the prior art, Kettle does not provide a highly available service interface. The invention provides distributed interfaces through Dubbox, so that services are divided more finely and are independent of one another: the interfaces do not affect each other, and stopping or updating one interface does not affect calls to the others, which realizes the idea of high cohesion and low coupling and achieves high availability.

3) The C/S architecture of Kettle is changed into a B/S architecture. In the prior art, Kettle cannot connect in a network-isolated environment and so cannot be put into production there. The scheme changes it to a B/S architecture, which removes the network-isolation problem.

4) Multi-person team collaboration is supported. In the prior art, Kettle can only be installed on a local computer and operated by one person, which does not match how it is used in a production environment. On the basis of the Web application, this scheme controls access by adding a state mark in place of the global-lock code and uses a check-in and check-out detection mechanism, so that when several people operate at the same time, conflicts over the same resource are effectively avoided, the avalanche effect caused by locks is avoided, and simultaneous operation by several people is achieved.
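The idea of replacing a global lock with per-resource state marks and a check-in/check-out test can be sketched as follows. The source does not describe the mechanism in detail, so the `ResourceStates` class, its method names and the resource identifiers are assumptions made for illustration. The key property shown is that a conflicting request is reported immediately rather than left waiting on a lock.

```python
import threading

class ResourceStates:
    """Per-resource state marks in place of one global lock (illustrative)."""
    def __init__(self):
        self._owners = {}               # resource -> user who checked it out
        self._guard = threading.Lock()  # protects the dict only, held briefly

    def check_out(self, resource, user):
        """Mark the resource as being edited; return False on conflict."""
        with self._guard:
            owner = self._owners.get(resource)
            if owner is not None and owner != user:
                return False            # conflict detected, caller is not blocked
            self._owners[resource] = user
            return True

    def check_in(self, resource, user):
        """Clear the mark; only the current owner may check in."""
        with self._guard:
            if self._owners.get(resource) != user:
                return False
            del self._owners[resource]
            return True

states = ResourceStates()
a1 = states.check_out("job/daily_load", "alice")     # alice starts editing
b1 = states.check_out("job/daily_load", "bob")       # conflict is reported to bob
b2 = states.check_out("trans/clean_sales", "bob")    # a different resource is free
a2 = states.check_in("job/daily_load", "alice")      # alice finishes
b3 = states.check_out("job/daily_load", "bob")       # now bob can edit it
print(a1, b1, b2, a2, b3)
```

Because users working on different resources never contend with each other and a conflict returns at once, one slow or stalled user cannot cause the chain of lock waits described for the global-lock design.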

Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A front-end and back-end separation execution method based on an ETL engine, characterized by comprising the following steps: S1) separating the front-end UI layer of Kettle from the back-end core layer, engine layer and resource library, and integrating the separated core layer and engine layer into a WEB container; S2) making the separated resource library into an independent module for centralized management; S3) classifying the core layer functions and packaging them into independent services that provide interfaces to the outside, and introducing ZooKeeper and Dubbox to realize distributed ETL services; S4) using mxGraph as the UI layer to realize web-based drag-and-drop development, changing the C/S architecture of Kettle into a B/S architecture.

2. The ETL engine-based front-end and back-end separation execution method of claim 1, wherein the core layer of Kettle extracted in step S1 includes the functions of a general factory, general tools, program lifecycle monitoring, exception handling, log factory management, connection protocols, a plug-in factory and data source management; the extracted engine layer of Kettle includes plug-in registration management, execution log management, a Job execution engine and a transformation execution engine.

3. The ETL engine-based front-end and back-end separation execution method of claim 2, wherein step S1 imports the core jars of the core layer and the engine layer into the springboot2 framework through Maven's dependency management, so that the springboot2 microservice can directly use the core jar functions of the core layer and the engine layer; step S3 classifies and abstracts the core functions, including cluster management, Job management, transformation management, data source management and directory management, writes them into the Service layer as services, publishes the services to the ZooKeeper registration center, and realizes distributed services through Dubbo's RPC protocol; the core layer and engine layer functions of the ETL are then called by introducing the interface classes of the distributed services and the Dubbo consumer configuration file.

4. The ETL engine-based front-end and back-end separation execution method of claim 1, wherein the resource library separated in step S2 is divided into a file resource library and a database resource library, the file resource library stores all the nodes of the ETL as an XML file, the XML file stores therein a Job parameter, a transformation parameter, a connection parameter, a log parameter, a basic parameter, a cluster parameter, a sharing parameter and an execution parameter, and different parameters are distinguished by different tags; the database resource library adopts a relational database, and splits and stores various parameters into different tables.

5. The ETL engine-based front-end and back-end separate execution method of claim 1, wherein the front-end UI layer in step S4 is implemented by a combination of vue and mxGraph, a plurality of graph nodes in the ETL process are connected by straight lines to form a pipeline, each graph node represents a data processing node, and different nodes implement different data processing logics.

6. The ETL engine-based front-end and back-end separation execution method of claim 5, wherein the front-end UI layer in step S2 further extends and implements, on the basis of the basic functions, resource library directory management, directory authorization, real-time monitoring of ETL execution, log monitoring and historical log management functions, and the UI layer configures a service interface for invoking the ETL read/add/update/delete/execute/suspend/terminate/monitor/history operations provided by the execution engine layer and obtaining ETL data.

7. The ETL engine-based front-end and back-end separate execution method of claim 5, wherein the interaction process of the front-end UI layer with the back-end core layer, the engine layer and the repository is as follows:

a) after the front-end UI layer workflow designer completes configuration, the ETL workflow definition information is generated into XML;

b) the front end UI layer transmits the XML to the rear end engine layer through a Dubbo interface;

c) the back-end engine layer calls an api interface of the core layer to check the correctness of the XML, stores the XML in an independent resource library and returns a storage state to the UI layer;

d) the UI layer initiates an ETL execution task through a Dubbo interface;

e) the back-end engine layer receives the instruction to start ETL work;

f) the UI layer monitors the working process and state of the ETL.

8. The ETL engine-based front-end and back-end separation execution method of claim 1, further comprising: on the basis of the Web application, controlling access by adding a state mark in place of the global-lock code, supporting concurrent access to the same resource, and detecting whether there is a usage conflict on the same resource by using a check-in and check-out mechanism.