CN111125045A

CN111125045A - Lightweight ETL processing platform

Info

Publication number: CN111125045A
Application number: CN201911108855.5A
Authority: CN
Inventors: 李甫
Original assignee: Quantum Cloud Future Beijing Information Technology Co ltd
Current assignee: Hefei Kelast Network Technology Co ltd
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2020-05-08
Anticipated expiration: 2039-11-13
Also published as: CN111125045B

Abstract

The embodiment of the invention discloses a lightweight ETL processing platform which comprises a front-end task transceiver module and a back-end data processing module, wherein the front-end task transceiver module comprises a browser, an H5 operation unit and a MySQL database management system, the browser is used for configuring operation parameters, the H5 operation unit is used for visually displaying operations, different function libraries or operation instructions supporting Spark operations are placed in the MySQL database management system, the back-end data processing module comprises a Spark cluster, data in the Spark cluster are stored in a distributed mode, big data calculation is carried out according to the function libraries or operation instructions of the Spark operations placed in the MySQL database management system, and calculation results are output to the MySQL database management system. The invention solves the problems of high requirement on the specialty and low processing speed of the existing big data processing.

Description

Lightweight ETL processing platform

Technical Field

The embodiment of the invention relates to the technical field of big data processing, in particular to a lightweight ETL processing platform.

Background

Cloud computing systems are increasingly gaining attention as they are able to provide mass storage and reliable services. In a cloud infrastructure, thousands of interconnected computing mechanisms form a "cloud" that provides services, and a large number of users can share the "cloud" at the same time and tailor required resources according to their actual needs. As an important component in cloud data processing, most of current cloud storage systems adopt a distributed hash table mode to construct indexes, and data are organized in a key-value pair mode. The big data is a data set which is large in scale and greatly exceeds the capability range of a traditional database software tool in the aspects of acquisition, storage, management and analysis, and has the four characteristics of large data scale, rapid data circulation, various data types and low value density. The big data and cloud computing are just as inseparable as the front and back of a coin. The large data cannot be processed by a single computer necessarily, and a distributed architecture must be adopted. The method is characterized in that distributed data mining is carried out on mass data. But it must rely on distributed processing of cloud computing, distributed databases and cloud storage, virtualization technologies.

With the development of big data services, it is very common to perform ETL (Extract-Transform-Load, Chinese full name: data warehouse technology) processing on data. The ETL processing is a process of extracting data from one data source, converting the extracted data into a standard format, and loading the converted data into another target data source. Currently, there are many different types of data sources, such as: the system comprises a relational Mysql, a non-relational HBase, a data warehouse with Hive, a file storage HDFS, a computing engine Spark and a file index service Elasticissearch with a storage function; while data sources of different data types may have different interface types. Aiming at different scenes, the Spark program is slow in starting speed, is not suitable for real-time data processing, is high in requirement on the memory of a computer on the basis of memory calculation, needs a plurality of servers to form a distributed calculation cluster, needs professional technicians to carry out deep JVM parameter optimization, and is high in requirement on the professional degree.

Disclosure of Invention

Therefore, the embodiment of the invention provides a lightweight ETL processing platform to solve the problems of high requirement on the professional degree and low processing speed of the existing large data processing.

In order to achieve the above object, the embodiments of the present invention provide the following technical solutions:

the invention discloses a lightweight ETL processing platform which comprises a front-end task transceiver module and a back-end data processing module, wherein the front-end task transceiver module comprises a browser, an H5 operation unit and a MySQL database management system, the browser is used for configuring operation parameters, the H5 operation unit is used for visually displaying operations, different function libraries or operation instructions supporting Spark operations are placed in the MySQL database management system, the back-end data processing module comprises a Spark cluster, data in the Spark cluster are stored in a distributed mode, big data calculation is carried out according to the function libraries or operation instructions of the Spark operations placed in the MySQL database management system, and calculation results are output to the MySQL database management system.

Furthermore, the operating parameters are configured in the browser, and the user adds the operating process through the browser to configure the corresponding parameters.

Furthermore, the H5 operation unit comprises a background server and an operation display interface, the operation of holding and pulling is performed through the mouse, the operation display interface performs visual display, the background server identifies each different operation of the mouse, analyzes the operation logic of the user, and generates a corresponding operation task.

Further, the MySQL database management system performs data interaction with the H5 operation unit, performs task parameter assembly according to operation tasks generated by the H5 operation unit in the MySQL database management system to form a Spark operable function library or operation instruction, and transmits the Spark operable function library or operation instruction to a Spark cluster.

Further, the MySQL database management system further comprises a data receiving unit, wherein the data receiving unit receives a Spark cluster calculation result, and a user can call the calculation result from the MySQL database management system to display a data result and generate a data report.

Further, the back-end data processing module comprises a Spark cluster, the Spark cluster comprises a task cluster manager and an HDFS distributed file system, the HDFS distributed file system performs distributed storage on data and improves reading speed, the task cluster manager distributes computing tasks to a plurality of Spark computing engines, and each Spark computing engine outputs computing results.

Further, the Spark calculation engines extract data tasks from the HDFS distributed file system to perform big data calculation, the Spark calculation engines have consistent data, and the calculation processes are different.

Further, after the Spark calculation engine completes calculation, the Spark cluster outputs the calculation result to the MySQL database management system, and the MySQL database management system sorts and stores the data.

Furthermore, the MySQL database management system supports the user to expand the existing function library and realizes the addition of the personalized function library.

The embodiment of the invention has the following advantages:

the embodiment of the invention discloses a lightweight ETL processing platform, which performs operation parameter configuration through a browser, visually displays operation through an H5 operation unit, identifies operation to generate an operation task through a background server, reduces the requirement of big data processing on the specialty, enables the operation of a user to be visual, performs task parameter assembly according to the operation task generated by an H5 operation unit in a MySQL database management system to form a Spark operable function library or an operation instruction, performs big data processing operation on a Spark cluster, performs distributed storage on data through an HDFS distributed file system, improves the reading speed of the Spark calculation engine on the data, adapts to real-time data processing of different scenes, and improves the data processing efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

The structures, ratios, sizes, and the like shown in the present specification are only used for matching with the contents disclosed in the specification, so as to be understood and read by those skilled in the art, and are not used to limit the conditions that the present invention can be implemented, so that the present invention has no technical significance, and any structural modifications, changes in the ratio relationship, or adjustments of the sizes, without affecting the effects and the achievable by the present invention, should still fall within the range that the technical contents disclosed in the present invention can cover.

Fig. 1 is an architecture diagram of a lightweight ETL processing platform according to an embodiment of the present invention;

Detailed Description

The present invention is described in terms of particular embodiments, other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure, and it is to be understood that the described embodiments are merely exemplary of the invention and that it is not intended to limit the invention to the particular embodiments disclosed. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Examples

The embodiment discloses a lightweight ETL processing platform, which comprises a front-end task transceiver module and a rear-end data processing module, wherein the front-end task transceiver module is used for inputting operation and outputting and displaying processing results, the rear-end data processing module is used for processing data, and parameter configuration and data calculation are separated, so that the ETL platform is lighter.

The front-end task transceiver module comprises a browser, an H5 operation unit and a MySQL database management system, wherein the browser is used for configuring operation parameters, the H5 operation unit is used for visually displaying operations, different function libraries or operation instructions supporting Spark operations are placed in the MySQL database management system, the rear-end data processing module comprises a Spark cluster, data are stored in the Spark cluster in a distributed mode, big data calculation is carried out according to the function libraries or the operation instructions of the Spark operations placed in the MySQL database management system, and calculation results are output to the MySQL database management system.

The ETL is a process of loading data of a business system into a data warehouse after extraction, cleaning and conversion, and aims to integrate scattered, disordered and non-uniform data in an enterprise and provide an analysis basis for the decision of the enterprise; the design of the ETL is divided into three parts: data extraction, data cleaning conversion and data loading. Data extraction is to extract Data from different Data sources into an ODS (Operational Data Store), the process can also perform cleaning and conversion of some Data, different extraction methods need to be selected in the extraction process, and the operation efficiency of the ETL is improved as much as possible. Of the three ETL parts, the longest part takes time is the "T" (Transform, cleaning, conversion) part, and the part of the workload is 2/3 of the whole ETL in general. The loading of data is typically written directly into a DW (data warehouse) after the data is flushed.

The data extraction part needs to do a lot of work in the investigation stage, firstly, it is to be understood what DBMS the data comes from several business systems, the database server of each business system runs, whether manual data exists, how large the manual data amount exists, whether unstructured data exists, and the like, and the data extraction design can be carried out after the information is collected.

1. For the same data source processing method as the database system storing the DW;

this type of data source is relatively easy to design. Generally, the DBMS (SQLServer, Oracle) provides a database linking function, and a direct link relationship is established between the DW database server and the original business system, so that a Select statement can be written for direct access.

2. Processing methods for data sources other than DW database systems;

for this kind of data source, a database link can also be established in an ODBC manner, for example, between SQLServer and Oracle. If a database link cannot be established, this can be done in two ways, one is by means of a tool exporting the source data as txt or xls files and then importing these source system files into the ODS. Another approach is through a program interface.

3. For a file type data source;

business personnel can be trained to import the data into the designated database using the database tool and then extract the data from the designated database. Or may also be implemented with tools.

4. The problem of incremental updates;

for systems with large amounts of data, incremental decimation must be considered. In general, the service system records the time when the service occurs, and we can use it as an incremental flag, first determine the maximum time of the record in the ODS before each extraction, and then according to this time, the service system takes all the records that are greater than this time. With the time stamp of the business system, the business system typically has no or partial time stamp.

The data warehouse is divided into two parts, namely ODS and DW in general. The common practice is to clean the service system to the ODS, filter dirty data and incomplete data, convert the data from the ODS to the DW, and perform some calculation and aggregation of service rules.

1. Data cleansing

The task of data cleaning is to filter the data which do not meet the requirements, send the filtering result to a business administration department, and determine whether to filter or to extract after being corrected by a business unit; the data which is not qualified is mainly three categories of incomplete data, error data and repeated data.

(1) Incomplete data: this type of data is mainly a lack of information that should be available, such as a lack of regional information for the name client of the name affiliate of the provider, a failure to match the master and detail tables in the business system, etc. And filtering the data, respectively writing different Excel files according to the missing contents, submitting the Excel files to the client, requiring completion within a specified time, and writing the Excel files into a data warehouse after completion.

(2) Erroneous data: the reason for this kind of error is that the service system is not sound enough, and it is not judged after receiving the input and directly written into the background database, for example, the numerical data is input into full-angle digital characters, there is a carriage return operation behind the character data, the date format is incorrect, the date is out of bounds, etc. The data is also classified, and for the problems of full-angle characters and invisible characters before and after the data, the data can be found only by writing SQL sentences, and then the data is required to be extracted by customers after the business system is corrected. Errors of incorrect date format or date-out-of-bounds can cause failure of ETL operation, and the errors need to be picked out by a business system database in an SQL mode and are submitted to business administration departments to require time limit correction.

(3) For repeated data, especially for the situation of the dimension table, all fields of the repeated data record are derived to be confirmed and sorted by a client. Data cleansing is an iterative process that continually finds and solves problems. Whether filtering is carried out or not, whether correction is carried out or not generally requires customer confirmation, filtered data is written into an Excel file or a data table, and an email of the filtered data can be sent to a business unit every day in the initial stage of ETL development, so that errors can be corrected as soon as possible and the email can be used as a basis for verifying data in the future. Data cleansing requires care not to filter out useful data, to verify carefully for each filtering rule, and to confirm by the user.

2. Data conversion

The task of data transformation is mainly to perform inconsistent data transformation, transformation of data granularity, and computation of some business rules.

(1) Inconsistent data transformation: the process is an integrated process, and unifies the same type of data of different business systems, for example, the code of the same supplier in the settlement system is XX0001, and the code in CRM is YY0001, so that the data are unified and converted into one code after being extracted.

(2) Conversion of data granularity: business systems typically store very detailed data, and data in data warehouses is used for analysis and does not require very detailed data. Typically, business system data is aggregated at a data warehouse granularity.

(3) And (4) calculating business rules, wherein different enterprises have different business rules and different data indexes, and the data indexes need to be stored in a data warehouse for analysis after calculation in the ETL.

The user carries out the configuration of operating parameter in through the browser, the user adds to the operation process through the browser, dispose corresponding parameter, H5 operating element includes backend server and operation show interface, carry out the support through mouse and draw the operation of drawing, operation show interface carries out visual display, backend server discerns every different operation of mouse, analyze user's operating logic, generate corresponding operation task, the user adds equipment, only need carry out the support that corresponds at H5 operating element and draw the operation, type in the IP address, can accomplish the interpolation of equipment, it is operation visual, reduce data processing's professional degree.

The MySQL database management system performs data interaction with the H5 operation unit, performs task parameter assembly according to operation tasks generated by the H5 operation unit in the MySQL database management system to form a Spark operable function library or operation instruction, and transmits the Spark operable function library or operation instruction to a Spark cluster.

The back-end data processing module comprises a Spark cluster, a task cluster manager and an HDFS distributed file system are included in the Spark cluster, the HDFS distributed file system performs distributed storage on data and improves reading speed, the task cluster manager distributes computing tasks to a plurality of Spark computing engines, each Spark computing engine outputs computing results, and the Spark computing engines extract data tasks from the HDFS distributed file system to perform big data computing.

And after the calculation of the park calculation engines is finished, the calculation result is output to a MySQL database management system through the park cluster, and the MySQL database management system sorts and stores the data. The MySQL database management system further comprises a data receiving unit that receives a result of the Spark cluster calculation. And the user can call the calculation result from the MySQL database management system to display the data result and generate a data report. Meanwhile, the MySQL database management system supports the user to expand the existing function library, and personalized function library addition is realized.

The HDFS distributed file system performs distributed storage on data, improves the reading speed of a Spark calculation engine on the data, adapts to real-time data processing of different scenes, and improves data processing efficiency.

Although the invention has been described in detail above with reference to a general description and specific examples, it will be apparent to one skilled in the art that modifications or improvements may be made thereto based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A light-weight ETL processing platform is characterized by comprising a front-end task transceiver module and a back-end data processing module, wherein the front-end task transceiver module comprises a browser, an H5 operation unit and a MySQL database management system, the browser is used for configuring operation parameters, the H5 operation unit is used for visually displaying operations, different function libraries or operation instructions supporting Spark operations are placed in the MySQL database management system, the back-end data processing module comprises a Spark cluster, data in the Spark cluster are stored in a distributed mode, large data calculation is carried out according to the function libraries or operation instructions of the Spark operations placed in the MySQL database management system, and calculation results are output to the MySQL database management system.

2. The lightweight ETL processing platform of claim 1, wherein said browser is configured with operating parameters, and a user adds an operating procedure through the browser to configure corresponding parameters.

3. The lightweight ETL processing platform of claim 1, wherein the H5 operation unit comprises a background server and an operation display interface, the operation of dragging is performed through a mouse, the operation display interface performs visual display, the background server identifies each different operation of the mouse, analyzes the operation logic of the user, and generates a corresponding operation task.

4. The lightweight ETL processing platform of claim 1, wherein the MySQL database management system performs data interaction with the H5 operation unit, performs task parameter assembly according to the operation task generated by the H5 operation unit in the MySQL database management system to form a Spark operable function library or operation instruction, and transmits the Spark operable function library or operation instruction to the Spark cluster.

5. The lightweight ETL processing platform of claim 4, wherein the MySQL database management system further comprises a data receiving unit, the data receiving unit receives results of Spark cluster calculation, and a user can retrieve the calculation results from the MySQL database management system to perform data result display and data report generation.

6. The lightweight ETL processing platform of claim 1, wherein the backend data processing module comprises a Spark cluster, the Spark cluster comprises a task cluster manager and an HDFS distributed file system, the HDFS distributed file system performs distributed storage on data to increase reading speed, the task cluster manager distributes computing tasks to a plurality of Spark computing engines, and each Spark computing engine outputs computing results.

7. The lightweight ETL processing platform of claim 6, wherein the Spark calculation engines extract data tasks from the HDFS distributed file system for big data calculation, and each Spark calculation engine has consistent data and different calculation processes.

8. The lightweight ETL processing platform of claim 6, wherein after computation by the Spark computation engine is completed, the results of the computation are output to a MySQL database management system by a Spark cluster, and the MySQL database management system collates and saves the data.

9. The lightweight ETL processing platform of claim 1, wherein said MySQL database management system supports user to extend existing function libraries to enable personalized function library additions.