CN113505119B - ETL method and device based on multiple data sources - Google Patents
ETL method and device based on multiple data sources
- Publication number
- CN113505119B (application CN202110862612.1A)
- Authority
- CN
- China
- Prior art keywords
- data
- cleaning
- field
- target
- etl
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/214—Database migration support
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2457—Query processing with adaptation to user needs
- G06F16/24578—Query processing with adaptation to user needs using ranking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses an ETL method and device based on multiple data sources, wherein the method comprises the following steps: configuring custom data; selecting the output destination table and the field cleaning rules; generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue; and processing the data in the Kafka message queue with the Spark Streaming computing framework, then warehousing the processed data. The beneficial effects are as follows: (1) the Web page simplifies the data-access work of operation and maintenance developers: no commands need to be executed, no configuration files need to be modified, and no programs need to be run, so the migration and access of various complex data can be completed easily, which reduces the learning cost and the operation and maintenance difficulty; (2) by defining a general data-access rule, data from different tables in different databases can all be written in this format, and a single set of pipeline code can complete the ETL of data from all the different sources, which greatly reduces the amount of development.
Description
Technical Field
The application relates to the technical field of data processing, in particular to an ETL method and device based on multiple data sources.
Background
Currently, big data ETL (extract, transform and load) is applied more and more widely, for example in the e-commerce, financial and security fields. In the prior art, data access and storage is mostly handled by developing a custom ETL program for each pair of data source and target; but as data sources multiply and the dimensions of data analysis grow, more and more programs need to be developed, so development and operation and maintenance costs rise linearly.
Disclosure of Invention
To address the defects in the prior art, the embodiments of the application aim to provide an ETL method and device based on multiple data sources.
To achieve the above object, in a first aspect, an embodiment of the present application provides an ETL method based on multiple data sources, including:
configuring custom data;
selecting the output destination table and the field cleaning rules, wherein both the configuration and the selection are performed in a front-end Web page;
generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue;
and processing the data in the Kafka message queue with the Spark Streaming computing framework, then warehousing the processed data.
As a specific embodiment of the present application, each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
As a specific embodiment of the present application, the processing by the Spark Streaming computing framework includes the following steps:
setting the batch processing time according to the data volume and the real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
and splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation.
As a specific implementation of the application, the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are then traversed, and the RDD filter operator is used to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field.
Further, as a preferred embodiment of the present application, the DataFrame in Spark SQL is used for warehousing.
In a second aspect, an embodiment of the present application further provides an ETL apparatus based on multiple data sources, including:
the configuration module is used for configuring the custom data;
the selection module is used for selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page;
the packaging module is used for generating a corresponding message format according to the configuration and selection operations and writing it into the Kafka message queue;
and the processing module is used for processing the data in the Kafka message queue with the Spark Streaming computing framework and then warehousing the processed data.
As a specific embodiment of the present application, each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
As a specific embodiment of the present application, the processing by the Spark Streaming computing framework includes the following steps:
setting the batch processing time according to the data volume and the real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
and splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation.
As a specific implementation of the application, the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are then traversed, and the RDD filter operator is used to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field.
As a specific embodiment of the present application, the DataFrame in Spark SQL is used for warehousing.
The main beneficial effects of the embodiments of the application are:
(1) The Web page simplifies the data-access work of operation and maintenance developers: no commands need to be executed, no configuration files need to be modified, and no programs need to be run, so the migration and access of various complex data can be completed easily, which reduces the learning cost and the operation and maintenance difficulty;
(2) By defining a general data-access rule, data from different tables in different databases can all be written in this format, so a single set of pipeline code can complete the ETL of data from all the different sources, which greatly reduces the amount of development.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of an ETL method based on multiple data sources according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an arrangement according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a destination table for selecting output according to an embodiment of the present application;
FIG. 4 is a flow chart of a data processing provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an ETL device based on multiple data sources according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should be noted that the term of art is defined as commonly understood by those skilled in the computer arts.
As shown in fig. 1 to 4, an embodiment of the present application provides an ETL method based on multiple data sources, the method including:
s101, configuring custom data.
Specifically, the user can freely select the data source table, the fields, the data types, the data lengths, the data precision and so on in the Web page.
S102, selecting the output destination table and the field cleaning rules, wherein both the configuration and the selection are performed in a front-end Web page.
the destination table of the selection output indicates which table is to be written into finally.
Specifically, the user selects the data cleansing rules, the target database, the target data table and the target data types on the Web page.
Wherein each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
In practice, for the cleaning-rule part, a big data developer develops all the cleaning rules that may be used in advance, packages each rule into a function, and provides the functions to the front end for the user to select, so that the user can dynamically clean any field, and the developer only needs to update the code when a new cleaning rule is added. A minimal sketch of such packaged rules is shown below.
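The following Scala sketch is illustrative only: the rule names, the single-argument shape and the hard-coded replacement value are assumptions, not taken from the patent. It only shows how the rules described above (clearing content, removing whitespace, NULL replacement) could be packaged as named methods that can later be invoked by reflection:

```scala
// Hypothetical packaged cleaning rules; names and behaviour details are illustrative.
object CleaningRules {

  // Clear the field content entirely.
  def clearContent(value: String): String = ""

  // Delete all whitespace from the field.
  def deleteSpace(value: String): String =
    if (value == null) value else value.replaceAll("\\s+", "")

  // Replace a NULL (or empty) value with a designated replacement value
  // (hard-coded here for simplicity; in practice it would be configurable).
  def replaceNull(value: String): String =
    if (value == null || value.isEmpty) "UNKNOWN" else value
}
```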
S103, generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue.
Specifically, the user's page operations in the previous steps generate a corresponding message format, which is transmitted from the front end to the back end. The back end collects data from the corresponding library according to the database address and credentials given by the user, encapsulates the data into a fixed message format according to the field names and types specified by the user, and transmits the data from the source database to the Kafka message queue, where it waits for the big data program to fetch it.
The designed data format (i.e. the message format) has the following advantages: a single Kafka message can carry multiple source records, which saves network IO, and the data format is completely decoupled from the business. The Spark data processing program only needs the values; it does not care about the concrete field names, field types and so on, and simply processes the data according to the designed schema. A minimal sketch of such a message envelope is given below.
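The sketch below assumes illustrative field names that are not specified by the patent; the patent only states that one message can carry several source records and that the format is decoupled from any concrete business schema:

```scala
// Hypothetical unified message envelope carried on Kafka; all names are assumptions.
case class FieldSpec(
  name: String,      // field name as supplied by the user
  dataType: String,  // field type as supplied by the user
  cleanRule: String  // method name of the packaged cleaning rule, empty if none selected
)

case class EtlMessage(
  targetDb: String,        // target database the rows should be written to
  targetTable: String,     // target table inside that database
  fields: Seq[FieldSpec],  // field definitions in user-supplied order
  rows: Seq[Seq[String]]   // several source records carried in a single Kafka message
)
```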
S104, the data in the Kafka message queue is processed by the Spark Streaming computing framework, and the processed data is then put into storage.
Specifically, the processing by the Spark Streaming computing framework includes the following steps:
setting the batch processing time according to the data volume and the real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation;
that is, in the Spark data cleaning stage, the batch processing time is set according to the data volume and the real-time requirement, because too short an interval generates a large number of small files. After the Kafka data is received, it is packaged into Java objects according to the predefined data format. Since the Kafka topic carries data of many kinds, the first step is to separate the different target databases: Spark's DStream filter operator splits the stream into different DStreams according to the target database field in the data format. Because there are not too many target databases, it is enough to traverse each stream and perform the warehousing operation on it. A sketch of this stage is shown below.
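The sketch builds on the EtlMessage envelope above; the topic name, group id, batch interval, target database list and the `parseMessage`/`writeToWarehouse` helpers are assumptions for illustration, not details given by the patent:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object MultiSourceEtlStream {
  // Hypothetical helpers: parse one Kafka value into the EtlMessage envelope sketched
  // above, and warehouse one micro-batch (the per-table logic is sketched further below).
  def parseMessage(json: String): EtlMessage = ???
  def writeToWarehouse(db: String, rdd: RDD[EtlMessage]): Unit = ()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("multi-source-etl")
    // Batch interval chosen from the data volume and latency needs; too short an
    // interval produces many small files (30 s is only an example value).
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "etl-consumer"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("etl-topic"), kafkaParams))

    // Package each received Kafka record into the predefined message object.
    val messages = stream.map(record => parseMessage(record.value()))

    // Split into one DStream per target database with the DStream filter operator,
    // then traverse each stream and hand every micro-batch to the warehousing step.
    val targetDbs = Seq("db_a", "db_b") // assumption: target databases known up front
    targetDbs.foreach { db =>
      messages.filter(_.targetDb == db).foreachRDD(rdd => writeToWarehouse(db, rdd))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```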
Further, the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are then traversed, and the RDD filter operator is used to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field.
Once all the data for one target database is in one stream, the data still belongs to different tables. Since we do not know in advance how many tables there are, the first operation is to group the data of this batch by the target table field, and the RDD groupBy operator can be used to calculate how many target tables there are in total. The target tables are then traversed, again using the RDD filter operator to gather the data of the same table into one block. For example, if the grouping yields 7 tables, the filter operator is applied over 7 table traversals and screens out 7 RDDs, each corresponding to a single target table.
Once all the data of the same target table is together, it can be cleaned uniformly: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning-rule method is invoked by reflection via its method name to clean the field. Using reflection avoids convoluted code, and extracting the cleaning rules into separate methods of a single class makes maintenance easier. A sketch of this grouping and reflection-based cleaning is shown below.
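Continuing the sketch with the same assumed definitions, the per-batch logic below groups by the target table with the RDD groupBy operator, screens each table's data with the RDD filter operator, and invokes the selected cleaning rule by reflection on its method name:

```scala
import org.apache.spark.rdd.RDD

// Per micro-batch of one target database: separate the rows by target table and clean
// each field with the rule selected for it, invoked by reflection via its method name.
// EtlMessage, FieldSpec and CleaningRules are the hypothetical definitions sketched above.
def cleanBatch(rdd: RDD[EtlMessage]): Unit = {
  // groupBy on the target-table field tells us how many target tables this batch contains.
  val tables = rdd.groupBy(_.targetTable).keys.collect()

  // Traverse the tables; filter gathers the data of one table into one RDD, so every
  // resulting RDD corresponds to a single target table.
  tables.foreach { table =>
    val tableRdd = rdd.filter(_.targetTable == table)

    val cleanedRows: RDD[Seq[String]] = tableRdd.flatMap { msg =>
      msg.rows.map { row =>
        row.zip(msg.fields).map { case (value, field) =>
          if (field.cleanRule.isEmpty) value
          else {
            // Look up the packaged cleaning rule by its method name and invoke it.
            val rule = CleaningRules.getClass.getMethod(field.cleanRule, classOf[String])
            rule.invoke(CleaningRules, value).asInstanceOf[String]
          }
        }
      }
    }
    // cleanedRows is now ready for the DataFrame warehousing step sketched below.
  }
}
```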
Finally, after cleaning, the data can be warehoused. Instead of warehousing with standard SQL, we use the DataFrame in Spark SQL. Compared with the JDBC mode, Spark SQL writes files directly to HDFS, which is faster than the TCP-based Thrift connection used by JDBC and avoids operations such as packing and unpacking. Using the DataFrame approach requires constructing a schema and Rows: the field names and field types are extracted in order from the message format supplied by the user to construct the schema, and the field values supplied by the user are used to construct the Rows. A DataFrame is then built from the schema and the Rows, and the DataFrame's write method is called to write the data into the target database.
Target warehouses that support the JDBC connection mode can also be written to with the DataFrame approach; those that do not support JDBC can only be loaded in a generic way. A sketch of the DataFrame warehousing step is shown below.
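The sketch keeps every column as a string for brevity; a real implementation would map the user-supplied types onto the matching Spark SQL types, and the write mode and table naming are assumptions:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Build a schema from the field names carried in the message, wrap the cleaned values
// into Rows, and write the DataFrame into the target table (landing as files rather
// than row by row over a JDBC connection).
def writeTable(spark: SparkSession,
               fields: Seq[FieldSpec],
               cleanedRows: RDD[Seq[String]],
               targetDb: String,
               targetTable: String): Unit = {
  // Schema in the order the user supplied the fields; every column is a string here.
  val schema = StructType(fields.map(f => StructField(f.name, StringType, nullable = true)))

  // One Row per cleaned record.
  val rowRdd = cleanedRows.map(values => Row.fromSeq(values))

  val df = spark.createDataFrame(rowRdd, schema)
  df.write.mode("append").insertInto(s"$targetDb.$targetTable")
}
```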
In addition, it should be noted that a table needs to be built in Hive (i.e. a table is built in the data warehouse; there are many kinds of data warehouse, and we currently use a Hive database and a thunder database, but other kinds of data warehouse can also be used). The local data and the error type of every error report are collected and written into the Hive error table, and the errors are counted and evaluated regularly. Evaluating the error data regularly helps to find hidden bugs in the code so that the code can be modified or optimized and similar errors occur less often; collecting the error report data also prevents key data from being lost because of a code bug, thereby ensuring that no data is lost. A minimal sketch of this error collection is shown below.
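The sketch assumes a hypothetical Hive error table and column names that the patent does not specify:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.current_timestamp

// Write records that fail cleaning or loading, together with their error type, into an
// error table in the warehouse so they can be counted and reviewed regularly.
def recordErrors(spark: SparkSession, errors: Seq[(String, String)]): Unit = {
  import spark.implicits._
  errors.toDF("raw_data", "error_type")                    // (original record, error type)
    .withColumn("collected_at", current_timestamp())
    .write.mode("append").insertInto("ods.etl_error_log")  // hypothetical Hive error table
}
```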
According to the scheme, the Web page simplifies the data-access work of operation and maintenance developers: no commands need to be executed, no configuration files need to be modified, and no programs need to be run, so the migration and access of various complex data can be completed easily, which reduces the learning cost and the operation and maintenance difficulty;
by defining a general data-access rule, data from different tables in different databases can all be written in this format, so a single set of pipeline code can complete the ETL of data from all the different sources, which greatly reduces the amount of development.
Based on the same inventive concept, referring to fig. 5, the embodiment of the present application further provides an ETL device based on multiple data sources; since the principle by which the device solves the problem is similar to that of the ETL method based on multiple data sources, its specific implementation can refer to the implementation steps of the method and is not repeated here.
The device comprises:
the configuration module is used for configuring the custom data;
the selection module is used for selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page; each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
The packaging module is used for generating a corresponding message format according to the configuration and selection operations and writing it into the Kafka message queue;
the back end collects the data of the corresponding library according to the database address and credentials given by the user, encapsulates the data into a fixed message format according to the field names and types specified by the user, and transmits the data from the source database to the Kafka message queue, where it waits for the big data program to fetch it.
The processing module is used for processing the data in the Kafka message queue with the Spark Streaming computing framework and then warehousing the processed data.
That is, the batch processing time is set according to the data volume and the real-time requirement;
the received Kafka data is packaged into Java objects according to the predefined data format;
the stream is split into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and each stream is traversed to perform the warehousing operation;
the data of the same batch in a stream are grouped according to the target table field, the RDD groupBy operator is used to calculate the number of target tables, the target tables are traversed, and the RDD filter operator is used again to gather the data of the same table into one block, so that each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, a unified cleaning operation is performed: the field value at the corresponding position of the array is taken out, the value of the corresponding cleaning field is taken out, and the cleaning rule is invoked by reflection via its method name to clean the field; meanwhile, the DataFrame in Spark SQL is used for warehousing.
The scheme is applied when there are multiple data sources. It replaces the traditional approach of developing a separate program to clean and store each kind of data: the same effect as many programs can be achieved without developing and deploying them, and when certain data cleaning rules need to be changed, this method replaces the traditional approach of modifying code, which greatly reduces the amount of development;
the Web page simplifies the data-access work of operation and maintenance developers, and the migration and access of various complex data can be completed easily without executing programs, which reduces the learning cost and the operation and maintenance difficulty.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (6)
1. An ETL method based on multiple data sources, comprising:
configuring custom data;
selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page;
generating a corresponding message format according to the configuration and selection operations, and writing it into a Kafka message queue;
processing the data in the Kafka message queue by using the Spark Streaming computing framework, and then warehousing the processed data;
the processing by the Spark Streaming computing framework comprises the following steps:
setting the batch processing time according to the data volume and real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation;
grouping the data of the same batch in a stream according to the target table field, calculating the number of target tables by using the RDD groupBy operator, traversing the target tables, and using the RDD filter operator again to gather the data of the same table into one block, wherein each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, performing a unified cleaning operation: taking out the field value at the corresponding position of the array, taking out the value of the corresponding cleaning field, and invoking the cleaning rule by reflection via its method name to clean the field.
2. The multi-data-source based ETL method of claim 1, wherein each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
3. An ETL method based on multiple data sources as claimed in claim 2, wherein the warehousing is performed using the DataFrame in Spark SQL.
4. An ETL apparatus based on multiple data sources, comprising:
the configuration module is used for configuring the custom data;
the selection module is used for selecting the output destination table and the field cleaning rules; wherein both the configuration and the selection are performed in a front-end Web page;
the packaging module is used for generating a corresponding message format according to the configuration and selection operations and writing it into the Kafka message queue;
the processing module is used for processing the data in the Kafka message queue with the Spark Streaming computing framework and then warehousing the processed data;
the processing by the Spark Streaming computing framework comprises the following steps:
setting the batch processing time according to the data volume and real-time requirements;
packaging the received Kafka data into Java objects according to a predefined data format;
splitting the stream into different DStreams according to the target database field in the data format by using Spark's DStream filter operator, and traversing each stream to perform the warehousing operation;
grouping the data of the same batch in a stream according to the target table field, calculating the number of target tables by using the RDD groupBy operator, traversing the target tables, and using the RDD filter operator again to gather the data of the same table into one block, wherein each resulting RDD corresponds to a single target table;
when all the data of the same target table are together, performing a unified cleaning operation: taking out the field value at the corresponding position of the array, taking out the value of the corresponding cleaning field, and invoking the cleaning rule by reflection via its method name to clean the field.
5. The multi-data-source based ETL device of claim 4, wherein each cleaning rule is a pre-packaged function; the cleaning rules include:
clearing field content and removing whitespace;
NULL value replacement, replacing NULL with a designated value.
6. The multi-data-source based ETL device according to claim 4, wherein the warehousing is performed using the DataFrame in Spark SQL.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110862612.1A CN113505119B (en) | 2021-07-29 | 2021-07-29 | ETL method and device based on multiple data sources |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110862612.1A CN113505119B (en) | 2021-07-29 | 2021-07-29 | ETL method and device based on multiple data sources |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113505119A CN113505119A (en) | 2021-10-15 |
CN113505119B (en) | 2023-08-29 |
Family
ID=78015041
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110862612.1A Active CN113505119B (en) | 2021-07-29 | 2021-07-29 | ETL method and device based on multiple data sources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113505119B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114691644A (en) * | 2022-03-15 | 2022-07-01 | 平安科技(深圳)有限公司 | Data migration method, device, equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372105A (en) * | 2016-08-19 | 2017-02-01 | 中国科学院信息工程研究所 | Spark platform-based microblog data preprocessing method |
CN106528865A (en) * | 2016-12-02 | 2017-03-22 | 航天科工智慧产业发展有限公司 | Quick and accurate cleaning method of traffic big data |
CN109062551A (en) * | 2018-08-08 | 2018-12-21 | 青岛大快搜索计算技术股份有限公司 | Development Framework based on big data exploitation command set |
CN110490229A (en) * | 2019-07-16 | 2019-11-22 | 昆明理工大学 | A kind of electric energy meter calibration error diagnostics method based on spark and clustering algorithm |
CN110601866A (en) * | 2018-06-13 | 2019-12-20 | 阿里巴巴集团控股有限公司 | Flow analysis system, data acquisition device, data processing device and method |
CN111090640A (en) * | 2019-11-13 | 2020-05-01 | 山东中磁视讯股份有限公司 | ETL data cleaning method and system |
CN111273607A (en) * | 2018-12-04 | 2020-06-12 | 沈阳高精数控智能技术股份有限公司 | Spark-based numerical control machine tool running state monitoring method |
CN111858569A (en) * | 2020-07-01 | 2020-10-30 | 长江岩土工程总公司(武汉) | Mass data cleaning method based on stream computing |
CN112488745A (en) * | 2020-10-27 | 2021-03-12 | 广东电力信息科技有限公司 | Intelligent charge control management method, device, equipment and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10445062B2 (en) * | 2016-09-15 | 2019-10-15 | Oracle International Corporation | Techniques for dataset similarity discovery |
- 2021
- 2021-07-29 CN CN202110862612.1A patent/CN113505119B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372105A (en) * | 2016-08-19 | 2017-02-01 | 中国科学院信息工程研究所 | Spark platform-based microblog data preprocessing method |
CN106528865A (en) * | 2016-12-02 | 2017-03-22 | 航天科工智慧产业发展有限公司 | Quick and accurate cleaning method of traffic big data |
CN110601866A (en) * | 2018-06-13 | 2019-12-20 | 阿里巴巴集团控股有限公司 | Flow analysis system, data acquisition device, data processing device and method |
CN109062551A (en) * | 2018-08-08 | 2018-12-21 | 青岛大快搜索计算技术股份有限公司 | Development Framework based on big data exploitation command set |
CN111273607A (en) * | 2018-12-04 | 2020-06-12 | 沈阳高精数控智能技术股份有限公司 | Spark-based numerical control machine tool running state monitoring method |
CN110490229A (en) * | 2019-07-16 | 2019-11-22 | 昆明理工大学 | A kind of electric energy meter calibration error diagnostics method based on spark and clustering algorithm |
CN111090640A (en) * | 2019-11-13 | 2020-05-01 | 山东中磁视讯股份有限公司 | ETL data cleaning method and system |
CN111858569A (en) * | 2020-07-01 | 2020-10-30 | 长江岩土工程总公司(武汉) | Mass data cleaning method based on stream computing |
CN112488745A (en) * | 2020-10-27 | 2021-03-12 | 广东电力信息科技有限公司 | Intelligent charge control management method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Norms and standards for the data cleansing stage urgently need to be strengthened in the big data era; Qing Sude et al.; 《世界电信》 (World Telecommunications), Issue 07; 55-60 *
Also Published As
Publication number | Publication date |
---|---|
CN113505119A (en) | 2021-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11204928B2 (en) | Reducing flow delays in a data streaming application caused by lookup operations | |
CN107506451B (en) | Abnormal information monitoring method and device for data interaction | |
Bajaber et al. | Big data 2.0 processing systems: Taxonomy and open challenges | |
US10956422B2 (en) | Integrating event processing with map-reduce | |
Logothetis et al. | Stateful bulk processing for incremental analytics | |
US11914566B2 (en) | Indexing and relaying data to hot storage | |
US11693912B2 (en) | Adapting database queries for data virtualization over combined database stores | |
US20100313063A1 (en) | Mitigating reduction in availability level during maintenance of nodes in a cluster | |
WO2020238130A1 (en) | Big data log monitoring method and apparatus, storage medium, and computer device | |
KR101785959B1 (en) | Columnar storage representations of records | |
US8438144B2 (en) | Transactionally consistent database replay in an environment with connection pooling | |
CN111752959B (en) | Real-time database cross-database SQL interaction method and system | |
US20150293964A1 (en) | Applications of automated discovery of template patterns based on received requests | |
US20070250517A1 (en) | Method and Apparatus for Autonomically Maintaining Latent Auxiliary Database Structures for Use in Executing Database Queries | |
US10783142B2 (en) | Efficient data retrieval in staged use of in-memory cursor duration temporary tables | |
US20100293161A1 (en) | Automatically avoiding unconstrained cartesian product joins | |
EP3384385B1 (en) | Methods and systems for mapping object oriented/functional languages to database languages | |
US11803550B2 (en) | Workload-aware column imprints | |
CN113360581A (en) | Data processing method, device and storage medium | |
CN111078705A (en) | Spark platform based data index establishing method and data query method | |
CN113505119B (en) | ETL method and device based on multiple data sources | |
Xin et al. | Enhancing the interactivity of dataframe queries by leveraging think time | |
CN113806429A (en) | Canvas type log analysis method based on large data stream processing framework | |
CN116340363B (en) | Data storage and loading method based on relational database and related device | |
US10713150B1 (en) | Accurate test coverage of generated code |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 266000 Room 302, building 3, Office No. 77, Lingyan Road, Huangdao District, Qingdao, Shandong Province; Applicant after: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd.; Address before: 266000 3rd floor, building 3, optical valley software park, 396 Emeishan Road, Huangdao District, Qingdao City, Shandong Province; Applicant before: QINGDAO YISA DATA TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | |