CN113204593A - ETL job development system and computer equipment based on big data calculation engine - Google Patents

ETL job development system and computer equipment based on big data calculation engine

Info

Publication number
CN113204593A
Authority
CN
China
Prior art keywords
big data
etl
subsystem
configuration file
rendering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110590257.7A
Other languages
Chinese (zh)
Inventor
董加强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Qiniu Information Technology Co ltd
Original Assignee
Shanghai Qiniu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Qiniu Information Technology Co ltd filed Critical Shanghai Qiniu Information Technology Co ltd
Priority to CN202110590257.7A priority Critical patent/CN113204593A/en
Publication of CN113204593A publication Critical patent/CN113204593A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis

Abstract

The embodiments of the present application belong to the technical field of big data and relate to an ETL job development system based on a big data computing engine, and to a computer device. The system comprises: a configuration file input and parsing subsystem, configured to receive a configuration file for which ETL job development is to be performed and parse the configuration file to obtain a parsing result; a rendering and registration subsystem, configured to render and dynamically compile the parsing result and register it with the big data computing engine; and an execution subsystem, configured to call the registered big data computing engine to carry out the ETL job development. By adopting a configuration-based development mode, dynamically parsing and rendering the ETL job instructions, and combining them with a big data computing engine, a distributed big data processing system is formed, which reduces repetitive work, shortens the development process, and lowers the difficulty of development, operation and maintenance.

Description

ETL job development system and computer equipment based on big data calculation engine
Technical Field
The invention relates to the technical field of big data, and in particular to an ETL (Extract-Transform-Load) job development system and computer device based on a big data computing engine.
Background
Currently, analysis of massive data is a very common internet service scenario. Such massive data may be stored in an offline data source (e.g. HDFS) or a real-time data source (e.g. Kafka), and common data ETL (Extract-Transform-Load) operations require the data to be aggregated, accumulated or filtered and then written into an external storage system such as HDFS, RDS or Kafka.
Developers can use common big data frameworks such as Flink or Spark and implement the ETL process in combination with their specific data processing business. However, this requires the developers to be very familiar with the Flink or Spark API and with common programming languages such as Java or Scala, and each data processing requirement needs its own dedicated project. The resulting repetitive work, long development cycles and lack of unified deployment make development, operation and maintenance difficult.
Disclosure of Invention
The embodiment of the application aims to provide an ETL job development system based on a big data computing engine so as to reduce the difficulty of development, operation and maintenance in the big data processing process.
In order to solve the above technical problem, an embodiment of the present application provides an ETL job development system based on a big data computing engine, which adopts the following technical solutions:
the ETL job development system based on the big data computing engine comprises:
a configuration file input and parsing subsystem, configured to receive a configuration file for which ETL job development is to be performed and parse the configuration file to obtain a parsing result;
a rendering and registration subsystem, configured to render and dynamically compile the parsing result and register it with the big data computing engine;
and an execution subsystem, configured to call the registered big data computing engine to carry out the ETL job development.
Further, the configuration file comprises a user-defined job file.
Further, the job file is parsed by a syntax parsing tool to obtain the parsing result, and the parsing result comprises a job instruction.
Further, the step of rendering and dynamically compiling the parsing result includes:
reading a predefined rendering variable;
and performing real-time rendering on the job instruction by using the rendering variable.
Further, the dynamic compiling includes dynamically compiling the rendered job instructions into executable instructions of the big data computing engine.
Further, registering with the big data computing engine includes calling a registration interface of the big data computing engine to execute the executable instructions.
Further, the system further comprises a data input subsystem for inputting externally input job data to the big data computing engine.
Further, the system further comprises a data output subsystem for outputting the calculation result of the big data computing engine to an external storage.
Furthermore, the system further comprises a monitoring subsystem for monitoring the entire ETL job development process of the system.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the functions of the above big data computing engine based ETL job development system or of each of its subsystems.
Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects. An ETL job development system based on a big data computing engine is provided, comprising: a configuration file input and parsing subsystem, configured to receive a configuration file for which ETL job development is to be performed and parse the configuration file to obtain a parsing result; a rendering and registration subsystem, configured to render and dynamically compile the parsing result and register it with the big data computing engine; and an execution subsystem, configured to call the registered big data computing engine to carry out the ETL job development. By adopting a configuration-based development mode, dynamically parsing and rendering the ETL job instructions, and combining them with a big data computing engine, a distributed big data processing system is formed, which reduces repetitive work, shortens the development process, and lowers the difficulty of development, operation and maintenance.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 illustrates a schematic structural diagram of an ETL job development system based on a big data computing engine according to an embodiment of the present application;
fig. 2 shows a schematic structural diagram of a computer device provided according to an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. The terms "connected" and "coupled", when used in this application, include both direct and indirect connections (couplings), unless otherwise indicated.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions of the present application better understood by those skilled in the art, the following background knowledge is first explained. ETL (Extract-Transform-Load) is the process of extracting, cleansing and transforming a large amount of data from business systems and loading it into a data warehouse; its purpose is to integrate the scattered, disordered and non-uniformly standardized data within an enterprise and provide an analysis basis for enterprise decision-making. ETL is an important link in a BI project: it typically consumes at least 1/3 of the effort of the entire project, and the quality of the ETL design directly determines the success or failure of the BI project. The design of ETL is divided into three parts: data extraction, data cleansing and transformation, and data loading, and ETL design proceeds from these three parts. Data extraction pulls data from different data sources into an ODS (Operational Data Store); some data cleansing and transformation can be performed in this process, and an appropriate extraction method needs to be chosen so that the ETL runs as efficiently as possible. Of the three parts, the most time-consuming is the "T" (transform, i.e. cleansing and conversion) part, which generally accounts for 2/3 of the whole ETL workload. Data loading is typically done by writing the cleansed data directly into a DW (Data Warehouse).
There are several ways to implement ETL, three of which are commonly used. The first relies on ETL tools (such as Oracle's OWB, the DTS of SQL Server 2000, the SSIS service of SQL Server 2005, or a big data computing engine), the second is implemented purely in SQL, and the third combines ETL tools with SQL. The first two methods have their own advantages and disadvantages: with a tool, an ETL project can be set up quickly and complex coding tasks are shielded, which increases speed and reduces difficulty, but flexibility is lacking; the SQL approach is flexible and improves ETL execution efficiency, but the coding is complex and the technical requirements are higher. The third method combines the advantages of the first two and can greatly improve the speed and efficiency of ETL development.
Therefore, the present application innovatively provides an ETL job development system based on a big data computing engine, so as to reduce the difficulty of development, operation and maintenance in big data processing. The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in FIG. 1, which illustrates a schematic structural diagram of one embodiment of a big data computing engine based ETL job development system 100 according to the present application, the ETL job development system based on the big data computing engine comprises:
a configuration file input and parsing subsystem 101, configured to receive a configuration file for which ETL job development is to be performed and parse the configuration file to obtain a parsing result;
a rendering and registration subsystem 102, configured to render and dynamically compile the parsing result and register it with the big data computing engine;
and an execution subsystem 103, configured to call the registered big data computing engine to carry out the ETL job development.
The configuration file comprises a user-defined job file. The job file can be a user-defined SQL statement script or a DSL file with SQL-like syntax, and it is automatically read, parsed and executed when the system runs. A DSL (domain-specific language) is a computer programming language with limited expressiveness that is focused on a particular domain; it can be regarded as an abstraction mechanism. By abstracting the model-building and system-building functions through the DSL and extracting common code, repetitive work is reduced and development efficiency is improved. Processing a DSL typically comprises the following steps: defining a DSL script file, parsing the script file to obtain a semantic model, and generating program code or executing the model according to the semantic model.
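As a purely illustrative aid, the following minimal Python sketch mirrors the three DSL processing steps just listed; the file name etl_job.dsl and the helper functions are hypothetical stand-ins for the real parser and execution layer (ANTLR and the big data computing engine) described below.

# Minimal sketch of the three DSL processing steps; etl_job.dsl and all
# helper names are hypothetical, not part of the published system.
def load_dsl_script(path):
    # Step 1: read the user-defined DSL script file.
    with open(path, encoding="utf-8") as f:
        return f.read()

def parse_to_semantic_model(script):
    # Step 2: stand-in for syntax parsing; splits the script into statements.
    return [s.strip() for s in script.split(";") if s.strip()]

def execute_model(statements):
    # Step 3: hand each statement to the execution layer (stubbed here).
    for statement in statements:
        print("would execute:", statement)

execute_model(parse_to_semantic_model(load_dsl_script("etl_job.dsl")))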
Further, the job file is parsed by a syntax parsing tool to obtain the parsing result, and the parsing result comprises a job instruction. The parsing tool may be ANTLR or any other tool capable of syntax parsing. ANTLR includes a lexical analyzer and a parser, and allows us to define the lexical rules for recognizing a character stream and the parsing rules for interpreting it. ANTLR then automatically generates the corresponding lexical/syntax parser from the user-provided grammar file (e.g. the DSL file described above). With ANTLR, the user can compile the input text and convert it into other forms, such as an AST (Abstract Syntax Tree, i.e. the semantic model) or other data manipulation instructions that the computer can further parse and execute, for example the ETL job development instructions in this embodiment. The DSL makes it easy for a user to define the ETL job development file; the DSL file is then parsed with ANTLR to obtain a series of data processing instructions, so the first step of ETL job development can be completed quickly without the user having to master a programming language (such as Java or Scala), which lowers the difficulty of ETL job development.
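For illustration only, the following Python sketch shows how an ANTLR-generated parser is typically driven from the antlr4 runtime to turn a DSL text into a list of job instructions; the generated class names (EtlDslLexer, EtlDslParser, EtlDslListener) and the rule names script/statement are assumptions, since the grammar itself is not published here.

from antlr4 import InputStream, CommonTokenStream, ParseTreeWalker
# EtlDslLexer/EtlDslParser/EtlDslListener are hypothetical classes that ANTLR
# would generate from a user-provided grammar; they are not published here.
from EtlDslLexer import EtlDslLexer
from EtlDslParser import EtlDslParser
from EtlDslListener import EtlDslListener

class JobInstructionCollector(EtlDslListener):
    # Collects the text of each parsed statement as a "job instruction".
    def __init__(self):
        self.instructions = []

    def exitStatement(self, ctx):  # assumes a grammar rule named "statement"
        self.instructions.append(ctx.getText())

def parse_job_file(dsl_text):
    lexer = EtlDslLexer(InputStream(dsl_text))
    parser = EtlDslParser(CommonTokenStream(lexer))
    tree = parser.script()         # assumes a start rule named "script"
    collector = JobInstructionCollector()
    ParseTreeWalker().walk(collector, tree)
    return collector.instructions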
Further, the step of rendering the parsing result includes:
reading a predefined rendering variable;
and performing real-time rendering on the job instruction by using the rendering variable.
Specifically, the configuration file further includes a variable rendering file, which is used to pre-define rendering variables. Variables are defined in a -key value format (multiple variables are supported), for example:
cmd_args=[
"/home/qboxserver/defy-log-dw-cdn-sla/_package/run.sh",
"-day{}".format(day),
"-hour{}".format(hour)]
where day and hour are predefined rendering variables. The rendering variables are read when the system executes the job instructions and are then used to render, in real time, the corresponding variables referenced in the DSL file, such as the variables <day> and <hour> referenced by the following statement: create batch input table raw_log USING part(path="hdfs://cdnloghadoop/dw/summary/day=<day>/hour=<hour>"). Through this real-time variable rendering, parameters can be changed or specified at runtime, so that job instructions can be generated dynamically, which improves the flexibility and reusability of the system.
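A minimal Python sketch of this real-time rendering is given below; it reuses the example statement from this paragraph, while the render() helper and the <name> placeholder substitution are assumptions about how the rendering variables could be applied.

# Sketch of real-time variable rendering; render() and the placeholder
# handling are illustrative assumptions.
def render(statement, variables):
    # Substitute each predefined rendering variable into the job instruction.
    for key, value in variables.items():
        statement = statement.replace("<{}>".format(key), str(value))
    return statement

template = ('create batch input table raw_log USING part'
            '(path="hdfs://cdnloghadoop/dw/summary/day=<day>/hour=<hour>")')
print(render(template, {"day": "2021-05-28", "hour": "10"}))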
Furthermore, the dynamic compiling includes dynamically compiling the rendered job instruction into an executable instruction of the big data computing engine, so that it is suitable for that engine. A registration interface of the big data computing engine is then called to register the rendered and dynamically compiled executable instruction; the registered executable instruction can be executed by the execution subsystem calling the big data computing engine, thereby carrying out the corresponding big data ETL job development and outputting a calculation result.
Specifically, the big data computing engine may be Spark or Flink. Spark is a memory-based parallel big data computing framework that can be used to build large-scale, low-latency data analysis applications, and its computation is fast. In addition, Spark provides a complete and powerful technology stack, including SQL query, streaming computation, machine learning and graph algorithm components, which can be seamlessly integrated within the same application and are sufficient for handling complex computations. Spark supports multiple deployment modes: it can run in standalone cluster mode or in a cloud environment, and it can access a variety of data sources such as HDFS, HBase and Hive.
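As a rough analogy only (the patent's own dynamic-compilation and registration interfaces are not published), the rendered statements could be handed to Spark through its SQL entry point, as in the following PySpark sketch; the application name is illustrative.

from pyspark.sql import SparkSession

# Rough analogy of the register-and-execute step using plain Spark SQL; the
# system's own compilation/registration interfaces are not published.
spark = SparkSession.builder.appName("etl-job-dev-sketch").getOrCreate()

def register_and_execute(statements):
    results = []
    for statement in statements:
        # spark.sql() parses, plans and runs the statement on the engine.
        results.append(spark.sql(statement))
    return results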
By leveraging these advantages of Spark, the ETL job development system of this embodiment can be deployed as a distributed big data processing system; that is, each subsystem can be deployed independently on different computer devices, which reduces the coupling between subsystems, improves the stability of the system, and facilitates development, operation and maintenance.
In addition, the system further includes a data input subsystem for feeding externally input job data to the big data computing engine Spark, and a data output subsystem for outputting the calculation results of the big data computing engine Spark to an external storage. The system also includes a monitoring subsystem for monitoring the entire ETL job development process, for example assessing data throughput and latency, displaying them in real time and raising early warnings, which makes it easier for operation and maintenance personnel to inspect and locate problems.
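Purely as an illustration of the monitoring idea (the monitoring implementation itself is not published), a threshold check on throughput and delay might look like the following Python sketch; the metric names and thresholds are assumptions.

# Illustrative monitoring check; metric names and thresholds are assumptions.
def check_metrics(metrics, max_delay_seconds=300, min_rows_per_second=1.0):
    alerts = []
    if metrics.get("delay_seconds", 0) > max_delay_seconds:
        alerts.append("processing delay too high: {}s".format(metrics["delay_seconds"]))
    if metrics.get("rows_per_second", 0.0) < min_rows_per_second:
        alerts.append("throughput too low: {} rows/s".format(metrics["rows_per_second"]))
    return alerts

print(check_metrics({"delay_seconds": 420, "rows_per_second": 0.2}))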
When developing with the ETL job development system based on the big data computing engine, there is no need to create Java/Scala projects; development is carried out in a configuration-based mode, so developers do not need to master programming languages such as Java or Scala or to understand the Spark or Flink APIs, and both streaming and batch big data processing can be implemented using SQL syntax. Specifically, the development steps are as follows:
stream processing: stream processing development generally uses Kafka (a distributed messaging system), consuming a Kafka topic, aggregating the topic's data and continuously outputting it to an external system (e.g. HDFS or a database); a minimal sketch follows the file list below. The procedure is as follows:
creating a new directory under the defy-streaming directory represents creating a new stream processing project, and the files that need to be modified are as follows:
a DSL file (i.e. the DSL file described above: an SQL-like set of statements describing how to read data, transform it and write it to an external system);
gather (the parameters required to run the job, including resource attributes such as executor_cores, executor_memory and the number of executors);
application.conf (containing the operating parameters of the defy-streaming project);
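For orientation only, the following PySpark Structured Streaming sketch shows the kind of pipeline the configuration above would stand for (consume a Kafka topic, aggregate, continuously write to an external system); the broker address, topic name, paths and window size are illustrative, and the Spark Kafka connector package is assumed to be available on the cluster.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Illustrative streaming pipeline; broker, topic, paths and window size are
# assumptions, not values taken from this application.
spark = SparkSession.builder.appName("defy-streaming-sketch").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "raw_log")
          .load())

counts = (events
          .selectExpr("CAST(value AS STRING) AS value", "timestamp")
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window(col("timestamp"), "1 minute"))
          .count())

query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "hdfs://example/output")
         .option("checkpointLocation", "hdfs://example/checkpoint")
         .start())
query.awaitTermination()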
batch processing: a batch processing task generally reads files on HDFS, performs transformation operations such as aggregation, and finally outputs the data to an external system (e.g. HDFS or a database); a minimal sketch follows the file list below. The procedure is as follows:
creating a new directory under the defy-airflow directory represents creating a new batch project, and the files that need to be modified are:
a DSL file (an SQL-like set of statements describing how to read data, transform it and write it to an external system);
gather (the parameters required to run the job, including resource attributes such as executor_cores, executor_memory and the number of executors);
application.conf (containing the operating parameters of the defy-streaming project);
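For orientation only, a batch job of the kind described above could look like the following PySpark sketch; the input/output paths and the grouping column are illustrative assumptions.

from pyspark.sql import SparkSession

# Illustrative batch job; all paths and column names are assumptions.
spark = SparkSession.builder.appName("defy-batch-sketch").getOrCreate()

# Read one partition of files from HDFS, aggregate, and write the result back.
raw_log = spark.read.parquet("hdfs://example/dw/summary/day=2021-05-28")
summary = raw_log.groupBy("domain").count()   # "domain" is a hypothetical column
summary.write.mode("overwrite").parquet("hdfs://example/dw/report/day=2021-05-28")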
a new Airflow DAG is then created under the airflow directory to schedule the batch project; the file to be added is the dag file.
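As a hedged illustration of this scheduling step (only the need for a dag file is stated above), an Airflow 2.x style DAG wrapping a spark-submit call might look like this; the dag id, schedule and command line are assumptions.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative Airflow DAG; dag_id, schedule and the spark-submit command
# line are assumptions, not taken from this application.
with DAG(
    dag_id="defy_batch_etl",
    start_date=datetime(2021, 5, 28),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    run_batch_job = BashOperator(
        task_id="run_batch_job",
        bash_command="spark-submit --deploy-mode cluster /path/to/etl_job.py",
    )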
In addition, ANTLR can be used to extend the SQL syntax of the big data engine Spark/Flink so that the engine supports stream processing; for example, Spark can be extended with DDL for creating stream and dimension tables, and with watermark semantics definition and parsing, to support Spark stream computation.
It should be understood that although the subsystems in the structural diagram of the drawings are shown in a sequence indicated by arrows, they are not necessarily executed strictly in that order. Unless explicitly stated otherwise herein, there is no strict ordering constraint on the execution of these subsystems, and they may run in other orders. Moreover, at least some of the subsystems in the schematic diagram of the figures may involve multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 2 in detail, fig. 2 is a block diagram of a basic structure of a computer device 2 according to the present embodiment.
The computer device 2 comprises a memory 21, a processor 22 and a network interface 23, which are communicatively connected to each other via a system bus. It is noted that only a computer device 2 having the components 21-23 is shown, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 21 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device 2. Of course, the memory 21 may also comprise both an internal storage unit of the computer device 2 and an external storage device thereof. In this embodiment, the memory 21 is generally used for storing an operating system installed on the computer device 2 and various types of application software, such as program codes of an ETL job development system based on a big data computing engine. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or process data, for example, the program code of the ETL job development system based on the big data computing engine.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is generally used for establishing communication connection between the computer device 2 and other electronic devices.
The present application further provides another embodiment, namely a computer-readable storage medium storing a program corresponding to the big data computing engine based ETL job development system, where the program can be executed by at least one processor so that the at least one processor executes the steps of the program corresponding to the big data computing engine based ETL job development system.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present invention, and that the appended drawings illustrate preferred embodiments of the invention without limiting its scope. The present application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application is thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the present application may still be modified, or that some of the features described in the foregoing embodiments may be replaced by equivalents. All equivalent structures made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the scope of protection of the present application.

Claims (10)

1. An ETL job development system based on a big data computing engine, comprising:
a configuration file input and parsing subsystem, configured to receive a configuration file for which ETL job development is to be performed and parse the configuration file to obtain a parsing result;
a rendering and registration subsystem, configured to render and dynamically compile the parsing result and register it with the big data computing engine;
and an execution subsystem, configured to call the registered big data computing engine to carry out the ETL job development.
2. The ETL job development system of claim 1, wherein said configuration file comprises a user-customized job file.
3. The ETL job development system of claim 2, wherein said job file is parsed by a syntax parsing tool to obtain said parsing result, said parsing result comprising a job instruction.
4. The ETL job development system of claim 3, wherein said step of rendering and dynamically compiling said parsing result comprises:
reading a predefined rendering variable;
and performing real-time rendering on the job instruction by using the rendering variable.
5. The ETL job development system of claim 4, wherein the dynamic compilation comprises dynamically compiling the rendered job instructions into executable instructions of the big data computing engine.
6. The ETL job development system of claim 5, wherein said registering with the big data computing engine comprises calling a registration interface of the big data computing engine to execute the executable instructions.
7. The ETL job development system of claim 1, wherein the system further comprises a data input subsystem for inputting externally input job data to said big data computing engine.
8. The ETL job development system of claim 7, wherein the system further comprises a data output subsystem for outputting the calculation results of said big data computing engine to an external storage.
9. The ETL job development system of claim 1, wherein the system further comprises a monitoring subsystem for monitoring the entire ETL job development process of the system.
10. A computer device comprising a memory and a processor, wherein the memory stores a computer program, and the processor, when executing the computer program, implements the functions of the big data computing engine based ETL job development system or of each subsystem according to any one of claims 1 to 9.
CN202110590257.7A 2021-05-28 2021-05-28 ETL job development system and computer equipment based on big data calculation engine Pending CN113204593A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110590257.7A CN113204593A (en) 2021-05-28 2021-05-28 ETL job development system and computer equipment based on big data calculation engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110590257.7A CN113204593A (en) 2021-05-28 2021-05-28 ETL job development system and computer equipment based on big data calculation engine

Publications (1)

Publication Number Publication Date
CN113204593A true CN113204593A (en) 2021-08-03

Family

ID=77023473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110590257.7A Pending CN113204593A (en) 2021-05-28 2021-05-28 ETL job development system and computer equipment based on big data calculation engine

Country Status (1)

Country Link
CN (1) CN113204593A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230153086A1 (en) * 2021-11-12 2023-05-18 Bank Of America Corporation System and method for performing parallel and distributed analysis of program code to generate dependency graphs for executing extract transform load transformations
US11842176B2 (en) * 2021-11-12 2023-12-12 Bank Of America Corporation System and method for performing parallel and distributed analysis of program code to generate dependency graphs for executing extract transform load transformations
CN114385137A (en) * 2021-12-29 2022-04-22 武汉达梦数据库股份有限公司 Flow method and device for running ETL (extract transform load) by Flink framework


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination