CN114691766A

CN114691766A - Data acquisition method and device and electronic equipment

Info

Publication number: CN114691766A
Application number: CN202011612552.XA
Authority: CN
Inventors: 薛星海
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2022-07-01

Abstract

The invention discloses a data acquisition method, a data acquisition device and electronic equipment, which are applied to the field of data processing. The method comprises the following steps: acquiring a target ETL task and a target configuration item for specifying a target execution engine; converting the target ETL task into a standard job configuration; generating an application program package suitable for being executed by the target execution engine according to the target configuration item; and submitting the application program package to the target service cluster according to the target configuration item, so that the target service cluster executes the application program package by using the target execution engine to run the ETL data acquisition job. The invention improves the data acquisition efficiency and reduces the complexity of user operation.

Description

Data acquisition method and device and electronic equipment

Technical Field

The present invention relates to the field of data processing, and in particular, to a data acquisition method and apparatus, and an electronic device.

Background

ETL (Extract-Transform-Load) is the main means by which a user extracts required data from a data source, cleans it, and finally loads it into a data warehouse according to a predefined data warehouse model. ETL data acquisition is the process of loading data from business systems into a data warehouse after extraction, cleaning and conversion. It aims to integrate data that is scattered, disordered and inconsistent in its standards across an enterprise, provides a basis for analysis in enterprise decision-making, and is an important link in a BI (business intelligence) project.

Existing ETL data collection tools each integrate an execution engine internally, so the functions of the tool are bound to that execution engine and the execution engine cannot be switched. If the current execution engine is not suitable for a certain data source, the whole ETL data collection tool must be replaced in order to collect data from that data source. In situations such as ETL operations on multi-source heterogeneous data, a user therefore has to use a plurality of ETL data collection tools for the different data sources, so the data acquisition efficiency is low, and switching between different ETL tools is both difficult and costly for the user.

Disclosure of Invention

The embodiment of the invention provides a data acquisition method, a data acquisition device and electronic equipment, and solves the technical problems of low data acquisition efficiency and high operation difficulty of a user in the prior art.

In a first aspect, the present invention provides a data acquisition method according to an embodiment of the present invention, including:

acquiring a target ETL task and a target configuration item for specifying a target execution engine;

converting the target ETL task into a standard job configuration;

generating an application program package suitable for the target execution engine to execute according to the target configuration item;

and submitting the application program package to a target service cluster according to the target configuration item, so that the target service cluster executes the application program package by using the target execution engine to run an ETL data acquisition job.

Optionally, the converting the target ETL task into a standard job configuration includes: calling a pre-created configuration conversion plug-in to convert the target ETL task into the standard job configuration; and the generating an application package suitable for the target execution engine to execute according to the target configuration item includes: calling a pre-created program generation plug-in to generate the application package according to the target configuration item.

Optionally, the submitting the application package to the target service cluster according to the target configuration item includes:

calling a pre-created submission plug-in, and submitting the application package to the target service cluster.

Optionally, the submitting the application package to the target service cluster according to the target configuration item includes:

determining a target starting command matched with the target configuration item based on a preset mapping relation between each configuration item and the starting command;

and submitting the application program package to the target service cluster by using the target starting command so as to enable the target service cluster to start the target execution engine by using the target starting command and execute the application program package by using the started target execution engine.

Optionally, there are multiple candidate execution engines distributed over different service clusters or the same service cluster, the target execution engine belonging to one of the multiple candidate execution engines.

Optionally, an engine configuration interface including a plurality of configuration items is preset, and the obtaining of the target configuration item for specifying the target execution engine includes:

and acquiring a configuration item selected by a user as the target configuration item by using the engine configuration interface, wherein the target configuration item is used for appointing one execution engine which is adaptive to the target ETL task from the plurality of candidate execution engines as the target execution engine.

Optionally, the application package includes: any one of a spark application package, a flink application package, and a datax application package.

In a second aspect, an embodiment of the present invention provides a data acquisition apparatus, including:

a data acquisition unit for acquiring a target ETL task and a target configuration item for specifying a target execution engine;

a configuration conversion unit for converting the target ETL task into a standard job configuration;

the program generating unit is used for generating an application program package suitable for the target execution engine to execute according to the target configuration item;

and the submitting unit is used for submitting the application program package to a target service cluster according to the target configuration item so that the target service cluster executes the application program package by using the target execution engine to run ETL data acquisition operation.

In a third aspect, an embodiment of the present invention provides an electronic device, including: memory, processor and code stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of the embodiments of the first aspect when executing the code.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method according to any one of the embodiments of the first aspect.

One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:

The acquired target ETL task is converted into a standard job configuration; an application program package suitable for being executed by the target execution engine is generated according to the target configuration item; and the application program package is submitted to the target service cluster according to the target configuration item, so that the target service cluster starts the target execution engine to execute the application program package and run the ETL data acquisition job. Therefore, a user only needs to change the configuration item to select the target execution engine to be actually used, and an application program package suitable for that execution engine is generated. The execution engines of the ETL tool are thereby extended, and different execution engines can be switched flexibly. A suitable execution engine can then be selected to run the ETL job according to the ETL task and the framework advantages of the different execution engines. The user does not need to learn a variety of different ETL tools, the switching is transparent to the user, and the efficiency of ETL operations on multi-source data is improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a data acquisition method according to an embodiment of the present invention;

FIG. 2 is a functional block diagram of a data acquisition device according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to solve the technical problems of low data acquisition efficiency and high user operation difficulty in the prior art, the embodiment of the invention provides a data acquisition method, which has the following general idea:

The acquired target ETL task is converted into a standard job configuration; an application program package suitable for the target execution engine to execute is generated according to the target configuration item; and the application program package is submitted to the target service cluster according to the target configuration item, so that the target service cluster starts the target execution engine to execute the application program package and run the ETL data acquisition job.

This technical scheme achieves low coupling between the upper-layer functions of the ETL job and the underlying execution engine. A user only needs to change the configuration item to select the target execution engine to be actually used, and an application program package suitable for that execution engine is generated. The execution engines of the ETL tool are thereby extended, and different execution engines can be switched flexibly. A suitable execution engine can then be selected to run the ETL job according to the ETL task and the framework advantages of the different execution engines. The user does not need to learn a variety of ETL tools, the switching is transparent to the user, and the efficiency of ETL operations on multi-source data is improved.

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

First, it is stated that the term "and/or" appearing herein merely describes an association between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

The data acquisition method provided by the embodiment of the invention is implemented with a single ETL data acquisition tool. Specifically, the ETL data acquisition tool comprises a business layer and execution engines, and is extended with a plurality of execution engines. The embodiment of the invention therefore implements the following data acquisition method with one ETL data acquisition tool, without using a plurality of different ETL data acquisition tools.

Referring to fig. 1, a method for implementing data acquisition according to an embodiment of the present invention includes the following steps:

S101: Acquire a target ETL task and a target configuration item for specifying a target execution engine.

The target ETL task may specifically be to perform data acquisition on data sources such as log files and streaming data. If the target ETL task is created by a user operation, the target ETL task created by the user is obtained, and a read operation (reader), a transform operation (transformer), a write operation (writer), and the like are configured in the created target ETL task. The configuration of the read operation specifies the data source from which data is extracted, namely, from which data warehouse the data is read. The configuration of the transform operation specifies the conversion content; for example, at least one of a plurality of conversion operations such as filtering, aggregation, sorting, field mapping and de-duplication can be configured. The configuration of the write operation specifies the target source to which data is written, namely, to which data warehouse the converted data is written.
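By way of non-limiting illustration, the three parts of a target ETL task described above could be held in a structure such as the following. The class name `EtlTask`, its field names, and the example source and target values are assumptions made for this sketch and are not defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class EtlTask:
    """Hypothetical container for a user-created ETL task."""
    reader: Dict[str, Any]   # data source the data is extracted from
    writer: Dict[str, Any]   # target source the converted data is written to
    # Ordered conversion operations (filtering, de-duplication, ...).
    transformers: List[Dict[str, Any]] = field(default_factory=list)


# Example values are invented purely for illustration.
task = EtlTask(
    reader={"type": "mysql", "table": "orders"},
    writer={"type": "hive", "table": "dw.orders"},
    transformers=[
        {"op": "filter", "condition": "amount > 0"},
        {"op": "dedup", "keys": ["order_id"]},
    ],
)
```

Keeping the conversion operations as an ordered list reflects that several of them may be configured on one task.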

The ETL data acquisition tool used in the embodiment of the invention is provided with a visual interface for a user to create an ETL task, so that the user can perform the operations of creating the target ETL task on the visual interface. Further, the visual interface comprises an engine configuration interface, and the configuration item selected by the user is obtained as the target configuration item by using the engine configuration interface, wherein the target configuration item is used for designating, from a plurality of candidate execution engines, one execution engine matched with the target ETL task as the target execution engine.

Specifically, an engine configuration interface is preset, and the engine configuration interface includes a plurality of configuration items, which correspond one to one to the execution engines of the ETL data collection tool. The configuration items of the engine configuration interface may include datax, flink, spark and the like, so that a user can select a configuration item through the engine configuration interface to designate one of the datax execution engine, the flink execution engine and the spark execution engine as the target execution engine.
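As a minimal sketch of the one-to-one correspondence between configuration items and execution engines, the selected item can be resolved against a closed set of candidates. The names `Engine` and `select_engine` are hypothetical, and the case-insensitive matching is an assumption of this sketch.

```python
from enum import Enum


class Engine(Enum):
    """Candidate execution engines, one per configuration item."""
    SPARK = "spark"
    FLINK = "flink"
    DATAX = "datax"


def select_engine(config_item: str) -> Engine:
    """Resolve the configuration item chosen by the user to an engine."""
    try:
        return Engine(config_item.strip().lower())
    except ValueError:
        raise ValueError(
            f"unknown engine configuration item: {config_item!r}"
        ) from None
```

For example, `select_engine("flink")` yields `Engine.FLINK`, while an item outside the preset list is rejected instead of silently falling back to a default engine.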

It should be noted that, in the embodiment of the present invention: Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams, and is designed to run in all common cluster environments and to perform computation at in-memory speed and at any scale; Spark is a distributed, open-source processing system for big data workloads, which uses in-memory caching and optimized query execution and can run fast analytic queries against data of any scale; DataX is an open-source ETL acquisition framework that supports data synchronization between various data sources and target sources.

S102: the target ETL task is converted to a standard job configuration.

In the ETL data collection tool used in the embodiment of the present invention, a configuration conversion plug-in is created in advance, the pre-created configuration conversion plug-in is called, and the target ETL task is converted into a standard job configuration, specifically, the method includes converting each operation configured in the target ETL task, for example, converting configurations of a read operation (reader), a conversion operation (transformer), and a write operation (writer), so as to obtain the standard job configuration. The standard job configuration includes configuration contents of standard formats for a read operation (reader), a transform operation (transformer), and a write operation (writer). The standard job configuration is the basis of switching the execution engine, and as the ETL data acquisition is a universal architecture containing a data source and a target source, any ETL task can be converted into the standard job configuration in the standard format.
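A configuration conversion step of this kind might look like the following sketch, which normalizes a raw task definition into a single engine-neutral layout. The function name `to_standard_job_config` and the key names (`reader`, `transformer`, `transformers`, `writer`) are assumptions; the embodiment does not fix a concrete format.

```python
from typing import Any, Dict


def to_standard_job_config(task: Dict[str, Any]) -> Dict[str, Any]:
    """Normalize a raw ETL task into a hypothetical standard job layout."""
    if "reader" not in task or "writer" not in task:
        raise ValueError("an ETL task needs both a reader and a writer")
    # Accept a single transform operation or a list of them.
    raw = task.get("transformer", [])
    transformers = raw if isinstance(raw, list) else [raw]
    return {
        "reader": dict(task["reader"]),
        "transformers": [dict(t) for t in transformers],
        "writer": dict(task["writer"]),
    }
```

Because every task reduces to the same three keys, a later generator only has to understand this one layout, which is what makes switching engines possible.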

S103: Generate an application program package suitable for the target execution engine to execute according to the target configuration item.

The ETL data acquisition tool used in the embodiment of the invention has a pre-created program generation plug-in, and the standard job configuration is converted into an application program package suitable for being executed by the target execution engine by calling the pre-created program generation plug-in.

In specific implementation, different selected target configuration items lead to correspondingly different application packages being generated by the conversion. The generated application package may be any one of a spark application package, a flink application package, and a datax application package.

Specifically, if the target configuration item selected by the user is "spark", the standard job configuration is converted to generate a spark application package; and if the target configuration item selected by the user is 'flink', converting the standard job configuration to generate a flink application program package. If the target configuration item selected by the user is "datax", the standard job configuration is converted to generate a datax application package.
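The selection among the three kinds of package can be sketched as a lookup from configuration item to generator function. The generator bodies below are placeholders and the names `GENERATORS` and `generate_package` are hypothetical; a real program generation plug-in would emit an engine-specific artifact.

```python
from typing import Any, Callable, Dict

Job = Dict[str, Any]
Package = Dict[str, Any]


def generate_spark(job: Job) -> Package:
    # Placeholder: a real generator would build and package RDD operators.
    return {"engine": "spark", "payload": job}


def generate_flink(job: Job) -> Package:
    # Placeholder: a real generator would build an execution graph.
    return {"engine": "flink", "payload": job}


def generate_datax(job: Job) -> Package:
    # Placeholder: a real generator would write a JSON job file.
    return {"engine": "datax", "payload": job}


GENERATORS: Dict[str, Callable[[Job], Package]] = {
    "spark": generate_spark,
    "flink": generate_flink,
    "datax": generate_datax,
}


def generate_package(config_item: str, job: Job) -> Package:
    """Pick the generator that matches the target configuration item."""
    try:
        generator = GENERATORS[config_item]
    except KeyError:
        raise ValueError(
            f"no generator for configuration item {config_item!r}"
        ) from None
    return generator(job)
```

Supporting a further engine in this sketch only requires registering one more generator; the standard job configuration and the calling code stay unchanged.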

Specifically, if the spark application package is generated by conversion, the specific method is as follows: the standard job configuration is converted into RDD (Resilient Distributed Dataset) operators that can be executed by the spark execution engine, and the RDD operators obtained by the conversion are then packaged to obtain the spark application package. An RDD is composed of a plurality of partitions, and an RDD operator is a function that acts on and computes over those partitions.

If the flink application package is generated by conversion, the specific method is as follows: the standard job configuration is converted into an execution graph structure that can be executed by the flink execution engine, and the converted execution graph structure is then packaged to obtain the flink application package.

If the datax application package is generated by conversion, the specific method is as follows: the standard job configuration is converted into a configuration file in the standard data source configuration format suitable for the datax execution engine, specifically the json format, and the converted configuration file is passed to datax as a parameter.
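The idea of turning configured conversion operations into functions that are applied partition by partition can be illustrated with a plain-Python stand-in. This sketch does not use the Spark API; the operation names `filter` and `rename` and their parameters are assumptions chosen for illustration.

```python
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]
PartitionFn = Callable[[Iterator[Row]], Iterator[Row]]


def build_operator(t: Dict[str, Any]) -> PartitionFn:
    """Turn one configured transform into a function over a partition."""
    if t["op"] == "filter":
        name, minimum = t["field"], t["min"]
        return lambda rows: (r for r in rows if r[name] >= minimum)
    if t["op"] == "rename":
        src, dst = t["from"], t["to"]
        return lambda rows: (
            {(dst if k == src else k): v for k, v in r.items()} for r in rows
        )
    raise ValueError(f"unsupported transform: {t['op']!r}")


def run(partitions: List[List[Row]],
        transformers: List[Dict[str, Any]]) -> List[List[Row]]:
    """Apply the operator chain to every partition independently."""
    operators = [build_operator(t) for t in transformers]
    result = []
    for partition in partitions:
        rows: Iterator[Row] = iter(partition)
        for operator in operators:
            rows = operator(rows)
        result.append(list(rows))
    return result
```

Each partition is processed without reference to the others, which is the property that lets a distributed engine spread the work across service nodes.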
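For the datax case, the conversion to a JSON configuration file might be sketched as follows. The nesting `job` / `setting` / `content` with `reader` and `writer` entries follows the commonly used DataX job layout; the function name `to_datax_job`, the `channels` parameter, and the assumption that the standard job configuration already carries DataX plug-in names are hypothetical. Transform operations are omitted from this sketch.

```python
import json
from typing import Any, Dict


def to_datax_job(job: Dict[str, Any], channels: int = 1) -> str:
    """Render a standard job configuration as a DataX-style JSON string."""
    datax_job = {
        "job": {
            "setting": {"speed": {"channel": channels}},
            "content": [
                {
                    "reader": {
                        "name": job["reader"]["name"],
                        "parameter": job["reader"].get("parameter", {}),
                    },
                    "writer": {
                        "name": job["writer"]["name"],
                        "parameter": job["writer"].get("parameter", {}),
                    },
                }
            ],
        }
    }
    return json.dumps(datax_job, indent=2)
```

The returned string can be written to a file whose path is then handed to the engine as its parameter.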

Through step S103, an application package suitable for execution by the target execution engine is generated.

S104: Submit the application program package to the target service cluster according to the target configuration item, so that the target service cluster executes the application program package by using the target execution engine to run the ETL data acquisition job.

In step S104, specifically, the pre-created submission plug-in is called to submit the application package to the target service cluster, and the application package is executed on the target service cluster. The target service cluster refers to the service cluster where the target execution engine is located.

The start commands required for different candidate execution engines are different, so that with different defined start commands the respective execution engine can be started. The start command of each execution engine defines the relevant information corresponding to the execution engine to be started.

In an alternative embodiment, the mapping relationship between each configuration item and the start command may be created in advance. The mapping relationship is used to submit the generated application package to the appropriate location and have it executed by the appropriate execution engine.

For example, the "spark" configuration item maps to a first start command; the "flink" configuration item maps to a second start command; and the "datax" configuration item maps to a third start command. According to the target configuration item selected by the user, the start command for starting the target execution engine can be matched.

Specifically, the submitting step includes: determining a target start command matched with the target configuration item based on the preset mapping relationship between each configuration item and the start command; and submitting the application program package to the target service cluster by using the target start command, so that the target service cluster starts the target execution engine by using the target start command and executes the application program package by using the started target execution engine.
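The preset mapping between configuration items and start commands can be sketched as a table of command templates. The concrete command lines below are assumptions for illustration only; the embodiment speaks merely of a first, second and third start command, and the names `START_COMMANDS` and `resolve_start_command` are hypothetical. The sketch only builds the command and does not execute it.

```python
from typing import Dict, List

# Hypothetical start-command templates, one per configuration item.
START_COMMANDS: Dict[str, List[str]] = {
    "spark": ["spark-submit", "--master", "yarn", "{package}"],
    "flink": ["flink", "run", "{package}"],
    "datax": ["python", "datax.py", "{package}"],
}


def resolve_start_command(config_item: str, package_path: str) -> List[str]:
    """Match the target configuration item to its start command."""
    template = START_COMMANDS.get(config_item)
    if template is None:
        raise ValueError(f"no start command mapped for {config_item!r}")
    # Substitute the package path into the template's placeholder.
    return [part.format(package=package_path) for part in template]
```

Keeping the commands as argument lists rather than shell strings avoids quoting problems if the package path contains spaces.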

Certainly, in specific implementation, the application package and the target configuration item may also be directly submitted to the target service cluster, and a target start command is matched in the target service cluster based on the target configuration item, so as to start a target execution engine for executing the application package by using the target start command.

In a specific implementation, the multiple candidate execution engines may be distributed in different service clusters, or may be distributed in the same service cluster. For example, the spark execution engine is in service cluster A (service nodes 1, 2, 3, 4), the flink execution engine is in service cluster B (service nodes 3, 4, 6, 7), and the datax execution engine is in service cluster B (service nodes 5, 6, 7, 8).

In an optional embodiment, if the multiple candidate execution engines are distributed in the same service cluster, the application package is submitted, based on a fixed IP address, to the service cluster corresponding to that IP address.

In an optional embodiment, if the multiple candidate execution engines are distributed in different service clusters, a mapping relationship between each configuration item and the IP address of the service cluster where the corresponding execution engine is located is created in advance, and the target IP address of the service cluster where the target execution engine is located is determined from this mapping relationship. The application program package and the target start command are then submitted to the determined target IP address, so that they reach the service cluster where the target execution engine is located. After receiving the target start command and the application package, the service cluster starts the target execution engine by using the target start command.
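The two addressing variants described above, a fixed IP address for a shared cluster and a per-engine lookup for separate clusters, might be combined as in the following sketch. The addresses are taken from the reserved documentation range and, like the names `ENGINE_ADDRESSES` and `plan_submission`, are purely hypothetical.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping from configuration item to cluster IP address.
ENGINE_ADDRESSES: Dict[str, str] = {
    "spark": "192.0.2.10",
    "flink": "192.0.2.20",
    "datax": "192.0.2.30",
}


def plan_submission(config_item: str,
                    package_path: str,
                    start_command: List[str],
                    fixed_address: Optional[str] = None) -> Dict[str, Any]:
    """Decide where the package and start command should be sent."""
    if fixed_address is not None:
        # All candidate engines share one cluster: use the fixed address.
        address = fixed_address
    elif config_item in ENGINE_ADDRESSES:
        # Engines live in different clusters: look the address up.
        address = ENGINE_ADDRESSES[config_item]
    else:
        raise ValueError(f"no cluster address mapped for {config_item!r}")
    return {
        "address": address,
        "package": package_path,
        "start_command": start_command,
    }
```

The returned plan carries both the package and the start command, mirroring the text in which both are submitted together to the target IP address.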

Based on the same inventive concept, another embodiment of the present invention provides a data acquisition apparatus, as shown in fig. 2, including:

a data acquisition unit 201 for acquiring a target ETL task and a target configuration item for specifying a target execution engine;

a configuration conversion unit 202, configured to convert the target ETL task into a standard job configuration;

a program generating unit 203, configured to generate an application package suitable for the target execution engine to execute according to the target configuration item;

a submitting unit 204, configured to submit the application package to a target service cluster according to the target configuration item, so that the target service cluster executes the application package by using the target execution engine, so as to run an ETL data collection job.
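The cooperation of the four units can be pictured as a small pipeline in which each unit is a replaceable callable. The class name `DataAcquisitionApparatus`, the field names and the stub lambdas are assumptions of this sketch, not the structure of the apparatus itself.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

Task = Dict[str, Any]
Job = Dict[str, Any]


@dataclass
class DataAcquisitionApparatus:
    """Hypothetical wiring of units 201 to 204 as pluggable callables."""
    acquire: Callable[[], Tuple[Task, str]]   # data acquisition unit 201
    convert: Callable[[Task], Job]            # configuration conversion unit 202
    generate: Callable[[str, Job], Any]       # program generating unit 203
    submit: Callable[[str, Any], Any]         # submitting unit 204

    def run(self) -> Any:
        task, config_item = self.acquire()
        job = self.convert(task)
        package = self.generate(config_item, job)
        return self.submit(config_item, package)


# Stub units, invented for illustration, that only pass values along.
apparatus = DataAcquisitionApparatus(
    acquire=lambda: ({"reader": {}, "writer": {}}, "flink"),
    convert=lambda task: {"standard": task},
    generate=lambda item, job: {"engine": item, "job": job},
    submit=lambda item, package: f"submitted {package['engine']} package",
)
outcome = apparatus.run()
```

Only the `generate` and `submit` callables depend on the configuration item, which reflects the low coupling between the upper-layer functions and the execution engine.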

In an optional implementation manner, the configuration conversion unit 202 is specifically configured to: call a pre-created configuration conversion plug-in to convert the target ETL task into the standard job configuration; and the program generating unit 203 is specifically configured to: call a pre-created program generation plug-in to generate the application program package according to the target configuration item.

In an optional implementation manner, the submission unit 204 is specifically configured to: and calling a pre-created submission plug-in to submit the application package to the target service cluster.

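In an optional implementation manner, the submission unit 204 is specifically configured to: determine a target starting command matched with the target configuration item based on a preset mapping relation between each configuration item and the starting command; and submit the application program package to the target service cluster by using the target starting command, so that the target service cluster starts the target execution engine by using the target starting command and executes the application program package by using the started target execution engine.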

In an alternative embodiment, there are multiple candidate execution engines distributed in different service clusters or the same service cluster, and the target execution engine belongs to one of the multiple candidate execution engines.

In an optional embodiment, an engine configuration interface including a plurality of configuration items is preset, and the data obtaining unit 201 is specifically configured to:

and acquiring a configuration item selected by a user as the target configuration item by using the engine configuration interface, wherein the target configuration item is used for appointing one execution engine which is adapted to the target ETL task from the plurality of candidate execution engines as the target execution engine.

In an optional embodiment, the application package includes: any one of a spark application package, a flink application package, and a datax application package.

Based on the same inventive concept, the embodiment of the invention provides electronic equipment. Referring to FIG. 3, an electronic device according to an embodiment of the present invention includes: a memory 301, a processor 302 and code 303 stored on the memory and executable on the processor, the processor implementing any of the foregoing embodiments of the data acquisition method when executing the code.

In FIG. 3, a bus architecture is represented by bus 300. Bus 300 may include any number of interconnected buses and bridges, and links together various circuits including one or more processors, represented by processor 302, and memory, represented by memory 301. The bus 300 may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface 306 provides an interface between the bus 300 and the receiver 301 and transmitter 303. The receiver 301 and the transmitter 303 may be the same element, i.e., a transceiver, providing a means for communicating with various other apparatus over a transmission medium. The processor 302 is responsible for managing the bus 300 and general processing, and the memory 301 may be used for storing data used by the processor 302 in performing operations.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

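1. A method of data acquisition, comprising:

acquiring a target ETL task and a target configuration item for specifying a target execution engine;

converting the target ETL task into a standard job configuration;

generating an application program package suitable for the target execution engine to execute according to the target configuration item;

and submitting the application program package to a target service cluster according to the target configuration item, so that the target service cluster executes the application program package by using the target execution engine to run an ETL data acquisition job.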

2. The method of claim 1, wherein:

the converting the target ETL task into a standard job configuration comprises: calling a pre-created configuration conversion plug-in to convert the target ETL task into the standard job configuration;

the generating an application package suitable for the target execution engine to execute according to the target configuration item comprises: calling a pre-created program generation plug-in to generate the application program package according to the target configuration item.

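3. The method of claim 2, wherein said submitting the application package to a target service cluster according to the target configuration item comprises:

calling a pre-created submission plug-in, and submitting the application package to the target service cluster.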

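4. The method of claim 1 or 3, wherein said submitting said application package to a target service cluster according to said target configuration item comprises:

determining a target starting command matched with the target configuration item based on a preset mapping relation between each configuration item and the starting command;

and submitting the application program package to the target service cluster by using the target starting command, so that the target service cluster starts the target execution engine by using the target starting command and executes the application program package by using the started target execution engine.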

5. The method of claim 1, wherein there are multiple candidate execution engines distributed across different service clusters or the same service cluster, the target execution engine belonging to one of the multiple candidate execution engines.

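6. The method of claim 5, wherein an engine configuration interface containing a plurality of configuration items is preset, and the obtaining of the target configuration item for specifying the target execution engine comprises:

acquiring a configuration item selected by a user as the target configuration item by using the engine configuration interface, wherein the target configuration item is used for designating, from the plurality of candidate execution engines, one execution engine adapted to the target ETL task as the target execution engine.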

7. The method of claim 1 or 2, wherein the application package comprises: any one of a spark application package, a flink application package, and a datax application package.

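8. A data acquisition device, comprising:

a data acquisition unit for acquiring a target ETL task and a target configuration item for specifying a target execution engine;

a configuration conversion unit for converting the target ETL task into a standard job configuration;

a program generating unit for generating an application program package suitable for the target execution engine to execute according to the target configuration item;

and a submitting unit for submitting the application program package to a target service cluster according to the target configuration item, so that the target service cluster executes the application program package by using the target execution engine to run an ETL data acquisition job.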

9. An electronic device, comprising: memory, processor and code stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1-7 when executing the code.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.