CN115809294A - Rapid ETL method based on Spark SQL temporary view - Google Patents
- Publication number
- CN115809294A (application number CN202211580856.1A)
- Authority
- CN
- China
- Prior art keywords
- node
- target
- etl
- sql
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a fast ETL method, apparatus, device and medium based on Spark SQL temporary views. An SQL statement is constructed for each ETL step, the flow nodes and the target flow DAG graph of the whole ETL process are built from these SQL statements, the temporary view of each flow node is then created in Spark following the topological sequence of the target flow DAG graph, and the target data are output to a target database. Compared with the prior art, the method integrates Spark SQL with the DAG graph used in the ETL process, so that the DAG graph is finally expressed as a single, complete, full-flow Spark SQL job that runs on the Spark platform and realizes the basic functions of an ETL tool, improving development efficiency. Moreover, the data processing action is executed only after the last temporary view of the ETL process has been constructed, which greatly saves hardware resources and improves ETL efficiency.
Description
Technical Field
The invention relates to the technical field of data warehouses, and in particular to a fast ETL method, apparatus, device and medium based on Spark SQL temporary views.
Background
In data middle-platform construction, the ETL tool is a core component for building a data warehouse. When an ETL process is developed, a WEB graphical interface is usually adopted: the nodes of each step are assembled into a DAG (directed acyclic graph), the data processing program parses the DAG graph into a logical storage structure of node information, connection information and the like, the different node types are used to load the corresponding node components, the node components contain the operation logic, and operations such as data reading, cleaning and conversion, and data loading are completed through these node components.
In the usual construction, the DAG graph is traversed from top to bottom, and the result data of each node operation in the DAG is stored in a distributed file system or a database; when a downstream node computes, it reads the computation result from the upstream node's disk, performs the computation of the current node, and after the computation finishes writes the result data back to disk for use by its downstream nodes. The advantage of this approach is that, after the flow has run, the execution result of any node can be previewed, or any node in the flow can be taken as a root node so that only the downstream node operations are executed.
However, the disadvantage of this approach is that every node goes through a cycle of reading data over the network, computing, and writing result data to disk. This may be acceptable when the data volume is small or the processing flow is short, but when the flow is long and the data volume is large it greatly slows down data processing. It also largely destroys the advantage of in-memory computation when Spark is used as the underlying engine for ETL data processing. A new ETL processing method is therefore needed to solve the problem of low efficiency when performing ETL processing based on a DAG graph as described above.
Disclosure of Invention
In view of the above, it is necessary to provide a fast ETL method, apparatus, device and medium based on Spark SQL temporary view, so as to solve the problem of low efficiency when performing ETL processing based on DAG graph in the prior art.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a fast ETL method based on Spark SQL temporary view, including:
according to the target ETL step, an SQL statement corresponding to the ETL step is constructed, and a process node corresponding to the ETL step is established based on the SQL statement;
constructing a target flow DAG graph according to the flow nodes;
traversing the target flow DAG graph to obtain a topological sequence of the plurality of flow nodes;
and establishing a temporary view corresponding to the process node according to the SQL statement sequentially through Spark based on the topological sequence, processing data according to the finally obtained temporary view, and outputting target data to a target database.
Further, the constructing an SQL statement corresponding to the ETL step according to the target ETL step, and establishing a process node corresponding to the ETL step based on the SQL statement, includes:
establishing a node ID of a target node, and establishing a view name of the target node according to the node ID of the target node;
establishing a node type according to the ETL step;
constructing an SQL statement of the target node according to the action content of the target ETL step and the view name of the target node;
and constructing the target node according to the node ID of the target node, the node type and the SQL statement of the target node.
Further, the step of constructing an SQL statement of the target node according to the action content of the target ETL step and the view name of the target node includes:
acquiring a view name of an upstream node of the target node;
and constructing the SQL statement of the target node according to the action content of the target ETL step, the view name of the upstream node of the target node and the view name of the target node.
Further, the constructing the target node according to the node ID of the target node, the node type, and the SQL statement of the target node includes:
establishing JDBC parameters according to the target ETL step;
and constructing the target node according to the node ID of the target node, the node type, the SQL statement of the target node and the JDBC parameter.
Furthermore, the process nodes comprise a data reading node, a data processing node and a data output node; the establishing, based on the topological sequence, a temporary view corresponding to the process node according to the SQL statement sequentially by Spark, processing data according to the finally obtained temporary view, and outputting target data to a target database includes:
constructing a dataframe data set according to the data reading nodes, and registering the dataframe data set into Spark SQL according to SQL statements corresponding to the data reading nodes to obtain a data reading temporary view;
according to the data reading view and the topological sequence, sequentially constructing a data processing temporary view corresponding to each processing node according to the SQL statement of each data processing node;
and processing data based on the SQL statements corresponding to the data output nodes according to the finally obtained data processing temporary view, and outputting target data to a target database.
Further, the constructing a temporary data processing view corresponding to each processing node according to the SQL statement of each data processing node in sequence based on the topology sequence includes:
acquiring a temporary view of an upstream node of the data processing node according to the topological sequence;
based on the temporary view of the upstream node of the data processing node, constructing the temporary view of the data processing according to the SQL statement corresponding to the data processing node;
wherein the upstream view of the first data processing node in the topology sequence is the data reading temporary view.
Further, the traversing the target flow DAG graph to obtain a topological sequence of the flow nodes includes:
establishing a stack, and stacking the flow nodes with zero out degree in the target flow DAG graph;
acquiring a flow node of a stack top, and judging whether the flow node of the stack top has an upstream node or not, if so, stacking the upstream node of the flow node of the stack top, and if not, popping the flow node of the stack top;
and obtaining the topology sequence according to the pop sequence of the process nodes.
In a second aspect, the invention also provides a fast ETL apparatus based on Spark SQL temporary view, comprising:
a node constructing module, configured to construct, according to the target ETL step, an SQL statement corresponding to the ETL step, and establish a flow node corresponding to the ETL step based on the SQL statement;
a DAG graph constructing module, configured to construct a target flow DAG graph according to the flow nodes;
a sequencing module, configured to traverse the target flow DAG graph to obtain a topological sequence of the plurality of flow nodes;
and a data processing module, configured to establish, through Spark and based on the topological sequence, the temporary view corresponding to each flow node according to its SQL statement, process data according to the finally obtained temporary view, and output target data to a target database.
In a third aspect, the present invention also provides an electronic device comprising a memory and a processor, wherein,
a memory for storing a program;
and the processor is coupled with the memory and used for executing the program stored in the memory so as to realize the steps in the Spark SQL temporary view-based fast ETL method in any one of the implementation manners.
In a fourth aspect, the present invention further provides a computer-readable storage medium, configured to store a computer-readable program or instruction, where the program or instruction, when executed by a processor, can implement the steps in the Spark SQL temporary view-based fast ETL method in any of the above-described implementation manners.
According to the fast ETL method, apparatus, device and medium based on Spark SQL temporary views, the SQL statement of each ETL step is constructed from the specific action of the target ETL step, the flow nodes and the target flow DAG graph of the whole ETL process are constructed according to these SQL statements, the temporary view of each flow node is then constructed in Spark in the order of the topological sequence of the target flow DAG graph, the data are processed according to the finally obtained temporary view, and the target data are output to the target database. Compared with the prior art, the method integrates Spark SQL with the DAG graph used in the ETL process, so that the DAG graph is finally expressed as a complete, full-flow Spark SQL job that runs on the Spark platform, realizes the basic functions of an ETL tool, and improves development efficiency.
Drawings
FIG. 1 is a flowchart of a method of an embodiment of a fast ETL method based on Spark SQL temporary view provided by the present invention;
FIG. 2 is a flowchart of a method of step S101 in FIG. 1 according to an embodiment;
FIG. 3 is a schematic diagram illustrating the correspondence between the logical graph description and the business graph description of the target flow DAG graph in step S102 in FIG. 1;
FIG. 4 is a flowchart of a method of step S104 in FIG. 1 according to an embodiment;
FIG. 5 is a diagram illustrating specific SQL of each flow node in the step S104 in FIG. 1;
fig. 6 is a schematic structural diagram of an embodiment of the fast ETL device based on Spark SQL temporary view according to the present invention;
fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate preferred embodiments of the invention and together with the description, serve to explain the principles of the invention and not to limit the scope of the invention.
Before explaining in detail, some terms in this document are explained first:
ETL: an abbreviation of Extract-Transform-Load, used to describe the process of extracting (Extract), transforming (Transform) and loading (Load) data from a source end to a destination end.
Spark: spark generally refers to Apache Spark, is a fast general computing engine specially designed for large-scale data processing, is an open-source Hadoop MapReduce-like general parallel framework, has the advantages of Hadoop MapReduce, but is different from MapReduce in that a Job intermediate output result can be stored in a memory, so that HDFS reading and writing are not needed any more, and Spark can be better suitable for algorithms of MapReduce which need iteration, such as data mining, machine learning and the like.
Spark SQL: spark SQL is a Spark module that sparks uses for structured data processing.
View: a view is a virtual table built on the result set of an SQL statement, that is, a virtual table derived from one or more tables whose contents are defined by a query. The temporary view in the present invention refers to a temporary view in Spark.
It is understood that there are other terms related to Spark (e.g., DataFrame) that are known to those skilled in the art and are therefore not described here in greater detail.
A DAG graph: a DAG graph is a directed acyclic graph, i.e., a directed graph without cycles. In the various embodiments herein, a DAG graph refers specifically to the target flow DAG graph. Similarly, other terms related to DAG graphs (such as topological ordering, upstream nodes and downstream nodes) are known in the art and are not described here.
In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
When Spark SQL is used as the underlying ETL framework, the invention integrates Spark SQL with the directed acyclic graph of the ETL, so that the directed acyclic graph is finally expressed as a complete, full-flow Spark SQL job that runs on the Spark platform and realizes the basic functions of an ETL tool. Specifically, the nodes in the DAG graph used in the ETL process are expressed as SQL, which is finally output to the specified table or file structure through the data output node. In addition, the Spark SQL expressed by the individual nodes is connected in series in a reasonable way, a Spark DataFrame table model structure is finally formed through the temporary views, and the table data is output to different storage platforms by virtue of Spark's capabilities.
The invention provides a fast ETL method, a device, equipment and a storage medium based on Spark SQL temporary views, which are respectively explained below.
Referring to fig. 1, a specific embodiment of the present invention discloses a fast ETL method based on Spark SQL temporary view, which includes:
S101, according to a target ETL step, constructing an SQL statement corresponding to the ETL step, and establishing a flow node corresponding to the ETL step based on the SQL statement;
S102, constructing a target flow DAG graph according to the flow nodes;
S103, traversing the target flow DAG graph to obtain a topological sequence of the plurality of flow nodes;
S104, establishing, through Spark and based on the topological sequence, the temporary view corresponding to each flow node according to its SQL statement, processing data according to the finally obtained temporary view, and outputting target data to a target database.
According to the fast ETL method, apparatus, device and medium based on Spark SQL temporary views described above, the SQL statement of each ETL step is constructed from the specific action of the target ETL step, the flow nodes and the target flow DAG graph of the whole ETL process are constructed according to these SQL statements, the temporary view of each flow node is then constructed in Spark in the order of the topological sequence of the target flow DAG graph, the data are processed according to the finally obtained temporary view, and the target data are output to the target database. Compared with the prior art, the method integrates Spark SQL with the DAG graph used in the ETL process, so that the DAG graph is finally expressed as a complete, full-flow Spark SQL job that runs on the Spark platform, realizes the basic functions of an ETL tool, and improves development efficiency.
It should be noted that steps S101 and S102, and steps S103 and S104, may be performed in an interleaved manner: for example, when a flow node is constructed, the target flow DAG graph may be built at the same time, and when the topological sequence is obtained by traversal, the action corresponding to a flow node may be performed each time the order of that node is determined. The numbering above is used only as an identifier for convenience of description and does not mean that the steps must be executed in exactly that order in practice.
Specifically, as shown in fig. 2, step S101 in this embodiment, according to a target ETL step, constructs an SQL statement corresponding to the ETL step, and establishes a flow node corresponding to the ETL step based on the SQL statement, specifically including:
s201, establishing a node ID of a target node, and establishing a view name of the target node according to the node ID of the target node;
s202, establishing a node type according to the ETL step;
s203, constructing an SQL statement of the target node according to the action content of the target ETL step and the view name of the target node;
s204, constructing the target node according to the node ID of the target node, the node type and the SQL statement of the target node.
In this embodiment, the target ETL steps include data reading, data cleaning, data association and data output. Data reading refers to acquiring data from different platforms and is generally the first ETL step, while data output refers to outputting the cleaned data to a target database and is generally the last ETL step. The flow nodes constructed in this embodiment therefore include a data reading node, data processing nodes and a data output node, where the data processing nodes include a data cleaning node and a data association node.
As a preferred embodiment, when a data processing node is constructed in the above process, the constructing an SQL statement of the target node according to the action content of the target ETL step and the view name of the target node specifically includes:
acquiring a view name of an upstream node of the target node;
and constructing the SQL sentence of the target node according to the action content of the target ETL step, the view name of the upstream node of the target node and the view name of the target node.
The upstream node is the flow node corresponding to the step preceding the ETL step of the target node. It is easy to understand that when the data reading node is constructed, since it has no upstream node, its SQL statement is not constructed from the view name of an upstream node.
As a preferred embodiment, when constructing a data reading node and a data output node in the above process, constructing the target node according to the node ID of the target node, the node type, and the SQL statement of the target node specifically includes:
establishing JDBC parameters according to the target ETL step;
and constructing the target node according to the node ID of the target node, the node type, the SQL statement of the target node and the JDBC parameter.
For convenience of understanding, the present invention further provides a more specific embodiment for explaining the steps S201 to S204:
the ETL function is mostly divided into a front-end visual map editing and a back-end parsing part. The front end mostly uses a graph editor such as MxGraph or Antv x6 to edit the visual graph, and forms a describable DAG directed acyclic graph description format according to an editing result. Front-end processing can be implemented by any prior art, and therefore the invention will not discuss the design of the front-end herein for the time being.
The Spark computation at the back end needs to parse the DAG directed acyclic graph produced by the front end. Generally, the interaction format uses the common JSON format to describe the graph, and describing a DAG directed acyclic graph with JSON is fairly simple. This embodiment focuses on how to describe SQL in the nodes of the DAG directed acyclic graph; the specific process is as follows:
designing a data reading node format: the JDBC _ config parameter (i.e., the JDBC parameter in the foregoing text) is designed to be used by Spark to construct JDBC connections and to load the data of the relational database data table into the memory. And then, designing SQL parameters, wherein the parameters are expressed by Spark SQL and are used for carrying out field filtering on a data table in the memory after the data are loaded into the memory, and eliminating some irrelevant fields to reduce the memory occupation. The NODE _ ID field (i.e., NODE ID) is designed for reconstructing the memory data table as a temporary View prefixed by View _ followed by the actual value of the upper NODE ID, e.g., (View _ NODE _ 01). Meanwhile, the node _ id field is also used for association between nodes, and upstream and downstream relations are constructed, namely edges. The finally designed data reading node includes a node _ id field, a node _ type field (node type), a table _ name field, an SQL statement field, and a jdbc _ config field.
Designing the data cleaning node format: the SQL of the data cleaning node depends on its upstream node, so the SQL statement is designed around the temporary view name of the upstream node; Spark SQL queries the upstream node's temporary view and converts the data format through built-in functions (for example: SELECT substring(DEPT_ID, 0) AS DEPT_ID FROM VIEW_NODE_01), and a WHERE clause is then appended to express the data cleaning condition, completing the cleaning of the data. This statement is used by Spark to register the final spliced SQL as a temporary view, named with the prefix VIEW_ plus the actual value of the current node's ID, e.g. VIEW_NODE_03. The finally designed data cleaning node includes a node_id field, a node_type field (node type) and an SQL statement field.
Designing the data association node format: the SQL of a data association node depends on the views of its two upstream nodes; for the case of two upstream nodes, the two upstream view nodes are usually merged using the LEFT JOIN, INNER JOIN or RIGHT JOIN keywords in SQL (for example: SELECT * FROM VIEW_NODE_03 LEFT JOIN VIEW_NODE_02 ON VIEW_NODE_03.DEPT_ID = VIEW_NODE_02.DEPT_ID). Finally, the SQL of the data association node is registered as a temporary view named with the prefix VIEW_ plus the actual value of the current node's ID, e.g. VIEW_NODE_04. The finally designed data association node includes a node_id field, a node_type field (node type) and an SQL statement field.
Designing the data output node format: the data output node needs a jdbc_config parameter for Spark to build the database connection and output the data through this connection to the destination data table (note: for other types of databases, e.g. when currently outputting to MongoDB, a mongodb_config parameter needs to be designed for building the MongoDB connection). The designed SQL reads the temporary view data (for example: SELECT * FROM VIEW_NODE_04); the fields of the upstream node can be filtered through this SQL, and the data is reassembled according to the SQL of the data output node and output to the data table. The finally designed data output node includes a node_id field, a node_type field (node type), a table_name field, an SQL statement field and a jdbc_config field.
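For illustration only, and not as part of the claimed format, the node and edge records described above could be sketched in Scala as follows; the field names mirror the node_id, node_type, table_name, sql and jdbc_config fields of this embodiment, while the class names and types are assumptions made here for readability.

```scala
// Hypothetical Scala representation of the node and edge formats described above.
// Field names follow the JSON design of this embodiment; everything else is illustrative.
case class EtlNode(
  nodeId: String,                                  // node_id, also the suffix of the view name, e.g. VIEW_NODE_01
  nodeType: String,                                // node_type: data reading / cleaning / association / output
  sql: String,                                     // the Spark SQL statement expressing this ETL step
  tableName: Option[String] = None,                // table_name, used by reading and output nodes
  jdbcConfig: Option[Map[String, String]] = None   // jdbc_config, used by reading and output nodes
)

// One entry of the links array: an edge from a source node to a target node.
case class EtlEdge(source: String, target: String)
```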
After (or while) the flow nodes are constructed, step S102 can be executed to construct the target flow DAG graph from the flow nodes. In the JSON representation of the above embodiment, the edges are designed under a links array; each edge is an object, and each object records the source node and the target node of that edge. For the target flow DAG graph constructed in the above process, the correspondence between its logical graph description (that is, showing only the node names and the relations between nodes) and the actual business graph description (that is, the actual ETL steps) is shown in fig. 3.
Then, step S103 may be performed. As a preferred embodiment, step S103 in this embodiment traverses the target flow DAG graph to obtain the topological sequence of the plurality of flow nodes, which specifically includes:
establishing a stack, and stacking the flow nodes with zero out degree in the target flow DAG graph;
acquiring a flow node of a stack top, judging whether the flow node of the stack top has an upstream node or not, if so, stacking the upstream node of the flow node of the stack top, and if not, popping the flow node of the stack top;
and obtaining the topology sequence according to the pop sequence of the process nodes.
The above process traverses the target flow DAG graph in a depth-first manner starting from its end; it is easy to understand that, in practice, other methods may be used to topologically sort the DAG graph and obtain the topological sequence.
Similarly, for the convenience of understanding, the present invention also provides a more specific embodiment to illustrate the traversal step of step S103 more clearly:
All nodes in the DAG graph are traversed, the nodes with an out-degree of 0 (namely the lowest-level nodes in the DAG graph) are acquired, and these nodes are pushed onto the stack in turn (note: the stack is a first-in, last-out data structure).
The nodes in the stack are then traversed in a loop. First it is judged whether the stack is empty; if there is no data in the stack, the loop is exited. Otherwise the node at the top of the stack is read (without popping it), the edges that take the current node as their destination are queried, and the source nodes (i.e., the upstream nodes) are obtained through these edges. If a source node exists, it is read and pushed onto the stack, and the next iteration begins. If no source node exists, the current node is popped, its node configuration is parsed, and the node view is constructed from its SQL content according to the node type, after which the next iteration begins. The pop order of the nodes is the topological sequence.
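As a non-authoritative sketch (assuming the EtlNode/EtlEdge shapes from the earlier sketch), the stack-based traversal just described might look as follows; the small done set is an addition made here so that an upstream node shared by several branches is only processed once.

```scala
import scala.collection.mutable

// Sketch of the stack-based traversal: push the zero-out-degree nodes, keep pushing
// unprocessed upstream (source) nodes, and record each node when it is popped.
def topologicalSequence(nodes: Seq[EtlNode], edges: Seq[EtlEdge]): Seq[EtlNode] = {
  val byId       = nodes.map(n => n.nodeId -> n).toMap
  val upstream   = edges.groupBy(_.target).map { case (t, es) => t -> es.map(_.source) } // target -> sources
  val hasOutEdge = edges.map(_.source).toSet

  val stack = mutable.Stack[String]()
  val done  = mutable.LinkedHashSet[String]()                 // pop order = topological sequence
  nodes.filterNot(n => hasOutEdge.contains(n.nodeId)).foreach(n => stack.push(n.nodeId)) // out-degree 0

  while (stack.nonEmpty) {
    val top     = stack.top                                   // read without popping
    val pending = upstream.getOrElse(top, Seq.empty).filterNot(done.contains)
    if (pending.nonEmpty) pending.foreach(stack.push)         // push unprocessed upstream nodes
    else { stack.pop(); done += top }                         // no pending upstream: pop and record
  }
  done.toSeq.map(byId)                                        // nodes in topological order
}
```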
Further, as shown in fig. 4, as a preferred embodiment, step S104 in this embodiment, based on the topology sequence, sequentially establishing a temporary view corresponding to the flow node according to the SQL statement by Spark, processing data according to the finally obtained temporary view, and outputting target data to a target database specifically includes:
s401, constructing a dataframe data set according to the data reading nodes, and registering the dataframe data set into spark SQL according to SQL statements corresponding to the data reading nodes to obtain a data reading temporary view;
s402, according to the data reading view, based on the topological sequence, sequentially according to the SQL statement of each data processing node, constructing a data processing temporary view corresponding to each processing node;
and S403, processing data based on the SQL statement corresponding to the data output node according to the finally obtained data processing temporary view, and outputting target data to a target database.
In a preferred embodiment, step S402 of the foregoing process, based on the topology sequence, sequentially building a data processing temporary view corresponding to each processing node according to the SQL statement of each data processing node, specifically including:
acquiring a temporary view of an upstream node of the data processing node according to the topological sequence;
based on the temporary view of the upstream node of the data processing node, constructing the temporary view of the data processing according to the SQL statement corresponding to the data processing node;
wherein the upstream view of the first data processing node in the topology sequence is the data reading temporary view.
In combination with the above embodiments, the present invention provides a more detailed embodiment for explaining the steps S401 to S403:
after traversing the DAG graph of the target flow, the SQL node in the graph needs to be taken out to construct a temporary view. In the JSON format, it can be seen that each node contains SQL parameters, so in this step, a Spark temporary view needs to be constructed in different ways according to different node types, and the purpose of constructing the temporary view is to enable a downstream node to normally associate with SQL of an upstream node, complete the building of sub-queries and the multi-table join function, and finally execute Spark action on an output node to complete data output. The specific different node types operate as follows:
when the node type is a data reading node:
1. constructing a data table as dataframe by using spark.
2. Register dataframe into Spark SQL using dataframe.
3. SQL (SQL in a node) is used to filter out fields required by SQL configured in the node;
4. create OrRepleTempView ("VIEW _ NODE ID") is used to re-register dataframe to Spark SQL as a temporary VIEW (i.e., the data read temporary VIEW described above), and the named VIEW name is VIEW _ NODE ID, facilitating SQL combining by downstream nodes.
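A minimal sketch of these four steps, assuming spark is the active SparkSession, that node is a descriptor like the EtlNode sketched earlier, and that jdbc_config carries standard Spark JDBC options (url, dbtable, user, password); the names here are illustrative, not the claimed implementation.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-sql-etl").getOrCreate()

// 1-2. Load the relational table as a DataFrame via jdbc_config and register it in Spark SQL.
val sourceDf = spark.read
  .format("jdbc")
  .options(node.jdbcConfig.getOrElse(Map.empty))
  .load()
sourceDf.createOrReplaceTempView(node.tableName.getOrElse(node.nodeId))

// 3-4. Apply the node's SQL to keep only the required fields, then re-register the result
// as the data reading temporary view VIEW_<node_id> for downstream nodes to reference.
spark.sql(node.sql).createOrReplaceTempView(s"VIEW_${node.nodeId}")
```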
When the node type is a data cleaning/conversion node or a data association node (see the sketch below):
1. the SQL in the node is used for data cleaning and filtering;
2. createOrReplaceTempView("VIEW_" + node ID) is used to construct a temporary view (i.e., the data processing temporary view described above), and the view table is named VIEW_<node ID>.
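A corresponding sketch for the cleaning/conversion and association nodes, under the same assumptions as above; note that nothing is executed yet, since registering a temporary view is only a transformation.

```scala
// The node's SQL already references its upstream view(s), e.g. VIEW_NODE_01 or a join of
// VIEW_NODE_03 and VIEW_NODE_02, so building the view is just running the statement and
// re-registering the result under the current node's view name. No Spark job runs here.
spark.sql(node.sql).createOrReplaceTempView(s"VIEW_${node.nodeId}")
```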
When the node type is a data output node (see the sketch below):
1. the SQL in the node is used for output field filtering;
2. data output to the different storage schemes is carried out through the DataFrame write/save interface.
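And a sketch for the output node, again assuming JDBC output and the illustrative names above; the save() call is the only Spark action in the whole flow, so this is the point where the accumulated SQL actually runs.

```scala
// Execute the output node's SQL (e.g. SELECT ... FROM VIEW_NODE_04) and write the result
// to the destination table described by jdbc_config. save() triggers the Spark action.
val resultDf = spark.sql(node.sql)
resultDf.write
  .format("jdbc")
  .options(node.jdbcConfig.getOrElse(Map.empty))
  .mode("append")                 // assumption: append vs. overwrite depends on the scenario
  .save()
```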
By parsing the above different node formats and using Spark SQL, most of the functions an ETL requires can be completed, including data reading, cleaning and conversion, data association and data output. Fig. 5 gives a concrete example of the foregoing embodiment, showing the SQL of each flow node and the corresponding constructed view name.
It should be noted that in this embodiment, when an output operation needs to be executed, the output node performs two actions: it first executes the SQL of the current node to create a Spark DataFrame, and then performs the data processing. The output node does not need to be registered as a temporary view, because it is already the last node and can be understood as a Spark action. Spark RDD operators are generally divided into transformations and actions; transformation operators do not trigger task execution, only action operators do. In the foregoing descriptions of the other flow nodes, apart from the data output node, none of the processes of data reading, temporary view registration, DataFrame construction and the like triggers task execution; only the save() method used when parsing the output node triggers task execution. The output node can complete the output and write operations to databases of different storage types by means of the Spark platform, completing the final link of the ETL.
In summary, the specific embodiment of the fast ETL method based on Spark SQL temporary view provided by the present invention has the following beneficial effects:
on one hand, in the embodiment, because the action is only executed in the data reading process, the read-write process of a large amount of data to a disk is omitted in the ETL construction process, the data calculation processing speed is greatly improved, and the advantage of memory calculation is furthest exerted when the Spark is used as the bottom layer to construct the ETL for data processing calculation.
On the other hand, because this embodiment is developed on top of Spark SQL, the later maintenance cost of the ETL tool is reduced; developing with Spark SQL further reduces the workload of repeated development and improves the development speed of the product. At present, the SQL functions in Spark SQL can meet the cleaning and conversion computation requirements of most business scenarios, and developers only need to write components for special business scenarios, which avoids repeated development work, makes the system easier for implementers to master, and reduces implementation cost.
In order to better implement the Spark SQL temporary view-based fast ETL method of the embodiments of the present invention, and on the basis of that method, please refer to fig. 6, which is a schematic structural diagram of an embodiment of the Spark SQL temporary view-based fast ETL device provided by the present invention. The Spark SQL temporary view-based fast ETL device 600 provided in this embodiment of the present invention includes:
a node constructing module 610, configured to construct, according to a target ETL step, an SQL statement corresponding to the ETL step, and establish a flow node corresponding to the ETL step based on the SQL statement;
a DAG graph constructing module 620, configured to construct a target flow DAG graph according to the flow nodes;
a sorting module 630, configured to traverse the target flow DAG graph to obtain a topological sequence of the plurality of flow nodes;
and the data processing module 640 is configured to establish a temporary view corresponding to the process node according to the SQL statement sequentially through Spark based on the topology sequence, process data according to the finally obtained temporary view, and output target data to a target database.
Here, it should be noted that: the corresponding apparatus 600 provided in the foregoing embodiments may implement the technical solutions described in the foregoing method embodiments, and the specific implementation principle of each module or unit may refer to the corresponding content in the foregoing method embodiments, which is not described herein again.
Further, please refer to fig. 7, wherein fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. Based on the above fast ETL method based on Spark SQL temporary view, the present invention also provides a fast ETL device 700 based on Spark SQL temporary view, that is, the above electronic device, where the fast ETL device 700 based on Spark SQL temporary view may be a mobile terminal, a desktop computer, a notebook, a palm computer, a server, or other computing devices. The Spark SQL temporary view-based fast ETL device 700 includes a processor 710, a memory 720, and a display 730. Fig. 7 shows only some of the components of the fast ETL device based on the Spark SQL temporary view, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead.
The memory 720 may in some embodiments be an internal storage unit of the Spark SQL temporary view-based fast ETL device 700, such as a hard disk or memory of the Spark SQL temporary view-based fast ETL device 700. In other embodiments the memory 720 may also be an external storage device of the Spark SQL temporary view-based fast ETL device 700, such as a plug-in hard disk equipped on the Spark SQL temporary view-based fast ETL device 700, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, and so on. Further, the memory 720 may also include both an internal storage unit and an external storage device of the fast ETL device 700 based on the Spark SQL temporary view. The memory 720 is used for storing the application software installed on the fast ETL device 700 based on the Spark SQL temporary view and various data, such as the program codes of the fast ETL device 700 based on the Spark SQL temporary view. The memory 720 may also be used to temporarily store data that has been output or is to be output. In an embodiment, the memory 720 stores a fast ETL program 740 based on a Spark SQL temporary view, and the fast ETL program 740 based on the Spark SQL temporary view can be executed by the processor 710, so as to implement the fast ETL method based on the Spark SQL temporary view according to the embodiments of the present application.
The display 730 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. Display 730 is used to display information at the Spark SQL temporary view-based fast ETL device 700 and to display a user interface for visualization. The components 710-730 of the fast ETL device 700 based on Spark SQL temporary view communicate with each other over a system bus.
In one embodiment, the steps in the Spark SQL temporary view-based fast ETL method described above are implemented when processor 710 executes Spark SQL temporary view-based fast ETL program 740 in memory 720.
The embodiment also provides a computer readable storage medium, on which a Spark SQL temporary view-based fast ETL program is stored, and when being executed by a processor, the Spark SQL temporary view-based fast ETL program can implement the steps in the above embodiments.
According to the fast ETL method, apparatus, device and medium based on Spark SQL temporary views, the SQL statement of each ETL step is constructed from the specific action of the target ETL step, the flow nodes and the target flow DAG graph of the whole ETL process are constructed according to these SQL statements, the temporary view of each flow node is then constructed in Spark in the order of the topological sequence of the target flow DAG graph, the data are processed according to the finally obtained temporary view, and the target data are output to the target database. Compared with the prior art, the method integrates Spark SQL with the DAG graph used in the ETL process, so that the DAG graph is finally expressed as a complete, full-flow Spark SQL job that runs on the Spark platform, realizes the basic functions of an ETL tool, and improves development efficiency.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims.
Claims (10)
1. A quick ETL method based on Spark SQL temporary view is characterized by comprising the following steps:
according to the target ETL step, constructing an SQL statement corresponding to the ETL step, and establishing a flow node corresponding to the ETL step based on the SQL statement;
constructing a target flow DAG graph according to the flow nodes;
traversing the target flow DAG graph to obtain a topological sequence of the plurality of flow nodes;
and establishing a temporary view corresponding to the process node according to the SQL statement sequentially through Spark based on the topological sequence, processing data according to the finally obtained temporary view, and outputting target data to a target database.
2. The Spark SQL temporary view-based fast ETL method according to claim 1, wherein the constructing the SQL statements corresponding to the ETL step according to the target ETL step and establishing the flow nodes corresponding to the ETL step based on the SQL statements comprises:
establishing a node ID of a target node, and establishing a view name of the target node according to the node ID of the target node;
establishing a node type according to the ETL step;
constructing an SQL statement of the target node according to the action content of the target ETL step and the view name of the target node;
and constructing the target node according to the node ID of the target node, the node type and the SQL statement of the target node.
3. The Spark SQL temporary view-based fast ETL method according to claim 2, wherein the constructing an SQL statement of the target node according to the action content of the target ETL step and the view name of the target node comprises:
acquiring a view name of an upstream node of the target node;
and constructing the SQL statement of the target node according to the action content of the target ETL step, the view name of the upstream node of the target node and the view name of the target node.
4. The Spark SQL temporary view-based fast ETL method according to claim 3, wherein the constructing the target node according to the node ID of the target node, the node type and the SQL statement of the target node comprises:
establishing JDBC parameters according to the target ETL step;
and constructing the target node according to the node ID of the target node, the node type, the SQL statement of the target node and the JDBC parameter.
5. The Spark SQL temporary view-based fast ETL method according to claim 4, wherein the flow nodes comprise a data reading node, a data processing node and a data output node; the establishing a temporary view corresponding to the process node according to the SQL statement sequentially through Spark based on the topological sequence, processing data according to the finally obtained temporary view, and outputting target data to a target database includes:
constructing a dataframe data set according to the data reading nodes, and registering the dataframe data set into Spark SQL according to SQL statements corresponding to the data reading nodes to obtain a data reading temporary view;
according to the data reading view and the topological sequence, sequentially constructing a data processing temporary view corresponding to each processing node according to the SQL statement of each data processing node;
and processing data based on the SQL statement corresponding to the data output node according to the finally obtained data processing temporary view, and outputting target data to a target database.
6. The Spark SQL temporary view-based fast ETL method according to claim 5, wherein the constructing a data processing temporary view corresponding to each processing node according to the SQL statement of each data processing node in turn based on the topological sequence includes:
acquiring a temporary view of an upstream node of the data processing node according to the topological sequence;
based on the temporary view of the upstream node of the data processing node, constructing the temporary view of the data processing according to the SQL statement corresponding to the data processing node;
wherein the upstream view of the first data processing node in the topology sequence is the data reading temporary view.
7. The Spark SQL temporary view-based fast ETL method according to claim 1, wherein traversing the target flow DAG graph to obtain a topological sequence of a plurality of the flow nodes comprises:
establishing a stack, and stacking the flow nodes with zero out-degree in the DAG graph of the target flow;
acquiring a flow node of a stack top, judging whether the flow node of the stack top has an upstream node or not, if so, stacking the upstream node of the flow node of the stack top, and if not, popping the flow node of the stack top;
and obtaining the topology sequence according to the pop sequence of the process nodes.
8. A fast ETL device based on Spark SQL temporary view is characterized by comprising:
the node construction module is used for constructing an SQL (structured query language) statement corresponding to the ETL step according to the target ETL step and establishing a process node corresponding to the ETL step based on the SQL statement;
a DAG graph constructing module, configured to construct a target flow DAG graph according to the flow nodes;
the sequencing module is used for traversing the target flow DAG graph to obtain a topological sequence of the plurality of flow nodes;
and the data processing module is used for establishing a temporary view corresponding to the process node according to the SQL statement sequentially through Spark based on the topological sequence, processing data according to the finally obtained temporary view and outputting target data to a target database.
9. An electronic device comprising a memory and a processor, wherein,
the memory is used for storing programs;
the processor, coupled to the memory, is configured to execute the program stored in the memory to implement the steps in the Spark SQL temporary view-based fast ETL method according to any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer readable program or instructions, which when executed by a processor, can implement the steps in the Spark SQL temporary view-based fast ETL method according to any of the above claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211580856.1A CN115809294A (en) | 2022-12-09 | 2022-12-09 | Rapid ETL method based on Spark SQL temporary view |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115809294A true CN115809294A (en) | 2023-03-17 |
Family
ID=85485528
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211580856.1A Pending CN115809294A (en) | 2022-12-09 | 2022-12-09 | Rapid ETL method based on Spark SQL temporary view |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115809294A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116932575A (en) * | 2023-09-12 | 2023-10-24 | 长城证券股份有限公司 | Spark-based cross-data source operation method, device and storage medium |
CN116932575B (en) * | 2023-09-12 | 2023-12-15 | 长城证券股份有限公司 | Spark-based cross-data source operation method, device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106897322B (en) | A kind of access method and device of database and file system | |
CN112559554B (en) | Query statement optimization method and device | |
US8537160B2 (en) | Generating distributed dataflow graphs | |
CN115136113A (en) | Editor for generating computation graph | |
JP2020522790A (en) | Automatic dependency analyzer for heterogeneously programmed data processing systems | |
EP2595072A1 (en) | System and method implementing a text analysis repository | |
CN106611044B (en) | SQL optimization method and equipment | |
US20140172914A1 (en) | Graph query processing using plurality of engines | |
CN105677812A (en) | Method and device for querying data | |
JP2010524060A (en) | Data merging in distributed computing | |
Sellami et al. | Complex queries optimization and evaluation over relational and NoSQL data stores in cloud environments | |
CN102982095B (en) | A kind of body automatic creation system based on thesaurus and method thereof | |
CN116628066B (en) | Data transmission method, device, computer equipment and storage medium | |
CN116483850A (en) | Data processing method, device, equipment and medium | |
EP3293645B1 (en) | Iterative evaluation of data through simd processor registers | |
EP3293644B1 (en) | Loading data for iterative evaluation through simd registers | |
CN113407807A (en) | Query optimization method and device for search engine and electronic equipment | |
CN115809294A (en) | Rapid ETL method based on Spark SQL temporary view | |
CN113505278B (en) | Graph matching method and device, electronic equipment and storage medium | |
Rompf et al. | A SQL to C compiler in 500 lines of code | |
WO2023089076A1 (en) | A computing system and process | |
Bai et al. | Para-g: Path pattern query processing on large graphs | |
Shmeis et al. | A rewrite-based optimizer for spark | |
CN111191106B (en) | DSL construction method, system, electronic device and medium | |
CN106484706B (en) | Method and apparatus for executing procedural SQL statements for distributed systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |