CN116301755A - Automatic batch flow data marking framework construction method based on directed calculation graph

Info

Publication number: CN116301755A
Application number: CN202310300686.5A
Authority: CN
Inventors: 赵生捷; 温庭杰; 邓浩
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2023-03-23
Filing date: 2023-03-23
Publication date: 2023-06-23

Abstract

The invention relates to an automatic batch flow data marking framework construction method based on a directed calculation graph, which comprises the following steps: inputting an interface description language, wherein the interface description language comprises a table creation instruction, a data source definition and a data connection instruction; performing lexical analysis on the interface description language to generate an abstract syntax tree; converting the abstract syntax tree into a relational algebra expression tree; defining the relational algebra expression tree as a logical directed computation graph; and performing a code generation operation on the nodes of the logical directed computation graph to obtain a physical directed graph, thereby completing construction of the batch-stream data marking framework. Compared with the prior art, automated marking of batch-stream data can be realized flexibly: even if the marking rules change, there is no need to rewrite code or to take programs offline and bring them back online, so enterprise production efficiency can be effectively improved.

Description

Automatic batch flow data marking framework construction method based on directed calculation graph

Technical Field

The invention relates to the technical field of data processing, in particular to an automatic batch flow data marking framework construction method based on a directed calculation graph.

Background

The Internet of Things (IoT) refers to using devices and technologies such as information sensors, radio frequency identification, global positioning systems, infrared sensors and laser scanners to collect, in real time, information about any object or process that needs to be monitored, connected or interacted with (such as its sound, light, heat, electricity, mechanics, chemistry, biology and position), and to establish ubiquitous connections between things and between things and people through every available form of network access, thereby achieving intelligent sensing, identification and management of objects and processes. The Internet of Things is an important component of smart city engineering. As smart city projects advance, sensors generate large amounts of structured streaming data in real time. At the same time, data centers store static (batch) city-related data, such as street longitude and latitude and sensor geographic positions.

However, sensor data on its own lacks global semantics. For example, a pipeline sensor collects values such as the pressure and temperature in a pipeline and records the street on which the sensor is located, but attributes of that street such as its longitude and latitude are stored in a relational database at the data center, so the data collected by the sensor cannot by itself be related to the semantic information of the street. To add global semantics to sensor data generated in real time, each newly generated piece of sensor data must be associated with the semantic data in the data center. For convenience of explanation, this process is called data marking, that is, adding labels carrying global semantics to streaming data.

According to investigation, no open-source framework dedicated to the batch-stream data marking scenario is currently available. The mainstream approach to batch-stream marking is to hard-code the marking rules. The fatal defect of this approach is inflexibility: once the marking rules change, the code must be rewritten and the programs must be taken offline and brought back online, which is unacceptable in the production environment of a software enterprise. Developing a framework that automatically generates marking logic from marking rules is therefore particularly important.

A similar scenario is the marking of dual-stream data (as in e-commerce OLAP scenarios). Open-source distributed computing frameworks such as Flink, Spark and Storm handle automated marking of dual-stream data effectively. Flink can automatically generate a DAG (directed computation graph) from SQL statements and use a codegen tool to automatically generate the corresponding data processing code on each node of the DAG, so that the service code for automated dual-stream marking is generated from the SQL statements.

In view of the foregoing, research on automated marking of batch-stream data is still lacking.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide an automatic batch flow data marking framework construction method based on a directed calculation graph.

The aim of the invention can be achieved by the following technical scheme:

An automatic batch flow data marking framework construction method based on a directed calculation graph comprises the following steps:

inputting an interface description language, wherein the interface description language comprises a table creation instruction, a data source definition and a data connection instruction;

performing lexical analysis on the interface description language to generate an abstract syntax tree;

converting the abstract syntax tree into a relational algebra expression tree;

defining the relational algebra expression tree as a logic directed computation graph;

and performing code generation operation on the nodes on the logic directed computation graph to obtain a physical directed graph, and completing the construction of the batch data marking framework.

Further, the interface description language is the SQL language.

Further, a root node of the relational algebra expression tree is an exit of the logical directed computation graph, and a leaf node of the relational algebra expression tree is an entry of the logical directed computation graph.

Further, the logical directed computation graph is formed by connecting a plurality of logical nodes in series, wherein the logical nodes comprise table scanning nodes, filtering nodes, association nodes and projection nodes;

the table scanning node is an entry of a logic directed computation graph;

the projection node is used for specifying the mapping relation between the input item and the output item;

the association node is used for defining the association relation between the input item and the output item.

Further, code generation operation is performed on the nodes on the logic directed computation graph to obtain a physical directed graph, and the method specifically comprises the following steps:

generating specific service codes on the corresponding nodes according to the logic nodes and the node rules thereof, and sequentially performing code generation operation on each logic node to obtain corresponding physical nodes;

and combining all the physical nodes, and optimizing based on rules to obtain the physical directed graph.

Further, a base table class is defined for the table scanning node, and the base table class exposes a GetRows interface;

the table scanning node further comprises a flow table class and a batch table class, both of which inherit from the base table class and likewise expose the GetRows interface;

a corresponding streaming data table class is defined separately for each streaming data source involved and inherits from the flow table class; a corresponding batch data table class is defined separately for each batch data source involved and inherits from the batch table class;

the specific table scanning code logic is implemented in the streaming data table and batch data table classes.

Further, when generating the logic node in the logic directed computation graph, data source information in the interface description language needs to be saved, and code generation operation of the node is realized based on the data source information.

Further, the projection node performs mapping operation according to semantic information attached to the relational algebra expression tree.

Further, the association node selects batch data sources to construct a hash table so as to carry out association operation;

if the association condition contains and logic, namely select * from f1 left join d1 on f1.b = d1.b and f1.c = d1.c, the hash table is built with concat(d1.b, d1.c) as the key;

if the association condition contains or logic, namely select * from f1 left join d1 on d1.b = f1.b or d1.c = f1.c, one hash table is built with d1.b as the key and another hash table is built with d1.c as the key, and during association the query results from the two hash tables are output as a union;

if the association condition contains nested and/or logic, namely ((f1.b = d1.b or f1.c = d1.c) and f1.d = d1.d), it is converted by the distributive law of the logical operators into ((f1.b = d1.b and f1.d = d1.d) or (f1.c = d1.c and f1.d = d1.d)); after conversion, each basic rule contains only and operations and the basic rules are connected by or; within a basic rule the hash table is built with a concatenated key, and across basic rules the output results are combined as a union.

Further, rule-based optimization is performed to equivalently move all filtering nodes to positions adjacent to table scanning nodes; after the optimization, the direct input of each filtering node is a table scanning node, so that filtering operations can be removed from the interface description language and the filtering logic functions written directly at the code level.

Compared with the prior art, the invention has the following beneficial effects:

According to the invention, lexical analysis of the input interface description language is used to generate the logical directed computation graph, and a framework that automatically generates marking logic from marking rules is thereby constructed. Automated marking of batch-stream data can be realized flexibly: even if the marking rules change, there is no need to rewrite code or to take programs offline and bring them back online, so enterprise production efficiency can be effectively improved.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention;

FIG. 2 is a block diagram of automated generation of marking logic in accordance with an embodiment of the present invention;

FIG. 3 is an example of a code implementation for generating a relational algebra expression tree using Calcite in an embodiment of the present invention;

FIG. 4 is a schematic diagram of the relationships between the classes related to the table scanning node in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the result of parsing a projection node using Calcite in an embodiment of the present invention;

FIG. 6 is a schematic diagram of the result of parsing an association node using Calcite in an embodiment of the present invention;

FIG. 7 is a schematic flow chart of RBO optimization in an embodiment of the present invention.

Detailed Description

The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.

Definitions of terms:

batch data: refers to data that is statically stored at a data center (that does not change over time);

stream data: refers to data generated by devices such as a sensor in real time (a new piece of data is generated each time a piece of data is acquired);

marking batch-stream data: refers to the process of associating batch data with stream data according to certain rules, such as equal street ids. For example, given stream data of the form (id: 001, street_id: 20180010, city: shanghai, pressure: 100kapa, temperature: 452, depth: -1422.3) and batch data of the form (id: 001, street_id: 20180010, city: shanghai, credit: 0.672, poll: -24732), the marked data has the form (id: 001, street_id: 20180010, city: shanghai, pressure: 100kapa, temperature: 452, depth: -1422.3, credit: 0.672, poll: -24732). Through association on the street_id attribute, the semantic information carried by the batch fields (credit and poll) is added to the stream data. The actual process is much more complex than this: a piece of stream data may be associated recursively with batch data in several databases;
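The marking step in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mark`, the dictionary representation of a record and the rule that stream values win on shared fields are hypothetical and are not taken from the patent.

```python
# Hypothetical records shaped like the example above.
stream_record = {
    "id": "001", "street_id": "20180010", "city": "shanghai",
    "pressure": "100kapa", "temperature": 452, "depth": -1422.3,
}
batch_records = [
    {"id": "001", "street_id": "20180010", "city": "shanghai",
     "credit": 0.672, "poll": -24732},
]


def mark(stream_row, batch_rows, key="street_id"):
    """Return a copy of stream_row extended with the fields of the matching batch row."""
    # Index the batch side by the association key.
    index = {row[key]: row for row in batch_rows}
    match = index.get(stream_row[key], {})
    marked = dict(stream_row)
    for field, value in match.items():
        # Assumption: on fields present on both sides, the stream value is kept.
        marked.setdefault(field, value)
    return marked


marked = mark(stream_record, batch_records)
print(marked)
```

A stream record whose street_id has no counterpart on the batch side is returned unchanged in this sketch.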

automated marking: in the example above, marking is based on association by street_id. The mainstream approach to batch-stream marking today is to hard-code such marking rules, so the code must be rewritten whenever the rules change. Automated marking instead aims to generate the code that implements the marking logic automatically from an intermediate language (e.g. SQL).

To address the lack in the prior art of an open-source framework dedicated to the batch-stream data marking scenario, the invention provides an automatic batch flow data marking framework construction method based on a directed calculation graph. As shown in FIG. 1, the method comprises the following steps:

inputting an interface description language (IDL, interface description language), wherein the interface description language comprises a table creation instruction, a data source definition and a data connection instruction;

performing lexical analysis on an Interface Description Language (IDL) to generate an abstract syntax tree;

converting the abstract syntax tree into a relational algebra expression tree;

defining the relational algebra expression tree as a logical directed computation graph (DAG, Directed Acyclic Graph);

and performing a code generation operation on the nodes of the logical directed computation graph to obtain a physical directed graph, thereby completing construction of the batch-stream data marking framework.

In this embodiment, the framework automatically generates the marking logic from SQL; that is, the input of the whole framework is an IDL (interface description language), and SQL is designated as the IDL.

Taking FIG. 2 as an example, the interface description language input in this embodiment defines three abstract tables, F1 (streaming), D1 (batch) and D2 (batch), where F1 is bound to a Kafka message queue (a common streaming data source) and D1 and D2 are bound to MySQL (a batch data source). The assumed scenario is that F1 continuously produces new data over time, and the data collected by F1 contains only the fields a and b. Each time F1 produces a piece of data, the F1 entry must be associated with entries in D1 and D2: the F1 entry is first associated with a D1 entry on the b field to produce an F2 entry, and the F2 entry is then associated with a D2 entry on the c field to produce an F3 entry. After the association is completed, the F3 entry is stored in a persistence layer (e.g. HDFS) for subsequent OLAP (Online Analytical Processing) operations driven by the relevant business requirements.

The SQL in FIG. 2 has the following meaning. First, a streaming table named F1 is created, whose data source is Kafka and which has two fields, a and b. Then two batch tables, D1 and D2, are created, whose data source is MySQL; D1 has two fields, b and c, and D2 has two fields, c and d. The query is then executed. Its specific meaning is: first, F1 is left-joined with D1 on the condition that the b field of F1 equals the b field of D1, giving a temporary table F2 (with three fields a, b and c); then F2 is left-joined with D2 on the condition that the c field of F2 equals the c field of D2, giving the output table, which has four fields a, b, c and d (corresponding to F2.a, F2.b, F2.c and D2.d).

In the framework of this embodiment, the process of automatically generating the marking logic comprises two major stages. The first stage converts the SQL statement into a logical DAG. In this stage, the SQL is lexically analyzed with Apache Calcite (a dynamic data management framework) to generate an abstract syntax tree, which is finally converted into a relational algebra expression tree (RelNode tree). The relational algebra expression tree is the logical DAG: its root node is the exit of the graph and its leaf nodes are the entries. FIG. 3 shows how the Calcite framework translates the SQL statement of this embodiment into a logical DAG (relational algebra expression tree); the RelNode in FIG. 3 corresponds exactly to the RelNode Tree part of FIG. 2.
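To make the "root is the exit, leaves are the entries" reading concrete, the following Python sketch models the logical DAG for the F1/D1/D2 scenario as a plain tree. It does not use Calcite and does not reproduce its RelNode API; the `LogicalNode` class, its fields and the `leaves` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogicalNode:
    """A node of the logical DAG; it carries a description but no executable code."""
    kind: str                      # "scan", "filter", "join" or "project"
    detail: str = ""               # human-readable semantic information
    inputs: List["LogicalNode"] = field(default_factory=list)


def leaves(node):
    """Collect the leaf nodes (graph entries) below node, left to right."""
    if not node.inputs:
        return [node]
    found = []
    for child in node.inputs:
        found.extend(leaves(child))
    return found


# Logical plan for: F1 left join D1 on b, then left join D2 on c.
f1 = LogicalNode("scan", "F1 (kafka)")
d1 = LogicalNode("scan", "D1 (mysql)")
d2 = LogicalNode("scan", "D2 (mysql)")
join1 = LogicalNode("join", "F1.b = D1.b", [f1, d1])
join2 = LogicalNode("join", "F2.c = D2.c", [join1, d2])
root = LogicalNode("project", "a, b, c, d", [join2])   # exit of the graph

entries = [node.detail for node in leaves(root)]
print(entries)
```

Every entry found this way is a scan node, which matches the requirement that the leaves of the DAG be table scanning nodes.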

Note that the DAG at this point is only logical: none of the filtering, projection, association and table scanning nodes has a concrete code implementation yet. In the second stage of the framework, specific service code must be generated on each node according to the node and its node rules. A code generation operation is performed on each logical node to obtain a physical node, and the physical nodes are combined and subjected to RBO (Rule-Based Optimization) to obtain the physical directed graph. The physical directed graph is a program that can be deployed directly.

Of the two stages of the framework flow, Calcite already handles the requirements of stage one well. The technical details in this application therefore mainly concern stage two, i.e. how service code is generated on the logical nodes.

In the scenario of this embodiment, the entries (leaf nodes) of the DAG must be table scanning nodes (scantable node). Table scanning nodes can be divided into batch table scanning nodes and stream table scanning nodes according to the type of data source. As shown in FIG. 4, the framework of this embodiment defines a base table class (BaseTable) with a single interface, GetRows. Both the flow table class (FlowTable) and the batch table class (BatchTable) inherit from BaseTable. For each streaming data source involved, a separate class (e.g. kafkaTable, RBMQTable) is defined that inherits from FlowTable. The same applies to batch data sources (e.g. MySQL, PostgreSQL). The concrete table scanning logic is implemented in the lowest-level classes (e.g. at the kafkaTable level).
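The class relationships described above might look as follows in Python. This is a sketch under assumptions: the method is spelled `get_rows` to follow Python naming, and `KafkaTable` and `MysqlTable` are in-memory stand-ins that do not connect to Kafka or MySQL.

```python
from abc import ABC, abstractmethod
from typing import Any, List

Row = List[Any]


class BaseTable(ABC):
    """Base table class with a single interface (GetRows in the text)."""

    @abstractmethod
    def get_rows(self) -> List[Row]:
        ...


class FlowTable(BaseTable):
    """Common parent of all streaming data tables."""


class BatchTable(BaseTable):
    """Common parent of all batch data tables."""


class KafkaTable(FlowTable):
    """Stand-in for a Kafka-backed table: each call drains the pending rows."""

    def __init__(self, pending):
        self._pending = list(pending)

    def get_rows(self):
        rows, self._pending = self._pending, []
        return rows


class MysqlTable(BatchTable):
    """Stand-in for a MySQL-backed table: each call returns the full static table."""

    def __init__(self, rows):
        self._rows = list(rows)

    def get_rows(self):
        return list(self._rows)


stream_table = KafkaTable([["a1", "b1"]])
batch_table = MysqlTable([["b1", "c1"]])
print(stream_table.get_rows(), batch_table.get_rows())
```

Because every leaf class implements the same method, code that consumes a table scanning node does not need to know whether it is reading a stream or a batch source.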

In stage one, when the logical nodes are generated, this embodiment must save the data source information in the IDL (for example, create table from kafka and create table from mysql in FIG. 2). In stage two, when the specific service code is generated, only the concrete implementation class needs to be swapped according to the data source information of the table scanning node (e.g. from kafka, from mysql), because all of these classes implement the GetRows interface.
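One way to picture "swapping the implementation class according to the data source information" is a lookup table from source name to class. The sketch below is hypothetical: `TABLE_CLASSES`, `build_scan_node` and the metadata dictionaries are illustrative names, and the two table classes return fixed rows instead of reading real sources.

```python
class KafkaTable:
    """Stand-in streaming table (no real Kafka connection)."""

    def get_rows(self):
        return [["stream-row"]]


class MysqlTable:
    """Stand-in batch table (no real MySQL connection)."""

    def get_rows(self):
        return [["batch-row"]]


# Data source name saved from the IDL in stage one -> implementation class.
TABLE_CLASSES = {"kafka": KafkaTable, "mysql": MysqlTable}


def build_scan_node(table_meta):
    """Create the physical scan node for one logical scan node."""
    source = table_meta["source"]
    if source not in TABLE_CLASSES:
        raise ValueError(f"unsupported data source: {source}")
    return TABLE_CLASSES[source]()


# Metadata as it might be saved while generating the logical nodes.
logical_scans = [
    {"name": "F1", "source": "kafka"},
    {"name": "D1", "source": "mysql"},
]
physical = {meta["name"]: build_scan_node(meta) for meta in logical_scans}
print({name: node.get_rows() for name, node in physical.items()})
```

Supporting a new data source in this sketch means adding one class and one dictionary entry; the rest of the generated graph is unchanged.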

The service code of a projection node (project node) is generated as follows. In stage one, Calcite has already generated logical nodes carrying rich semantic information. As shown in FIG. 5, the semantic information attached to the projection node generated by Calcite indicates that field 0 ($0) of the input entry is mapped to field 0 of the output entry, field 1 ($1) of the input entry is mapped to field 1 of the output entry, and so on. When generating the service code, this embodiment therefore only needs to perform the mapping according to the semantic information attached to the RelNode.
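The mapping described above reduces to indexing into the input entry. A minimal sketch, assuming the semantic information has already been extracted into a list of input field indexes (the list `[0, 1, 3]` and the name `make_projection` are hypothetical):

```python
def make_projection(input_refs):
    """Build a projection function.

    input_refs[i] is the index of the input field ($n) that feeds output field i.
    """
    def project(row):
        return [row[ref] for ref in input_refs]
    return project


# Output field 0 <- $0, output field 1 <- $1, output field 2 <- $3.
project = make_projection([0, 1, 3])
result = project(["a0", "b0", "b0", "c0"])
print(result)
```

Here the joined entry carries the b field twice (once from each side), and the projection keeps a single copy.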

Association node (join node):

Generating a physical association node from a logical association node is relatively complex, so this embodiment starts with the simplest case. Still taking FIG. 2 as an example, F1 is a flow table and D1 is a batch table. For the IDL statement select a, b, c from F1 left join D1 on F1.b = D1.b, FIG. 6 shows the result of Calcite's parsing in stage one. The semantic information describing the association is stored in the condition field of the RelNode. In FIG. 6, for example, the condition indicates that the association compares field $1 (f1.b) with field $2 (d1.b) of the concatenated entry (f1.a, f1.b, d1.b, d1.c).

For single-field association of two static data sources, the mainstream industry solution is to choose one data source and build a hash table from it (for example, build d1 into a hash table whose keys are d1.b values and whose values are whole entries). Then, for each entry in f1, f1.b is used as the key to look up the corresponding entry in the hash table; if one is found its fields are appended, and if none is found null values are appended. For the association of two static data sources, it is generally more efficient to build the hash table from the data source with the smaller data volume, but in the batch-stream marking scenario the hash table is always built from the batch data source.
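The single-field case can be sketched as a build phase over the batch rows and a probe phase per stream row. The function names, the list-based row layout and the use of None for null values are assumptions made for illustration.

```python
def build_hash_table(batch_rows, key_index):
    """Build phase: map each key value to the batch rows that carry it."""
    table = {}
    for row in batch_rows:
        table.setdefault(row[key_index], []).append(row)
    return table


def left_join(stream_row, table, key_index, batch_width):
    """Probe phase for one stream row, with left-join semantics."""
    matches = table.get(stream_row[key_index])
    if not matches:
        # No partner found: pad with nulls so the output width stays constant.
        return [stream_row + [None] * batch_width]
    return [stream_row + match for match in matches]


d1 = [["b1", "c1"], ["b2", "c2"]]            # columns: b, c
table = build_hash_table(d1, key_index=0)    # key is d1.b

hit = left_join(["a1", "b1"], table, key_index=1, batch_width=2)
miss = left_join(["a9", "b9"], table, key_index=1, batch_width=2)
print(hit, miss)
```

The hash table is built once from the batch side and then probed for every stream entry that arrives, which fits the case where the batch data is static.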

For a slightly more complex case in which the association condition contains and logic, e.g. select * from f1 left join d1 on f1.b = d1.b and f1.c = d1.c, the hash table is built with concat(d1.b, d1.c) as the key.
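A sketch of the and case follows. Note one deliberate substitution: instead of string concatenation as in concat(d1.b, d1.c), the sketch uses a tuple as the composite key, because plain concatenation can make different value pairs collide (for example "ab" + "c" and "a" + "bc"). The function names and row layouts are hypothetical.

```python
def build_composite_table(batch_rows, key_indexes):
    """Index batch rows by the tuple of all key fields named in the and condition."""
    table = {}
    for row in batch_rows:
        key = tuple(row[i] for i in key_indexes)
        table.setdefault(key, []).append(row)
    return table


def probe(stream_row, table, key_indexes):
    """Look up the batch rows whose key fields all equal the stream row's fields."""
    return table.get(tuple(stream_row[i] for i in key_indexes), [])


d1 = [["b1", "c1", "x"], ["b1", "c2", "y"]]      # columns: b, c, payload
table = build_composite_table(d1, (0, 1))        # key is (d1.b, d1.c)

f1_row = ["a1", "b1", "c2"]                      # columns: a, b, c
matches = probe(f1_row, table, (1, 2))           # probe with (f1.b, f1.c)
print(matches)
```

If string concatenation is preferred, inserting a separator that cannot occur in the data avoids the same collision.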

If the association condition contains or logic, e.g. select * from f1 left join d1 on d1.b = f1.b or d1.c = f1.c, one hash table is built with d1.b as the key and another with d1.c as the key, and during association the query results from the two hash tables are output as a union.
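The or case can be sketched with two indexes over the same batch rows. To keep the union free of duplicates, this sketch stores row positions rather than rows, so a batch row that matches on both fields is emitted once; that deduplication choice and all names are assumptions.

```python
def build_index(batch_rows, key_index):
    """Map each key value to the positions of the batch rows that carry it."""
    index = {}
    for position, row in enumerate(batch_rows):
        index.setdefault(row[key_index], []).append(position)
    return index


d1 = [["b1", "c1"], ["b2", "c2"], ["b1", "c2"]]  # columns: b, c
by_b = build_index(d1, 0)                        # hash table keyed by d1.b
by_c = build_index(d1, 1)                        # hash table keyed by d1.c


def probe_or(b_value, c_value):
    """Union of the rows matching on b and the rows matching on c."""
    positions = set(by_b.get(b_value, [])) | set(by_c.get(c_value, []))
    return [d1[position] for position in sorted(positions)]


matches = probe_or("b1", "c2")
print(matches)
```

The last batch row matches both conditions but appears only once in the result.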

The most complex case is nested and/or logic in the association condition, such as ((f1.b = d1.b or f1.c = d1.c) and f1.d = d1.d). The distributive law of the logical operators is used to convert it into ((f1.b = d1.b and f1.d = d1.d) or (f1.c = d1.c and f1.d = d1.d)). After conversion, each basic rule contains only and operations, and the basic rules are connected by or. Within a basic rule the hash table is built with a concatenated key, and across basic rules the output results are combined as a union.
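The conversion by the distributive law can be sketched as a small recursive function that rewrites a nested condition into a list of basic rules. The condition encoding is hypothetical: an atom such as "b" stands for the equality f1.b = d1.b, and compound conditions are tuples of the form (operator, left, right).

```python
def to_basic_rules(condition):
    """Rewrite a nested and/or condition into basic rules.

    The result is a list of rules connected by or; each rule is a list of
    atoms connected by and.
    """
    if isinstance(condition, str):
        return [[condition]]
    operator, left, right = condition
    left_rules = to_basic_rules(left)
    right_rules = to_basic_rules(right)
    if operator == "or":
        return left_rules + right_rules
    if operator == "and":
        # Distributive law: (x or y) and z == (x and z) or (y and z).
        return [l + r for l in left_rules for r in right_rules]
    raise ValueError(f"unknown operator: {operator}")


# ((f1.b = d1.b or f1.c = d1.c) and f1.d = d1.d)
nested = ("and", ("or", "b", "c"), "d")
rules = to_basic_rules(nested)
print(rules)
```

Each resulting rule can then be served by one hash table with a composite key, and the outputs of the rules are combined as a union.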

Filter node (filter node):

In normal business scenarios, association nodes handle only "=" logic, whereas filtering nodes handle many kinds of logic (e.g. f1.a > 100). It is difficult to integrate code generation for all of these logical operations into one module. The framework of this embodiment does not solve this problem directly; instead it first uses RBO (rule-based optimization) to equivalently move every filtering node next to a table scanning node (as shown in FIG. 7). After optimization, the direct input of every filtering node is guaranteed to be a table scanning node. In this embodiment the filtering operations can then be removed from the IDL, and the filtering logic is written directly as functions at the code level (the input and output of these functions are both of type List<Object>).
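The rewrite that moves a filter next to its scan can be sketched on a toy plan. The node classes and the `push_down` and `attach` functions are hypothetical, and the sketch assumes that each predicate reads fields of exactly one table and that moving it below the join preserves the result (for a left join this holds, for example, when the predicate reads only the stream side).

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Scan:
    table: str


@dataclass
class Join:
    left: Any
    right: Any


@dataclass
class Filter:
    table: str                      # the table whose fields the predicate reads
    predicate: Callable[[list], bool]
    child: Any


def attach(node, flt):
    """Re-insert flt directly above the scan of flt.table inside node."""
    if isinstance(node, Scan):
        if node.table == flt.table:
            return Filter(flt.table, flt.predicate, node)
        return node
    if isinstance(node, Join):
        return Join(attach(node.left, flt), attach(node.right, flt))
    if isinstance(node, Filter):
        return Filter(node.table, node.predicate, attach(node.child, flt))
    return node


def push_down(node):
    """Move every filter in the plan next to the scan it refers to."""
    if isinstance(node, Filter):
        return attach(push_down(node.child), node)
    if isinstance(node, Join):
        return Join(push_down(node.left), push_down(node.right))
    return node


# Filter f1.a > 100 placed above the join, as a planner might emit it.
plan = Filter("f1", lambda row: row[0] > 100, Join(Scan("f1"), Scan("d1")))
optimized = push_down(plan)

# The filter is now an ordinary function over rows read from the scan.
kept = [row for row in [[50, "b1"], [150, "b2"]] if optimized.left.predicate(row)]
print(optimized, kept)
```

After the rewrite the filter's only input is the F1 scan, so the filtering logic can be applied as a plain function to the rows returned by that scan.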

Execution of the directed computation graph:

After code generation has been performed on each logical node, the resulting physical nodes are connected input to output to obtain the final directed computation graph (DAG). An interval is then configured, and the DAG executes once per interval. In each execution the root node requests data from its child nodes, each child node requests data from its own child nodes, and so on until the leaf nodes request data from their associated data sources.

In summary, a framework that automatically generates marking logic from marking rules can be constructed, so that automated marking of batch-stream data is realized flexibly. Even if the marking rules change, there is no need to rewrite code or to take programs offline and bring them back online, so enterprise production efficiency can be effectively improved.
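The pull-style execution described above can be sketched with physical nodes that all expose one method and call it on their children. The node classes, the `run` function and its parameters are hypothetical; the scan nodes return fixed in-memory rows, whereas a real stream scan would return only the rows that arrived since the previous execution.

```python
import time


class ScanNode:
    """Leaf node: returns rows from its data source (here an in-memory list)."""

    def __init__(self, rows):
        self.rows = rows

    def get_rows(self):
        return list(self.rows)


class JoinNode:
    """Left join of the left child against a hash table built from the right child."""

    def __init__(self, left, right, left_key, right_key, right_width):
        self.left, self.right = left, right
        self.left_key, self.right_key = left_key, right_key
        self.right_width = right_width

    def get_rows(self):
        table = {}
        for row in self.right.get_rows():          # request data from child
            table.setdefault(row[self.right_key], []).append(row)
        output = []
        for row in self.left.get_rows():           # request data from child
            matches = table.get(row[self.left_key]) or [[None] * self.right_width]
            output.extend(row + match for match in matches)
        return output


class ProjectNode:
    """Root node in this example: keeps the selected fields of each row."""

    def __init__(self, child, refs):
        self.child, self.refs = child, refs

    def get_rows(self):
        return [[row[i] for i in self.refs] for row in self.child.get_rows()]


def run(root, executions, interval_seconds, sink):
    """Execute the graph once per interval by pulling rows from the root."""
    for _ in range(executions):
        sink.extend(root.get_rows())
        time.sleep(interval_seconds)


f1 = ScanNode([["a1", "b1"]])
d1 = ScanNode([["b1", "c1"]])
root = ProjectNode(JoinNode(f1, d1, left_key=1, right_key=0, right_width=2), [0, 1, 3])

sink = []
run(root, executions=2, interval_seconds=0, sink=sink)
print(sink)
```

The sink list stands in for the persistence layer; each execution appends the rows produced by one pull from the root.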

The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims

1. The automatic batch flow data marking framework construction method based on the directed calculation graph is characterized by comprising the following steps of:

converting the abstract syntax tree into a relational algebra expression tree;

2. The method for constructing an automated batch data marking framework based on a directed computation graph according to claim 1, wherein the interface description language is the SQL language.

3. The automated batch data marking framework construction method based on directed graphs of claim 1, wherein a root node of the relational algebra expression tree is an exit of the logical directed graph and a leaf node of the relational algebra expression tree is an entry of the logical directed graph.

4. The method for constructing the automated batch data marking framework based on the directed computation graph according to claim 1, wherein the logical directed computation graph is formed by connecting a plurality of logical nodes in series, and the logical nodes comprise a table scanning node, a filtering node, an association node and a projection node;

the table scanning node is an entry of a logic directed computation graph;

5. The method for constructing an automated batch data marking framework based on a directed graph as claimed in claim 4, wherein the code generating operation is performed on the nodes on the logical directed graph to obtain a physical directed graph, and the method specifically comprises the following steps:

6. The method for constructing the automated batch data marking framework based on the directed computation graph according to claim 4, wherein a base table class is defined in the table scanning node, and the base table class uses a GetRows interface;

7. The automated batch data marking framework construction method based on the directed computation graph according to claim 4, wherein when generating the logical nodes in the logical directed computation graph, data source information in an interface description language needs to be saved, and code generation operation of the nodes is realized based on the data source information.

8. The method for constructing an automated batch data marking framework based on a directed computation graph according to claim 4, wherein the projection nodes perform mapping operations according to semantic information attached to a relational algebra expression tree.

9. The method for constructing an automated batch data marking framework based on a directed graph as claimed in claim 4, wherein the association node selects a batch data source to construct a hash table for association operation;

10. The method of claim 4, wherein the rule-based optimization is performed to convert all filtering nodes to locations adjacent to the scan table node, and after the optimization, the direct input of the filtering nodes is a scan table node to eliminate filtering related operations in the interface description language, and the filtering logic function is written directly at the code level.