CN109101468B

CN109101468B - Execution optimization method of text data conversion script

Info

Publication number: CN109101468B
Application number: CN201810873554.0A
Authority: CN
Inventors: 江大伟; 陈珂; 魏嘉荣; 寿黎但; 陈刚; 胡天磊; 伍赛
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2018-08-02
Filing date: 2018-08-02
Publication date: 2020-07-03
Anticipated expiration: 2038-08-02
Also published as: CN109101468A

Abstract

The invention discloses an execution optimization method for text data conversion scripts. For a text data conversion script executed through distributed processing over a network, the script is parsed to generate an execution plan tree; a tuple-based multiset is used as the data model of the text data, and the text data conversion script comprises data operations that modify and convert the structure and content of the multiset; a corresponding execution optimization method is adopted according to the execution scenario of the conversion script; and a logic program is generated from the optimized execution plan and run, thereby efficiently converting and processing the data on a big data platform. The method can be applied to processing massive text data in the data preparation stage; it reduces the time and space cost of executing text data conversion scripts and improves the efficiency of the data preparation stage.

Description

Execution optimization method of text data conversion script

Technical Field

The invention relates to an optimization method for processing mass text data, in particular to an execution optimization method for a text data conversion script.

Background

With the rapid development of the related fields such as the mobile internet, the internet of things and the like, data shows an explosive growth trend, and the types of the data which can be utilized are more and more abundant while the data volume is larger and larger. Through data analysis, people can extract key information from the data, and find rules to make decisions.

Data preparation is the first step in data analysis. The traditional data preparation mainly adopts an ETL technology, and requires a user to realize the extraction, conversion and loading processes of data in a manual program coding mode or by using program logic preset in a third-party ETL tool. The emerging self-service data preparation technology provides an interactive data conversion processing method based on a graphical interface. Through a data visualization technology and a machine learning technology, the self-service data preparation method visually displays data to a user, simultaneously conjectures the data conversion intention of the user and generates data conversion operation according to interactive operation such as mouse click and the like of the user in a graphical interface, and finally processes the data. The self-service data preparation method avoids program coding of data conversion logic, reduces the technical threshold of data preparation, and effectively improves the efficiency of the data preparation stage.

The data targeted by current data analysis is not limited to traditional structured data, but also covers semi-structured and unstructured text data such as XML, JSON and logs. Emerging big data platforms can effectively store large-scale semi-structured and unstructured text data. Self-service data preparation for big data adds the capacity to process massive data on top of self-service data preparation technology.

In self-service data preparation technology, a text data conversion language is a language for modeling user interaction operations in a graphical interface, and a text data conversion script is a program script written in the text data conversion language. A text data conversion script can convert and process massive text data.

Disclosure of Invention

In order to solve the problems described in the background, the invention provides an execution optimization method for text data conversion scripts. It addresses the processing of massive text data by text data conversion scripts, and generates and executes program logic that processes massive text data efficiently and scalably.

The technical scheme adopted by the invention for solving the technical problems is as follows:

The invention adopts a corresponding execution optimization method according to the execution scenario of the conversion script, namely the single data conversion script scenario or the multi-data conversion script scenario, thereby generating efficient and scalable program processing logic for massive text data.

Aiming at the text data conversion script executed by network distributed processing, the following method steps are adopted for processing:

(1) parsing the text data conversion script to generate an execution plan tree, and checking the legality and validity of the nodes in the execution plan tree; using a tuple-based multiset (a two-dimensional table formed by rows and columns) as the data model of the text data, wherein the text data conversion script comprises data operations that modify and convert the structure and content of the multiset;

the text data conversion script is a program script described using a text data conversion language.

The text data conversion language is a language for modeling user interaction operations in a graphical interface.

An execution plan tree is an abstract data structure based on trees.

In the execution plan tree, one node is a data operation, the parent-child relationship of the nodes in the tree represents the execution sequence of the data operation, and the data operation represented by the parent node can be executed only after the data operation represented by all the child nodes is completed.

(2) Adopting a corresponding execution optimization method according to different execution scenes of the conversion script;

(3) generating and running a logic program for the big data platform according to the execution plan result (execution plan tree/graph) obtained after optimization, thereby efficiently converting and processing the data on the big data platform.

In the execution plan tree of the step (1), one node is a data operation, the parent-child relationship of the nodes in the tree represents the execution sequence of the data operation, and the data operation represented by the parent node can be executed only after the data operation represented by all the child nodes is completed.
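The ordering rule above (a parent operation runs only after all of its child operations have finished) corresponds to a post-order traversal of the tree. The following is a minimal sketch under stated assumptions: the class name `PlanNode`, the function `execution_order` and the operation names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class PlanNode:
    """One data operation in an execution plan tree (illustrative names only)."""
    op: str
    children: list["PlanNode"] = field(default_factory=list)


def execution_order(root: PlanNode) -> list[str]:
    """Post-order walk: every child operation is listed before its parent."""
    order: list[str] = []

    def visit(node: PlanNode) -> None:
        for child in node.children:
            visit(child)
        order.append(node.op)

    visit(root)
    return order


# Hypothetical script: read two inputs, filter one, join them, write the result.
tree = PlanNode("output", [
    PlanNode("join", [
        PlanNode("filter", [PlanNode("input_a")]),
        PlanNode("input_b"),
    ]),
])
print(execution_order(tree))
# ['input_a', 'filter', 'input_b', 'join', 'output']
```

Data input operations end up first and the data output operation last, which matches the role of leaf and root nodes in the tree.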

The different execution scenes of the conversion script are divided into a single data conversion script and a multi-data conversion script.

For a single data conversion script, i.e. in the single-script execution scenario, the following step-by-step optimization method oriented to the execution plan tree is adopted, specifically:

1.1) operation push-down optimization: without changing the result of the data conversion, the space-time cost of execution is reduced by changing the execution order of the nodes in the execution plan tree. Specifically: if the tuples in two multisets have repeated values under a certain attribute, a screening (filter) operation on that attribute that would be performed after the connection (join) operation on the multisets is moved before the connection operation, so that the execution order of the screening operation relative to the connection operation is changed;

1.2) operation combination: the space-time cost of execution is reduced by merging two adjacent nodes in the execution plan tree. Specifically: if two adjacent nodes both operate on columns in the multiset, the two adjacent nodes are merged, where a node is the node corresponding to a data operation in the execution plan tree;

1.3) connection optimization: when several multisets are connected, the best connection method is selected according to the characteristics of the multisets. Specifically: the number of tuples in a multiset is taken as its characteristic, and processing proceeds according to the following two cases:

if two multisets need to be connected and their characteristics differ by no more than 30%, a screening operation is applied to one multiset before it is connected with the other;

if two multisets need to be connected and their characteristics differ by more than a factor of 3, the multiset with the smaller characteristic is transmitted to the node of the distributed network where the multiset with the larger characteristic is located.
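The two cases of step 1.3) can be sketched as a small decision function. This is only an illustration under assumptions: the text does not define how the "difference" is measured, so the sketch assumes a relative difference with respect to the larger tuple count for the 30% case and a ratio of tuple counts for the 3-times case; the function name, the returned labels and the fallback branch are hypothetical.

```python
def choose_join_strategy(left_count: int, right_count: int) -> str:
    """Pick a join strategy from the tuple counts of two multisets.

    Assumptions (not stated in the source text):
      * "differ by more than a factor of 3" means larger > 3 * smaller;
      * "differ by no more than 30%" means (larger - smaller) / larger <= 0.30;
      * any other case falls back to an unspecified default join.
    """
    if left_count <= 0 or right_count <= 0:
        raise ValueError("tuple counts must be positive")
    smaller, larger = sorted((left_count, right_count))
    if larger > 3 * smaller:
        # Ship the small multiset to the node that holds the large one.
        return "ship_smaller_to_larger"
    if (larger - smaller) / larger <= 0.30:
        # Similar sizes: screen one multiset first, then join.
        return "filter_then_join"
    return "default_join"


print(choose_join_strategy(1000, 900))   # filter_then_join
print(choose_join_strategy(1000, 100))   # ship_smaller_to_larger
print(choose_join_strategy(1000, 500))   # default_join
```

Shipping the smaller multiset keeps network traffic proportional to the small side, which is why the size ratio is a natural trigger for that strategy.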

For multiple data conversion scripts, i.e. in the multi-script execution scenario, a cost-based graph optimization method is adopted, specifically:

2.1) constructing an execution plan graph: the execution plan trees corresponding to the data conversion scripts are merged by merging their common child nodes to obtain an execution plan graph, wherein the execution plan graph is likewise a graph structure formed by nodes, and common child nodes are nodes with the same operation semantics;

2.2) cost-based operation merging: the nodes in the execution plan graph are optimized and merged using a specially designed cost model, so that the space-time cost of execution is reduced.

The invention aims to minimize the space-time cost, where the space-time cost is the sum of the execution time spent and the physical resources occupied.

In step 2.2) above, the invention implements a data operation merging method based on input sharing, following the idea of optimizing by merging operations, and measures the execution cost of a data operation by establishing a cost model.

Therefore, in the execution plan graph, for a group of data operations sharing the same input, the invention compares the cost of independent execution with the cost of merged execution to judge whether merging the data operations that share the input improves execution efficiency.

2.2.1) establishing the following cost model for independent data operations, where independent data operations are data operations at mutually independent nodes that share the input from the same child node, i.e. the nodes of the independent data operations are all connected to the same child node and have no connection with each other;

for example, the data operations $J_1, J_2, \dots, J_n$ all share the same data input and are divided into $m$ mutually disjoint groups; each group $G_i$ ($1 \le i \le m$) contains several operations that are merged into one operation for execution.

The cost model calculates the sum of the costs of several independent data operations and the cost of the data operation $J_*$ obtained by merging them, as follows:

for the $n$ independent data operations $J_1, J_2, \dots, J_n$, the sum of the costs of all data operations is calculated as follows:

wherein the terms are the read cost of data in distributed processing, the network transmission cost $C_t$, and the local read-write cost $C^l$; the size of the intermediate result of data operation $J_i$; and the number of external-sort merge passes of data operation $J_i$;

2.2.2) for the $n$ independent data operations $J_1, J_2, \dots, J_n$, the cost of the data operation $J_*$ obtained by merging them is calculated as:

wherein the terms are the size of the intermediate result of the merged data operation $J_*$ and the number of external-sort merge passes of the merged data operation $J_*$;

Merging all independent data operations into a single data operation is one grouping scheme; dividing the independent data operations into different groups and merging each group into one data operation yields further grouping schemes, where no node belongs to more than one group. That is, the independent data operations are partitioned in every possible way, each partition giving one set of merged data operations. The cost of each possible grouping scheme is computed with the cost model and a dynamic programming algorithm, and the grouping scheme with the lowest cost is selected as the optimal scheme.

The invention solves the optimal grouping problem by combining the cost model with a dynamic programming algorithm; in this way the data operation grouping scheme with the minimum execution cost is obtained, ensuring that the total cost of executing the grouped and merged operations is minimal.
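The grouping search can be illustrated with a dynamic program over subsets of operations. The sketch below is not the patent's algorithm or cost formula, neither of which is reproduced in this text: `best_grouping` is one possible exact formulation, and `toy_cost` is a hypothetical stand-in that merely combines the quantities named above (a read term for the shared input, a transmission term and a local sort term that grow with the intermediate result size and the number of sort passes). All constants are invented for the example.

```python
from functools import lru_cache
from typing import Callable


def best_grouping(n: int, group_cost: Callable[[tuple[int, ...]], float]):
    """Return (minimum total cost, groups) over all partitions of operations 0..n-1."""

    @lru_cache(maxsize=None)
    def solve(mask: int):
        if mask == 0:
            return 0.0, ()
        lowest = mask & -mask            # this operation must belong to the next group
        rest = mask ^ lowest
        best = None
        sub = rest
        while True:                      # enumerate every subset of the other operations
            group_mask = sub | lowest
            members = tuple(i for i in range(n) if group_mask >> i & 1)
            tail_cost, tail_groups = solve(mask ^ group_mask)
            total = group_cost(members) + tail_cost
            if best is None or total < best[0]:
                best = (total, (members,) + tail_groups)
            if sub == 0:
                break
            sub = (sub - 1) & rest
        return best

    return solve((1 << n) - 1)


# --- Hypothetical toy cost model (assumed values, not from the patent) ---
SIZES = [5.0, 5.0, 40.0, 40.0]   # intermediate-result size of each operation
INPUT_SIZE = 50.0                # size of the shared input
SORT_BUFFER = 64.0               # above this size the sort needs an extra pass


def sort_passes(size: float) -> int:
    passes = 1
    while size > SORT_BUFFER:
        size /= SORT_BUFFER
        passes += 1
    return passes


def toy_cost(members: tuple[int, ...]) -> float:
    d = sum(SIZES[i] for i in members)
    read = 1.0 * INPUT_SIZE              # one scan of the shared input per group
    transfer = 2.0 * d                   # network transmission of the intermediate result
    local = 2.0 * d * sort_passes(d)     # local read-write during sorting
    return read + transfer + local


total, groups = best_grouping(len(SIZES), toy_cost)
print(total, groups)
```

With these invented numbers, running every operation separately costs 560 and merging all four costs 590, while the best partition costs 460: merging saves repeated scans of the shared input, but an overly large merged intermediate result pays for an extra sort pass.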

The invention has the beneficial effects that:

The method adopts a corresponding optimization method for each of the two scenarios, single data conversion script execution and multi-data conversion script execution, and finally turns the text data conversion script into program logic that processes massive text data efficiently and scalably, and executes it.

The method designed by the invention can be applied to processing mass text data in the data preparation stage, and can effectively reduce the time-space cost of the text data conversion script in the execution process and improve the efficiency of the data preparation stage by applying the execution optimization method facing the text data conversion script.

Drawings

FIG. 1 is a flow chart of the steps performed by the present invention.

FIG. 2 is a schematic diagram of an execution plan tree.

FIG. 3 is a schematic diagram of an execution plan tree conversion to big data processing jobs.

Detailed Description

The technical solution of the present invention will now be further explained with reference to specific embodiments and examples.

Referring to fig. 1, the specific implementation process and the working principle of the present invention are as follows:

Step 1: parsing the text data conversion script to generate an execution plan tree.

The execution plan tree may be abstractly represented as a tree structure composed of data operations as nodes, which corresponds to the data conversion flow described by the text data conversion script. In the execution plan tree, the data input operation is used as a leaf node, the data output operation is used as a root node of the tree, and the data conversion operation is used as an internal node in the tree.

The parent-child relationship of the node represents the dependency relationship between the data operations corresponding to the node due to the input and output of data, that is, the output data generated by the data operation corresponding to the child node is used as the input data of the data operation corresponding to the parent node. FIG. 2 is an execution plan tree containing 11 data operations.

After the execution plan tree is generated, the legality and validity of the nodes in the execution plan tree need to be checked.

Step 2: according to the execution scenario of the text data conversion script, execution is divided into single data conversion script execution and multi-data conversion script execution, and each is optimized separately.

1) In the single data conversion script execution scenario, a step-by-step optimization method oriented to the execution plan tree is adopted. The step-by-step optimization method oriented to the execution plan tree comprises the following three steps:

1.1) operation push-down optimization: without changing the result of the data conversion, the space-time cost of execution is reduced by changing the execution order of the nodes in the execution plan tree. Specifically: if the tuples in two multisets have repeated values under a certain attribute, a screening (filter) operation on that attribute that would be performed after the connection (join) operation on the multisets is moved before the connection operation, so that the execution order of the screening operation relative to the connection operation is changed.

1.2) operation combination: the space-time cost of execution is reduced by merging two adjacent nodes in the execution plan tree. Specifically: if two adjacent nodes both operate on rows in the multiset, the two adjacent nodes are merged, where a node is the node corresponding to a data operation in the execution plan tree.
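Steps 1.1) and 1.2) can be checked on a small in-memory example. The sketch below is illustrative only: multisets are modeled as lists of dictionaries, and the helper names `join`, `select` and `fuse` as well as the sample data are hypothetical. It shows that screening before the connection gives the same result as screening after it, and that two adjacent per-tuple operations can be merged into a single pass.

```python
def join(left, right, key):
    """Naive equi-join of two lists of dict rows on `key`."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def select(rows, predicate):
    """Screening operation: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def fuse(first, second):
    """Merge two adjacent per-tuple operations into one operation."""
    return lambda row: second(first(row))


users = [{"uid": 1, "city": "Hangzhou"}, {"uid": 2, "city": "Ningbo"},
         {"uid": 3, "city": "Hangzhou"}]
logs = [{"uid": 1, "msg": " login"}, {"uid": 2, "msg": "login "},
        {"uid": 2, "msg": " logout "}, {"uid": 3, "msg": "login"}]


def wanted(row):
    return row["uid"] == 2       # screening on the join attribute


# 1.1) push-down: screen after the join vs. screen both inputs before the join.
late = select(join(users, logs, "uid"), wanted)
early = join(select(users, wanted), select(logs, wanted), "uid")
print(late == early)                       # True
print(len(join(users, logs, "uid")))       # 4 joined rows before late screening


# 1.2) merging: two passes over the data vs. one fused pass.
def trim(row):
    return {**row, "msg": row["msg"].strip()}


def upper(row):
    return {**row, "msg": row["msg"].upper()}


two_passes = [upper(r) for r in [trim(r) for r in logs]]
one_pass = [fuse(trim, upper)(r) for r in logs]
print(two_passes == one_pass)              # True
```

In the pushed-down plan the join only sees one user row and two log rows, so the intermediate result never grows beyond the rows that survive the screening.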

2) In the multi-data conversion script execution scenario, a cost-based graph optimization method is adopted. The cost-based graph optimization method consists of two steps:

2.1) constructing an execution plan graph: the execution plan trees corresponding to the data conversion scripts are merged by merging their common child nodes to obtain an execution plan graph, wherein the execution plan graph is likewise a graph structure formed by nodes, and common child nodes are nodes with the same operation semantics.

Based on the idea of data operation lineage, data operations with the same operation semantics are identified by computing hash values of the data operations, thereby eliminating duplicate operation logic. After the execution plan graph is built, it can be further optimized using the optimization method for a single data conversion script.

Starting from the output nodes of the execution plan graph (i.e. the nodes with out-degree 0), the nodes are traversed and merged in turn, the merged nodes replace the corresponding nodes of the original execution plan graph, and the further optimized nodes are finally obtained.
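Hash-based identification of common child nodes can be sketched as follows. This is a minimal illustration under assumptions: "same operation semantics" is approximated here by equal operation name, equal parameters and equal (ordered) inputs, which is only one possible reading; the names `Op` and `merge_trees`, the parameter strings and the file names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str                                  # operation kind, e.g. "filter"
    params: str = ""                           # serialized operation arguments
    children: list["Op"] = field(default_factory=list)


def merge_trees(roots: list[Op]) -> dict[str, tuple[str, str, tuple[str, ...]]]:
    """Merge plan trees into one plan graph keyed by hash.

    Each entry maps a node hash to (name, params, hashes of its inputs);
    sub-plans with identical hashes are stored once.
    """
    graph: dict[str, tuple[str, str, tuple[str, ...]]] = {}

    def visit(node: Op) -> str:
        child_keys = tuple(visit(child) for child in node.children)
        raw = repr((node.name, node.params, child_keys))
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        graph.setdefault(key, (node.name, node.params, child_keys))
        return key

    for root in roots:
        visit(root)
    return graph


def shared_prefix() -> Op:
    """Both hypothetical scripts read and split the same log file."""
    return Op("split", "sep=','", [Op("input", "logs.txt")])


script_a = Op("output", "a.csv", [Op("filter", "level == 'ERROR'", [shared_prefix()])])
script_b = Op("output", "b.csv", [Op("count", "by=host", [shared_prefix()])])

graph = merge_trees([script_a, script_b])
print(len(graph))   # 6 distinct operations instead of 8
```

Because a node's hash includes the hashes of its inputs, two nodes are merged only when their whole sub-plans match. Operations whose inputs are interchangeable would need their child hashes sorted first, which this sketch does not do.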

2.2) cost-based operation merging: the nodes in the execution plan graph are optimized and merged using a custom cost model, reducing the space-time cost of execution.

The cost model calculates the sum of the costs of several independent data operations and the cost of the data operation $J_*$ obtained by merging them, as follows:

wherein the terms are the read cost of data in distributed processing, the network transmission cost $C_t$, and the local read-write cost $C^l$; the size of the intermediate result of data operation $J_i$; and the number of external-sort merge passes of data operation $J_i$;

wherein the term is the size of the intermediate result of the merged data operation $J_*$.

Step 3: generating and running program processing logic for the big data platform according to the optimized execution plan tree/graph, thereby efficiently converting and processing the data on the big data platform.

Typically, an execution plan tree/graph is converted into a set of one or more big data processing jobs. Converting the execution plan tree in FIG. 3 yields a set of big data processing jobs. The execution plan tree contains 11 data operations, and the set of big data processing jobs contains a total of 8 jobs, where $J_1$ and $J_4$ contain the processing logic of data input operations and line slicing operations, and $J_8$ contains the processing logic of grouping aggregation operations and data output operations.

For a plurality of big data processing jobs generated by the execution plan tree, the jobs are topologically ordered according to the dependency relationship among different jobs to determine the execution order corresponding to the jobs.
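Ordering jobs by their dependencies can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later). The dependency edges below are invented for illustration, since FIG. 3 is not reproduced in this text; only the job count (8) and the roles of $J_1$, $J_4$ (input) and $J_8$ (output) follow the description.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each job maps to the set of jobs it must wait for.
dependencies = {
    "J2": {"J1"},
    "J3": {"J2"},
    "J5": {"J4"},
    "J6": {"J5"},
    "J7": {"J3", "J6"},
    "J8": {"J7"},
}

# static_order() yields every job after all of the jobs it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Jobs with no mutual dependency, such as the two input branches in this made-up example, may appear in either relative order and could also be submitted concurrently.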

Claims

1. An execution optimization method for a text data conversion script, characterized in that: for a text data conversion script executed through distributed processing over a network, the following steps are adopted:

(1) parsing the text data conversion script to generate an execution plan tree; using a tuple-based multiset as the data model of the text data, wherein the text data conversion script comprises data operations that modify and convert the structure and content of the multiset;

(3) generating a logic program for processing and running according to an execution plan result obtained after optimization, thereby efficiently converting and processing data on a big data platform;

the different execution scenes of the conversion script are divided into a single data conversion script and a multi-data conversion script;

aiming at a single data conversion script, the following step-by-step optimization method oriented to the execution plan tree is adopted for processing, and the method specifically comprises the following steps:

1.1) operation push-down optimization: without changing the result of the data conversion, the space-time cost of execution is reduced by changing the execution order of the nodes in the execution plan tree, specifically: if the tuples in two multisets have repeated values under a certain attribute, a screening operation on that attribute that would be performed after the connection operation on the multisets is moved before the connection operation;

1.2) operation combination: the space-time cost of execution is reduced by merging two adjacent nodes in the execution plan tree, specifically: if two adjacent nodes both operate on rows in the multiset, the two adjacent nodes are merged;

if two multisets need to be connected and their characteristics differ by more than a factor of 3, the multiset with the smaller characteristic is transmitted to the node of the distributed network where the multiset with the larger characteristic is located;

aiming at the multi-data conversion script, a cost-based graph optimization method is adopted for processing, and the method specifically comprises the following steps:

2.1) constructing an execution plan graph: merging a plurality of execution plan trees corresponding to a plurality of data conversion scripts in a mode of merging common sub-nodes to obtain an execution plan graph, wherein the common sub-nodes refer to nodes with the same operation semantics;

2. The execution optimization method according to claim 1, characterized in that: in the execution plan tree of step (1), each node is a data operation, the parent-child relationship of the nodes in the tree represents the execution order of the data operations, and the data operation represented by a parent node can be executed only after the data operations represented by all of its child nodes have completed.

3. The execution optimization method according to claim 1, characterized in that: step 2.2) is specifically as follows:

2.2.1) establishing the following cost model for independent data operations, where independent data operations are data operations at mutually independent nodes that share the input from the same child node;

The calculation is as follows:

wherein the terms include the size of the intermediate result of data operation $J_i$ and the number of external-sort merge passes of data operation $J_i$;

wherein the term is the size of the intermediate result of the merged data operation $J_*$;

merging all independent data operations into a single data operation is one grouping scheme, and dividing the independent data operations into different groups and merging each group into one data operation yields further grouping schemes; the cost of each possible grouping scheme is computed with the cost model and a dynamic programming algorithm, and the data operation grouping scheme with the lowest cost is selected.