CN108763489B - Method for optimizing Spark SQL execution workflow - Google Patents


Info

Publication number
CN108763489B
CN108763489B (application CN201810536078.3A)
Authority
CN
China
Prior art keywords
cost
stage
data
task
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810536078.3A
Other languages
Chinese (zh)
Other versions
CN108763489A (en)
Inventor
宋爱波
万雨桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201810536078.3A
Publication of CN108763489A
Application granted
Publication of CN108763489B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method for optimizing the Spark SQL execution workflow. The method comprises step S1: constructing a cost model for Spark task execution, wherein the cost model comprises the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data; summing these three costs gives the total cost of executing a task. Step S2: a cost-based correlation merging algorithm is provided, whose idea is to calculate both the sum of the costs of executing two tasks with an input-data correlation separately and the cost of executing them after they are merged into one task, and to decide whether to merge the two tasks by comparing these two values. The invention solves the problem of repeatedly reading the same input data in Spark SQL queries by means of the cost-based correlation merging algorithm.

Description

Method for optimizing Spark SQL execution workflow
Technical field:
the invention relates to a method for optimizing Spark SQL execution workflow, and belongs to the technical field of computer software.
Background art:
At present, the Catalyst framework is used to optimize workflows on the Spark SQL platform. The framework processes query statements with a method similar to that of a relational database: the SQL statement is first lexically and syntactically analyzed (Parse) to form a logical execution plan tree, then the tree is analyzed, optimized, and bound according to certain optimization rules, with different operations applied to different node types in the tree through pattern matching. Optimization of the logical execution plan tree is mainly algebraic and includes rules such as predicate push-down and column pruning. Predicate push-down, i.e. pushing the selection and projection operations contained in the query down below the join operation, reduces the size of the relations in the horizontal and vertical directions respectively, thereby reducing the cost incurred by the join operation. Column pruning means that, if the data table is stored by columns, Spark SQL reads only the columns required by the query statement when reading the input data, which reduces the disk I/O cost to some extent. However, the existing Catalyst framework only performs algebraic optimization of the query statement; there is no suitable set of optimization rules for the Spark workflow translated from an SQL query. Spark SQL must repeatedly read any table that appears more than once in a single query statement submitted by a user, which causes redundant read cost and reduces the execution efficiency of the Spark program to a certain extent. For example, when a user submits the TPC-H Q17 query, the query statement is as shown in FIG. 2. The workflow of this query executed under Spark is shown in FIG. 3; it can be seen that Stage_1 and Stage_3 both read the lineitem table and therefore have an input-data correlation, so the two can be merged. The query tree after merging is shown in FIG. 4.
By merging them into Stage_{1+3}, the cost of reading the lineitem table can be halved. However, because the output data of Stage_{1+3} is the union of the outputs of the original Stage_1 and Stage_3, the cost of writing intermediate data after merging is unchanged.
However, merging does not necessarily save query time. For example, before merging, Stage_2 reads only the output of Stage_1 and Stage_5 reads only the output of Stage_3. After merging, Stage_2 and Stage_5 must both read the intermediate data produced by Stage_{1+3}, which incurs an additional read cost. Meanwhile, because Spark's internal mechanism sorts the intermediate data generated in the Shuffle stage, the cost of sorting the intermediate data generated by Stage_1 and Stage_3 together is greater than the sum of the costs of sorting the two Stages separately, which causes an additional sorting cost.
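The repeated-read pattern can be reproduced with a short PySpark sketch; the file paths are hypothetical and the query is only Q17-like, but a query that references lineitem both in the outer block and in a correlated subquery is translated into a workflow containing two scan stages over the same table, which explain() makes visible:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repeated-read-demo").getOrCreate()

# Hypothetical paths; the table and column names follow the TPC-H naming used in the text.
spark.read.parquet("/data/tpch/lineitem").createOrReplaceTempView("lineitem")
spark.read.parquet("/data/tpch/part").createOrReplaceTempView("part")

# A Q17-style query: lineitem appears in the outer join and again in the correlated
# subquery, so the translated workflow contains two scan stages over the same table.
q17_like = """
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM lineitem JOIN part ON p_partkey = l_partkey
WHERE p_brand = 'Brand#23'
  AND l_quantity < (SELECT 0.2 * AVG(l_quantity)
                    FROM lineitem
                    WHERE l_partkey = p_partkey)
"""
spark.sql(q17_like).explain(True)   # the printed plan shows two reads of lineitem
```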
Disclosure of Invention
The invention aims to provide a method for optimizing Spark SQL execution workflow, which solves the problem of repeatedly reading the same input data in Spark SQL query through a cost-based correlation merging algorithm.
The above purpose is realized by the following technical scheme:
a method of optimizing Spark SQL execution workflow, the method comprising the steps of:
step S1: constructing a cost model for Spark task execution, wherein the cost model comprises the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data, and summing these three costs to obtain the total cost of task execution;
step S2: providing a cost-based correlation merging algorithm, whose idea is to calculate the sum of the costs of executing two tasks with an input-data correlation separately and the cost of executing them after they are merged into one task, and to decide whether to merge the two tasks by comparing the two values.
In the method for optimizing the Spark SQL execution workflow, the specific method for calculating the Spark task execution cost in step S1 is as follows: the parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree; each node in the tree corresponds to one task in Spark, and the specific operation in the query corresponding to each task can be located precisely by analyzing the tree structure, so that the execution cost of the task can be calculated.
In the method for optimizing the Spark SQL execution workflow, step S2 decides whether to merge tasks with an input correlation by using the cost model to compare the additional sorting cost and the additional read cost of subsequent tasks incurred by merging against the cost saved by reading the input data only once.
In the method for optimizing the Spark SQL execution workflow, the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data are summed to obtain the total cost of task execution:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
In formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading the input data;
C_sort(Stage) is the cost of sorting the intermediate data;
C_write(Stage) is the cost of writing the output data;
Since C_read(Stage) and C_write(Stage) are both I/O costs, each such term of formula (1) is expressed as C_I/O = C_0·T + C_1·x, where C_I/O is the sum of the cost of reading input data and the cost of writing output data, C_0 is the seek time plus rotational delay, T is the number of I/O operations, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written.
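As a minimal illustration (not part of the patent text), the shared I/O cost form above reduces to a one-line helper:

```python
def c_io(c0: float, t: int, c1: float, x: float) -> float:
    """Generic I/O cost C_I/O = C0*T + C1*x.

    c0 -- seek time plus rotational delay per I/O operation (s)
    t  -- number of I/O operations
    c1 -- time to transfer 1 MB of data (s/MB)
    x  -- amount of data read or written (MB)
    """
    return c0 * t + c1 * x
```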
The method for optimizing the Spark SQL execution workflow is characterized in that the cost C_read(Stage) of reading input data is given by formula (2):
[formula (2): image in original publication]
In formula (2):
C_0 is the seek time plus rotational delay, T is the number of I/O operations, t_r is the time required to read 1 MB of data locally, α is the fraction of non-local data in the total data, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the Spark task buffer size.
The method for optimizing the Spark SQL execution workflow is characterized in that the cost C_write(Stage) of writing output data is calculated as shown in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
where C_0 is the seek time plus rotational delay, B is the Spark task buffer size, |D_out| is the size of the Spark Stage output data, and t_w is the time required to write 1 MB of data locally.
The method for optimizing the Spark SQL execution workflow is characterized in that the cost C_sort(Stage) of sorting the intermediate data is calculated as shown in formula (4):
[formula (4): image in original publication]
where C_0 is the seek time plus rotational delay, B is the Spark task buffer size, |D_out| is the size of the Spark Stage output data, m is the number of tasks contained in each Stage, and t_r is the time required to read 1 MB of data locally.
In the method for optimizing the Spark SQL execution workflow, the total cost C(Stage) of task execution is calculated as shown in formula (5):
[formula (5): image in original publication]
where P is the number of sort passes, defined by
[formula for P: image in original publication]
the method for optimizing Spark SQL execution workflow, which is described in step S2, is used for two tasks Stage with input data correlationiAnd StagejAnd the cost of execution after they are combined into one task is calculated as shown in formula (6):
Figure BDA0001675284090000035
in the formula C (Stage)i+j) For two tasks Stage with input data dependencyiAnd StagejCost, t, of execution after merging into one taskrTime required for local reading of 1MB of data, twTime required for local writing of 1MB of data, tbThe time required for the network to transmit 1MB of data, B is the Spark task buffer size,
Figure BDA0001675284090000036
is StageiAnd StagejThe sum of the sizes of the output data.
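Expressed as code, the decision rule of step S2 compares the two quantities directly; a minimal sketch, assuming the three cost values are obtained from the cost model above (whose closed forms appear as formula images in the original):

```python
def should_merge(cost_i: float, cost_j: float, cost_merged: float) -> bool:
    """Merge two input-correlated Stages only if it is cheaper overall.

    cost_i, cost_j -- C(Stage_i), C(Stage_j) when executed separately
    cost_merged    -- C(Stage_{i+j}) after merging, per formula (6)
    """
    return cost_i + cost_j > cost_merged
```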
Advantageous effects:
compared with the existing Spark SQL workflow, the method has the obvious advantage that the disk I/O cost of the Spark SQL task for reading the input data can be obviously reduced, so that the execution efficiency of the Spark SQL program is further improved.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a TPC-H Q17 query statement;
FIG. 3 is a flow chart of the execution of the Q17 query under Spark;
FIG. 4 is a flow chart of the execution of the Q17 query after merging;
FIG. 5 shows the result of a cost-based correlation merging algorithm test experiment;
FIG. 6 is a comparison of disk I/O;
FIG. 7 is a graph comparing CPU usage.
Detailed Description
The present invention will be further illustrated below with reference to specific embodiments, which are to be understood as merely illustrative and not limitative of the scope of the present invention.
Spark task execution model:
The cost generated by a Spark task during execution arises mainly in three aspects: reading input data, merging and sorting intermediate data, and writing intermediate data, so the cost model can be expressed as:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
Since C_read(Stage) and C_write(Stage) are both I/O costs, each is calculated as C_I/O = C_0·T + C_1·x. The definition of each parameter is shown in Table 1:
TABLE 1 Parameters in the cost model
C_0: seek time plus rotational delay per I/O operation
C_1: time required to transfer 1 MB of data
T: number of I/O operations
x: amount of data read or written
t_r: time required to read 1 MB of data locally
t_w: time required to write 1 MB of data locally
t_b: time required to transmit 1 MB of data over the network
α: fraction of non-local data in the total data
|D_in|: size of the Spark Stage input data
|D_out|: size of the Spark Stage output data
B: Spark task buffer size
m: number of tasks contained in each Stage
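For the sketches that follow, the parameters of Table 1 can be collected into a single structure; this is an illustrative Python grouping, not part of the patent, and the text fixes no concrete values except α = 0.3:

```python
from dataclasses import dataclass

@dataclass
class CostParams:
    c0: float      # seek time plus rotational delay per I/O operation (s)
    t_r: float     # time to read 1 MB locally (s)
    t_w: float     # time to write 1 MB locally (s)
    t_b: float     # time to transfer 1 MB over the network (s)
    alpha: float   # fraction of non-local data (0.3 under default HDFS replication)
    B: float       # Spark task buffer size (MB)
    m: int         # number of tasks per Stage
```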
The cost of the read phase of a Stage, i.e. C_read(Stage), is calculated as follows:
C_1·x = |D_in|·t_r + α·|D_in|·t_b = (t_r + α·t_b)·|D_in|
|D_in| is determined by the size of the source input data, or by the size of the output data of other Stages. α is taken to be 0.3, which is mainly determined by the default three-copy storage strategy of HDFS: one copy is stored locally, one in the same rack, and one in a remote rack. The number of I/O operations T is determined by the specific Stage. For an S-Stage, if the source input data is read, T = 1 because the source data is stored contiguously. If the output of other Stages is read, T is determined by the number of intermediate data files generated by the previous Stage. Because a Spark task first writes intermediate data into a buffer and spills it to disk as a file whenever the buffer becomes full, if the previous Stage generates |D_out| of data it produces
⌈|D_out|/B⌉
intermediate files, where B is the Spark task buffer size, so the number of I/O operations is T = ⌈|D_out|/B⌉.
For a J-Stage, since a join operation is performed on two tables, the input of the Stage must be the output of two other Stages. Suppose the two preceding Stages together produce |D_out| of output data; if the two-way merge external sort join algorithm is adopted, the data must be scanned
[number of scans: image in original publication]
times, i.e. the number of I/O operations is
[image in original publication]
In summary, the cost of the Stage read phase is calculated as follows:
[formula (2): image in original publication]
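A minimal sketch of the read-phase cost for the two S-Stage cases spelled out above, reusing the CostParams grouping; the published formula (2), including the J-Stage branch, appears only as an image, so this covers just what the text states:

```python
import math

def c_read_source(p: CostParams, d_in: float) -> float:
    """S-Stage reading contiguously stored source data: T = 1."""
    return p.c0 * 1 + (p.t_r + p.alpha * p.t_b) * d_in

def c_read_intermediate(p: CostParams, d_in: float, d_out_prev: float) -> float:
    """S-Stage reading a previous Stage's output: T = ceil(|D_out_prev| / B) spill files."""
    t = math.ceil(d_out_prev / p.B)
    return p.c0 * t + (p.t_r + p.alpha * p.t_b) * d_in
```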
The cost of the write phase of a Stage, i.e. C_write(Stage): because the intermediate data are all spilled to the local disk, this process does not involve any network transfer cost, and the calculation is as follows:
C_1·x = |D_out|·t_w
|D_out| can be calculated from the intermediate data caching mechanism of the Spark Shuffle. The number of I/O operations T is determined by how many intermediate files are generated, and the number of intermediate files is
⌈|D_out|/B⌉.
The cost of the Stage write phase is therefore calculated as follows:
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|
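Assembled in the same style, a sketch of the write-phase cost reconstructed from the text, reusing CostParams (the original gives formula (3) only as an image):

```python
import math

def c_write(p: CostParams, d_out: float) -> float:
    """Write-phase cost: T = ceil(|D_out|/B) spill files, C1*x = t_w*|D_out|."""
    return p.c0 * math.ceil(d_out / p.B) + p.t_w * d_out
```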
The cost of the sorting phase of a Stage, i.e. C_sort(Stage): because each task in a Stage must sort and merge all of the intermediate data files it generates, and the Stage as a whole produces
⌈|D_out|/B⌉
intermediate files, each task can be considered to have generated
⌈|D_out|/(m·B)⌉
files, where m is the number of tasks contained in each Stage. C_sort(Stage) is then calculated as follows:
[formula (4): image in original publication]
Let the number of sort passes be
[formula for P: image in original publication]
Then the cost of executing one Stage in Spark is
[formula (5): image in original publication]
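Since formulas (4) and (5) are reproduced only as images, a sketch can still assemble the total Stage cost per the decomposition in formula (1), taking the sort-phase cost as an externally supplied value and reusing the read and write sketches above:

```python
def c_stage(p: CostParams, d_in: float, d_out: float, sort_cost: float,
            reads_source: bool = True, d_out_prev: float = 0.0) -> float:
    """Total Stage cost per formula (1): C_read + C_sort + C_write.

    sort_cost is supplied by the caller because the closed form of C_sort
    (formulas (4)-(5)) appears only as an image in the original.
    """
    read = c_read_source(p, d_in) if reads_source else c_read_intermediate(p, d_in, d_out_prev)
    return read + sort_cost + c_write(p, d_out)
```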
Cost-based correlation merging algorithm:
The cost before and after merging is then calculated according to the cost model to judge whether to merge by input correlation. Suppose Stage_i and Stage_j read the same input data F, and assume that the intermediate data they output do not overlap, i.e.
[image in original publication]
According to the formula for P given above, the numbers of sort passes produced by Stage_i, Stage_j, and the merged Stage_{i+j} are denoted P_i, P_j, and P_G respectively:
[formula for P_i: image in original publication]
[formula for P_j: image in original publication]
[formula for P_G: image in original publication]
The cost formula for the merged Stage is as follows:
[image in original publication]
Let the benefit produced by merging be the variable Earn. Before merging, Stage_a reads only the intermediate data output by Stage_i, and Stage_b reads only the intermediate data output by Stage_j; after merging, Stage_a and Stage_b must both read the intermediate data produced by Stage_{i+j}, which incurs an additional read cost. Therefore, the extra sorting cost and the extra read cost are subtracted from the cost saved by reading the input data only once, giving the following formula for Earn:
[formula for Earn: image in original publication]
if Earn >0, indicating that the merging reduced disk read cost is greater than the generated additional sorting cost, Stage with input dependency can be merged. If Earn <0, Stage with input correlation is not merged. Therefore, whether to combine according to the input correlation can be checked only by checking whether the Earn value is greater than 0.
To verify the feasibility of the invention, several queries were selected from those provided by TPC-H to form three query tasks M1, M2 and M3. The M1 task contains queries Q4 and Q14, which read larger input data but generate little intermediate data. The M2 task contains queries Q2 and Q17, which read less input data but generate more intermediate data. The M3 task contains queries Q5 and Q9, which read larger input data and generate more intermediate data. The tasks were programmed and submitted to both the SSO system and the existing Spark SQL system, the query execution times were recorded, and the experimental results are shown in FIG. 5:
In the figure, the horizontal axis is the query execution time in seconds; the corresponding query tasks are listed on the vertical axis. Within each group of test results, the upper bar is the execution time on the SSO system and the lower bar is the execution time on the existing Spark SQL system; a smaller value indicates better performance.
As can be seen from FIG. 5, the SSO system has the most significant optimization effect on the M1 task, while the execution time of the M2 task on the SSO system is the same as on the existing Spark system. This is because merging the two queries in the M1 task reduces the disk I/O cost of reading the input data many times, and both queries generate very little intermediate data, so even the additional sorting cost of merging is small and the optimization effect is most obvious. The M2 task is exactly the opposite: the Q2 query reads 0.92 GB of input data and outputs 2.3 GB of intermediate data, while the Q17 query reads 16.7 GB of input data and outputs 7.1 GB of intermediate data. Calculated with the cost model above, both Q2 and Q17 read the part table, which is 232 MB (with the TPC-H benchmark data generated at a scale of 10 GB), so merging can save one read of the part table, i.e. 2.9 seconds of disk I/O, whereas the additional sorting cost produced is 18.2 seconds. It can be seen that for queries Q2 and Q17, sharing the same input data (the part table) produces a sorting cost higher than the saved cost of reading the input data, so the SSO system does not choose to merge the two queries in the M2 task. To analyze the performance advantage of the SSO system in detail, the M1 task was submitted to both the SSO system and the native Spark SQL system, and the disk I/O and CPU usage on the master node were monitored; the results are shown in FIG. 6 and FIG. 7:
It can be seen that M1 completes the query in 23 seconds and 77 seconds on the two systems respectively, so the advantage of the SSO system is obvious. Combined with the disk I/O and CPU usage shown in the figures, it can be seen that the SSO system significantly reduces the disk I/O cost at the initial stage of program execution by merging the two jobs with an input correlation. However, because the intermediate data generated by the Q4 and Q14 queries have no overlapping parts, merging does not reduce the disk I/O cost of reading and writing intermediate data, and from then on the disk I/O patterns of the two systems tend to be consistent. After merging, the intermediate data generated by the two queries are sorted together in a single sorting operation, which costs more than sorting the two queries separately, so in the SSO system the CPU usage in the Shuffle stage is higher than in the native Spark system. For data-intensive applications such as SQL queries, disk I/O is the most consumed resource, while other resources in the cluster tend to be under-used. The SSO system provided by the invention reduces the disk I/O cost by making better use of memory and CPU resources, and the experiments prove that the method can effectively reduce the execution time of SQL query applications.
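As a worked check of the merge criterion against the M2 numbers quoted above (2.9 seconds of disk I/O saved on the shared part table versus 18.2 seconds of additional sorting; any extra read cost only widens the gap), reusing the merge_if_beneficial sketch:

```python
# Earn = 2.9 - 18.2 - extra_read_cost < 0 for any extra_read_cost >= 0,
# so the SSO system correctly declines to merge Q2 and Q17 in the M2 task.
print(merge_if_beneficial(saved_read_cost=2.9, extra_sort_cost=18.2, extra_read_cost=0.0))  # False
```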
It should be noted that the above embodiments are only examples given for clarity of illustration and are not limiting; it is not necessary to exhaustively list all embodiments here. All components not specified in this embodiment can be realized with the prior art. Various modifications and adaptations that are apparent to those skilled in the art and do not depart from the principles of the invention are intended to fall within the scope of the invention.

Claims (1)

1. A method for optimizing Spark SQL execution workflow, which is characterized by comprising the following steps:
step S1: a parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree; each node in the logical execution plan tree corresponds to one task in Spark, and the specific operation in the query corresponding to each task can be located precisely by analyzing the logical execution plan tree, so that the execution cost of the task can be calculated; a cost model for Spark task execution is constructed, wherein the cost model comprises the cost of reading input data, the cost of sorting intermediate data and the cost of writing output data, and these three costs are summed to obtain the total cost of task execution, namely:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
in formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading the input data;
C_sort(Stage) is the cost of sorting the intermediate data;
C_write(Stage) is the cost of writing the output data;
since C_read(Stage) and C_write(Stage) are both I/O costs, each such term of formula (1) is expressed as C_I/O = C_0·T + C_1·x, where C_I/O is the sum of the cost of reading input data and the cost of writing output data, C_0 is the seek time plus rotational delay, T is the number of I/O operations, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written;
the cost C_read(Stage) of reading input data is given by formula (2):
[formula (2): image in original publication]
in the formula: t_r is the time required to read 1 MB of data locally, α is the fraction of non-local data in the total data, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the Spark task buffer size;
the cost C_write(Stage) of writing output data is calculated as shown in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
in the formula: t_w is the time required to write 1 MB of data locally;
the cost C_sort(Stage) of sorting the intermediate data is calculated as shown in formula (4):
[formula (4): image in original publication]
wherein m is the number of tasks contained in each Stage;
the total cost C(Stage) of task execution is calculated as shown in formula (5):
[formula (5): image in original publication]
wherein P is the number of sort passes, defined by
[formula for P: image in original publication]
step S2: whether tasks with an input correlation are merged is decided by using the cost model to compare the additional sorting cost and the additional read cost of subsequent tasks incurred by merging against the cost saved by reading the input data only once; the sum of the costs of executing the two tasks separately and the cost of executing them after they are merged into one task are calculated, the latter as shown in formula (6):
[formula (6): image in original publication]
in the formula, Stage_i and Stage_j are two tasks with an input-data correlation, C(Stage_{i+j}) is the cost of executing Stage_i and Stage_j after they are merged into one task, and |D_out_{i+j}| is the sum of the sizes of the output data of Stage_i and Stage_j.
CN201810536078.3A 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow Active CN108763489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Publications (2)

Publication Number Publication Date
CN108763489A CN108763489A (en) 2018-11-06
CN108763489B (en) 2022-02-15

Family

ID=64003941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810536078.3A Active CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Country Status (1)

Country Link
CN (1) CN108763489B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347122A (en) * 2020-11-10 2021-02-09 西安宇视信息科技有限公司 SQL workflow processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740249A (en) * 2014-12-08 2016-07-06 TCL Corporation Processing method and system during big data operation parallel scheduling process
CN107025273A (en) * 2017-03-17 2017-08-08 China Southern Power Grid Science Research Institute Co., Ltd. The optimization method and device of a kind of data query

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286748A1 (en) * 2014-04-08 2015-10-08 RedPoint Global Inc. Data Transformation System and Method
US20170024432A1 (en) * 2015-07-24 2017-01-26 International Business Machines Corporation Generating sql queries from declarative queries for semi-structured data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740249A (en) * 2014-12-08 2016-07-06 TCL Corporation Processing method and system during big data operation parallel scheduling process
CN107025273A (en) * 2017-03-17 2017-08-08 China Southern Power Grid Science Research Institute Co., Ltd. The optimization method and device of a kind of data query

Also Published As

Publication number Publication date
CN108763489A (en) 2018-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant