CN108763489B - Method for optimizing Spark SQL execution workflow - Google Patents


Info

Publication number
CN108763489B
CN108763489B (application CN201810536078.3A)
Authority
CN
China
Prior art keywords
cost
stage
data
task
reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810536078.3A
Other languages
Chinese (zh)
Other versions
CN108763489A (en)
Inventor
宋爱波
万雨桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201810536078.3A
Publication of CN108763489A
Application granted
Publication of CN108763489B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a method for optimizing the Spark SQL execution workflow. The method comprises step S1: constructing a cost model for Spark task execution, wherein the cost model comprises the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data; summing these three costs gives the total cost of executing a task. Step S2: a cost-based correlation merging algorithm is provided, whose idea is to calculate both the sum of the costs of executing two tasks with an input-data correlation separately and the cost of executing them after they are merged into one task, and to decide whether to merge the two tasks by comparing these two values. The invention solves the problem of repeatedly reading the same input data in Spark SQL queries by means of the cost-based correlation merging algorithm.

Description

Method for optimizing Spark SQL execution workflow
Technical field:
the invention relates to a method for optimizing Spark SQL execution workflow, and belongs to the technical field of computer software.
Background art:
At present, the Catalyst framework is used to optimize workflows on the Spark SQL platform. The framework processes query statements with a method similar to that of a relational database: the SQL statement is first lexically and syntactically analyzed (Parse) to form a logical execution plan tree, then the tree is analyzed, optimized, and bound according to certain optimization rules, with different operations applied to different node types in the tree through pattern matching. Optimization of the logical execution plan tree is mainly algebraic and includes rules such as predicate push-down and column pruning. Predicate push-down, i.e. pushing the selection and projection operations contained in the query down below the join operation, reduces the size of the relations in the horizontal and vertical directions respectively, thereby reducing the cost incurred by the join operation. Column pruning means that, if the data table is stored by columns, Spark SQL reads only the columns required by the query statement when reading the input data, which reduces the disk I/O cost to some extent. However, the existing Catalyst framework only performs algebraic optimization of the query statement; there is no suitable set of optimization rules for the Spark workflow translated from an SQL query. Spark SQL must repeatedly read any table that appears more than once in a single query statement submitted by a user, which causes redundant read cost and reduces the execution efficiency of the Spark program to a certain extent. For example, when a user submits the TPC-H Q17 query, the query statement is as shown in FIG. 2. The workflow of this query executed under Spark is shown in FIG. 3; it can be seen that Stage_1 and Stage_3 both read the lineitem table and therefore have an input-data correlation, so the two can be merged. The query tree after merging is shown in FIG. 4.
By merging them into Stage_{1+3}, the cost of reading the lineitem table can be halved. However, because the output data of Stage_{1+3} is the union of the outputs of the original Stage_1 and Stage_3, the cost of writing intermediate data after merging is unchanged.
However, merging does not necessarily save query time. For example, before merging, Stage_2 reads only the output of Stage_1 and Stage_5 reads only the output of Stage_3. After merging, Stage_2 and Stage_5 must both read the intermediate data produced by Stage_{1+3}, which incurs an additional read cost. Meanwhile, because Spark's internal mechanism sorts the intermediate data generated in the Shuffle stage, the cost of sorting the intermediate data generated by Stage_1 and Stage_3 together is greater than the sum of the costs of sorting the two Stages separately, which causes an additional sorting cost.
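The repeated-read pattern can be reproduced with a short PySpark sketch; the file paths are hypothetical and the query is only Q17-like, but a query that references lineitem both in the outer block and in a correlated subquery is translated into a workflow containing two scan stages over the same table, which explain() makes visible:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repeated-read-demo").getOrCreate()

# Hypothetical paths; the table and column names follow the TPC-H naming used in the text.
spark.read.parquet("/data/tpch/lineitem").createOrReplaceTempView("lineitem")
spark.read.parquet("/data/tpch/part").createOrReplaceTempView("part")

# A Q17-style query: lineitem appears in the outer join and again in the correlated
# subquery, so the translated workflow contains two scan stages over the same table.
q17_like = """
SELECT SUM(l_extendedprice) / 7.0 AS avg_yearly
FROM lineitem JOIN part ON p_partkey = l_partkey
WHERE p_brand = 'Brand#23'
  AND l_quantity < (SELECT 0.2 * AVG(l_quantity)
                    FROM lineitem
                    WHERE l_partkey = p_partkey)
"""
spark.sql(q17_like).explain(True)   # the printed plan shows two reads of lineitem
```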
Disclosure of Invention
The invention aims to provide a method for optimizing Spark SQL execution workflow, which solves the problem of repeatedly reading the same input data in Spark SQL query through a cost-based correlation merging algorithm.
The above purpose is realized by the following technical scheme:
a method of optimizing Spark SQL execution workflow, the method comprising the steps of:
step S1: constructing a cost model for Spark task execution, wherein the cost model comprises the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data, and summing these three costs to obtain the total cost of task execution;
step S2: providing a cost-based correlation merging algorithm, whose idea is to calculate the sum of the costs of executing two tasks with an input-data correlation separately and the cost of executing them after they are merged into one task, and to decide whether to merge the two tasks by comparing the two values.
In the method for optimizing the Spark SQL execution workflow, the specific method for calculating the Spark task execution cost in step S1 is as follows: the parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree; each node in the tree corresponds to one task in Spark, and the specific operation in the query corresponding to each task can be located precisely by analyzing the tree structure, so that the execution cost of the task can be calculated.
In the method for optimizing the Spark SQL execution workflow, step S2 decides whether to merge tasks with an input correlation by using the cost model to compare the additional sorting cost and the additional read cost of subsequent tasks incurred by merging against the cost saved by reading the input data only once.
In the method for optimizing the Spark SQL execution workflow, the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data are summed to obtain the total cost of task execution:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
In formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading the input data;
C_sort(Stage) is the cost of sorting the intermediate data;
C_write(Stage) is the cost of writing the output data;
Since C_read(Stage) and C_write(Stage) are both I/O costs, each such term of formula (1) is expressed as C_I/O = C_0·T + C_1·x, where C_I/O is the sum of the cost of reading input data and the cost of writing output data, C_0 is the seek time plus rotational delay, T is the number of I/O operations, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written.
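As a minimal illustration (not part of the patent text), the shared I/O cost form above reduces to a one-line helper:

```python
def c_io(c0: float, t: int, c1: float, x: float) -> float:
    """Generic I/O cost C_I/O = C0*T + C1*x.

    c0 -- seek time plus rotational delay per I/O operation (s)
    t  -- number of I/O operations
    c1 -- time to transfer 1 MB of data (s/MB)
    x  -- amount of data read or written (MB)
    """
    return c0 * t + c1 * x
```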
The method for optimizing the Spark SQL execution workflow is characterized in that the cost C_read(Stage) of reading input data is given by formula (2):
[formula (2): image in original publication]
In formula (2):
C_0 is the seek time plus rotational delay, T is the number of I/O operations, t_r is the time required to read 1 MB of data locally, α is the fraction of non-local data in the total data, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the Spark task buffer size.
The method for optimizing the Spark SQL execution workflow is characterized in that the cost C_write(Stage) of writing output data is calculated as shown in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
where C_0 is the seek time plus rotational delay, B is the Spark task buffer size, |D_out| is the size of the Spark Stage output data, and t_w is the time required to write 1 MB of data locally.
The method for optimizing the Spark SQL execution workflow is characterized in that the cost C_sort(Stage) of sorting the intermediate data is calculated as shown in formula (4):
[formula (4): image in original publication]
where C_0 is the seek time plus rotational delay, B is the Spark task buffer size, |D_out| is the size of the Spark Stage output data, m is the number of tasks contained in each Stage, and t_r is the time required to read 1 MB of data locally.
In the method for optimizing the Spark SQL execution workflow, the total cost C(Stage) of task execution is calculated as shown in formula (5):
[formula (5): image in original publication]
where P is the number of sort passes, defined by
[formula for P: image in original publication]
the method for optimizing Spark SQL execution workflow, which is described in step S2, is used for two tasks Stage with input data correlationiAnd StagejAnd the cost of execution after they are combined into one task is calculated as shown in formula (6):
Figure BDA0001675284090000035
in the formula C (Stage)i+j) For two tasks Stage with input data dependencyiAnd StagejCost, t, of execution after merging into one taskrTime required for local reading of 1MB of data, twTime required for local writing of 1MB of data, tbThe time required for the network to transmit 1MB of data, B is the Spark task buffer size,
Figure BDA0001675284090000036
is StageiAnd StagejThe sum of the sizes of the output data.
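Expressed as code, the decision rule of step S2 compares the two quantities directly; a minimal sketch, assuming the three cost values are obtained from the cost model above (whose closed forms appear as formula images in the original):

```python
def should_merge(cost_i: float, cost_j: float, cost_merged: float) -> bool:
    """Merge two input-correlated Stages only if it is cheaper overall.

    cost_i, cost_j -- C(Stage_i), C(Stage_j) when executed separately
    cost_merged    -- C(Stage_{i+j}) after merging, per formula (6)
    """
    return cost_i + cost_j > cost_merged
```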
Advantageous effects:
compared with the existing Spark SQL workflow, the method has the obvious advantage that the disk I/O cost of the Spark SQL task for reading the input data can be obviously reduced, so that the execution efficiency of the Spark SQL program is further improved.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a TPC-H Q17 query statement;
FIG. 3 is a flow chart of the execution of the Q17 query under Spark;
FIG. 4 is a flow chart of the execution of the Q17 query after merging;
FIG. 5 shows the result of a cost-based correlation merging algorithm test experiment;
FIG. 6 is a comparison of disk I/O;
FIG. 7 is a graph comparing CPU usage.
Detailed Description
The present invention will be further illustrated below with reference to specific embodiments, which are to be understood as merely illustrative and not limitative of the scope of the present invention.
Spark task execution model:
The cost generated by a Spark task during execution arises mainly in three aspects: reading input data, merging and sorting intermediate data, and writing intermediate data, so the cost model can be expressed as:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
Since C_read(Stage) and C_write(Stage) are both I/O costs, each is calculated as C_I/O = C_0·T + C_1·x. The definition of each parameter is shown in Table 1:
TABLE 1 Parameters in the cost model
C_0: seek time plus rotational delay per I/O operation
C_1: time required to transfer 1 MB of data
T: number of I/O operations
x: amount of data read or written
t_r: time required to read 1 MB of data locally
t_w: time required to write 1 MB of data locally
t_b: time required to transmit 1 MB of data over the network
α: fraction of non-local data in the total data
|D_in|: size of the Spark Stage input data
|D_out|: size of the Spark Stage output data
B: Spark task buffer size
m: number of tasks contained in each Stage
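For the sketches that follow, the parameters of Table 1 can be collected into a single structure; this is an illustrative Python grouping, not part of the patent, and the text fixes no concrete values except α = 0.3:

```python
from dataclasses import dataclass

@dataclass
class CostParams:
    c0: float      # seek time plus rotational delay per I/O operation (s)
    t_r: float     # time to read 1 MB locally (s)
    t_w: float     # time to write 1 MB locally (s)
    t_b: float     # time to transfer 1 MB over the network (s)
    alpha: float   # fraction of non-local data (0.3 under default HDFS replication)
    B: float       # Spark task buffer size (MB)
    m: int         # number of tasks per Stage
```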
The cost of the read phase of a Stage, i.e. C_read(Stage), is calculated as follows:
C_1·x = |D_in|·t_r + α·|D_in|·t_b = (t_r + α·t_b)·|D_in|
|D_in| is determined by the size of the source input data, or by the size of the output data of other Stages. α is taken to be 0.3, which is mainly determined by the default three-copy storage strategy of HDFS: one copy is stored locally, one in the same rack, and one in a remote rack. The number of I/O operations T is determined by the specific Stage. For an S-Stage, if the source input data is read, T = 1 because the source data is stored contiguously. If the output of other Stages is read, T is determined by the number of intermediate data files generated by the previous Stage. Because a Spark task first writes intermediate data into a buffer and spills it to disk as a file whenever the buffer becomes full, if the previous Stage generates |D_out| of data it produces
⌈|D_out|/B⌉
intermediate files, where B is the Spark task buffer size, so the number of I/O operations is T = ⌈|D_out|/B⌉.
For a J-Stage, since a join operation is performed on two tables, the input of the Stage must be the output of two other Stages. Suppose the two preceding Stages together produce |D_out| of output data; if the two-way merge external sort join algorithm is adopted, the data must be scanned
[number of scans: image in original publication]
times, i.e. the number of I/O operations is
[image in original publication]
In summary, the cost of the Stage read phase is calculated as follows:
[formula (2): image in original publication]
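A minimal sketch of the read-phase cost for the two S-Stage cases spelled out above, reusing the CostParams grouping; the published formula (2), including the J-Stage branch, appears only as an image, so this covers just what the text states:

```python
import math

def c_read_source(p: CostParams, d_in: float) -> float:
    """S-Stage reading contiguously stored source data: T = 1."""
    return p.c0 * 1 + (p.t_r + p.alpha * p.t_b) * d_in

def c_read_intermediate(p: CostParams, d_in: float, d_out_prev: float) -> float:
    """S-Stage reading a previous Stage's output: T = ceil(|D_out_prev| / B) spill files."""
    t = math.ceil(d_out_prev / p.B)
    return p.c0 * t + (p.t_r + p.alpha * p.t_b) * d_in
```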
The cost of the write phase of a Stage, i.e. C_write(Stage): because the intermediate data are all spilled to the local disk, this process does not involve any network transfer cost, and the calculation is as follows:
C_1·x = |D_out|·t_w
|D_out| can be calculated from the intermediate data caching mechanism of the Spark Shuffle. The number of I/O operations T is determined by how many intermediate files are generated, and the number of intermediate files is
⌈|D_out|/B⌉.
The cost of the Stage write phase is therefore calculated as follows:
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|
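Assembled in the same style, a sketch of the write-phase cost reconstructed from the text, reusing CostParams (the original gives formula (3) only as an image):

```python
import math

def c_write(p: CostParams, d_out: float) -> float:
    """Write-phase cost: T = ceil(|D_out|/B) spill files, C1*x = t_w*|D_out|."""
    return p.c0 * math.ceil(d_out / p.B) + p.t_w * d_out
```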
The cost of the sorting phase of a Stage, i.e. C_sort(Stage): because each task in a Stage must sort and merge all of the intermediate data files it generates, and the Stage as a whole produces
⌈|D_out|/B⌉
intermediate files, each task can be considered to have generated
⌈|D_out|/(m·B)⌉
files, where m is the number of tasks contained in each Stage. C_sort(Stage) is then calculated as follows:
[formula (4): image in original publication]
Let the number of sort passes be
[formula for P: image in original publication]
Then the cost of executing one Stage in Spark is
[formula (5): image in original publication]
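Since formulas (4) and (5) are reproduced only as images, a sketch can still assemble the total Stage cost per the decomposition in formula (1), taking the sort-phase cost as an externally supplied value and reusing the read and write sketches above:

```python
def c_stage(p: CostParams, d_in: float, d_out: float, sort_cost: float,
            reads_source: bool = True, d_out_prev: float = 0.0) -> float:
    """Total Stage cost per formula (1): C_read + C_sort + C_write.

    sort_cost is supplied by the caller because the closed form of C_sort
    (formulas (4)-(5)) appears only as an image in the original.
    """
    read = c_read_source(p, d_in) if reads_source else c_read_intermediate(p, d_in, d_out_prev)
    return read + sort_cost + c_write(p, d_out)
```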
Cost-based correlation merging algorithm:
The cost before and after merging is then calculated according to the cost model to judge whether to merge by input correlation. Suppose Stage_i and Stage_j read the same input data F, and assume that the intermediate data they output do not overlap, i.e.
[image in original publication]
According to the formula for P given above, the numbers of sort passes produced by Stage_i, Stage_j, and the merged Stage_{i+j} are denoted P_i, P_j, and P_G respectively:
[formula for P_i: image in original publication]
[formula for P_j: image in original publication]
[formula for P_G: image in original publication]
The cost formula for the merged Stage is as follows:
[image in original publication]
Let the benefit produced by merging be the variable Earn. Before merging, Stage_a reads only the intermediate data output by Stage_i, and Stage_b reads only the intermediate data output by Stage_j; after merging, Stage_a and Stage_b must both read the intermediate data produced by Stage_{i+j}, which incurs an additional read cost. Therefore, the extra sorting cost and the extra read cost are subtracted from the cost saved by reading the input data only once, giving the following formula for Earn:
[formula for Earn: image in original publication]
if Earn >0, indicating that the merging reduced disk read cost is greater than the generated additional sorting cost, Stage with input dependency can be merged. If Earn <0, Stage with input correlation is not merged. Therefore, whether to combine according to the input correlation can be checked only by checking whether the Earn value is greater than 0.
To verify the feasibility of the invention, several queries were selected from those provided by TPC-H to form three query tasks M1, M2 and M3. The M1 task contains queries Q4 and Q14, which read larger input data but generate little intermediate data. The M2 task contains queries Q2 and Q17, which read less input data but generate more intermediate data. The M3 task contains queries Q5 and Q9, which read larger input data and generate more intermediate data. The tasks were programmed and submitted to both the SSO system and the existing Spark SQL system, the query execution times were recorded, and the experimental results are shown in FIG. 5:
In the figure, the horizontal axis is the query execution time in seconds; the corresponding query tasks are listed on the vertical axis. Within each group of test results, the upper bar is the execution time on the SSO system and the lower bar is the execution time on the existing Spark SQL system; a smaller value indicates better performance.
As can be seen from FIG. 5, the SSO system has the most significant optimization effect on the M1 task, while the execution time of the M2 task on the SSO system is the same as on the existing Spark system. This is because merging the two queries in the M1 task reduces the disk I/O cost of reading the input data many times, and both queries generate very little intermediate data, so even the additional sorting cost of merging is small and the optimization effect is most obvious. The M2 task is exactly the opposite: the Q2 query reads 0.92 GB of input data and outputs 2.3 GB of intermediate data, while the Q17 query reads 16.7 GB of input data and outputs 7.1 GB of intermediate data. Calculated with the cost model above, both Q2 and Q17 read the part table, which is 232 MB (with the TPC-H benchmark data generated at a scale of 10 GB), so merging can save one read of the part table, i.e. 2.9 seconds of disk I/O, whereas the additional sorting cost produced is 18.2 seconds. It can be seen that for queries Q2 and Q17, sharing the same input data (the part table) produces a sorting cost higher than the saved cost of reading the input data, so the SSO system does not choose to merge the two queries in the M2 task. To analyze the performance advantage of the SSO system in detail, the M1 task was submitted to both the SSO system and the native Spark SQL system, and the disk I/O and CPU usage on the master node were monitored; the results are shown in FIG. 6 and FIG. 7:
It can be seen that M1 completes the query in 23 seconds and 77 seconds on the two systems respectively, so the advantage of the SSO system is obvious. Combined with the disk I/O and CPU usage shown in the figures, it can be seen that the SSO system significantly reduces the disk I/O cost at the initial stage of program execution by merging the two jobs with an input correlation. However, because the intermediate data generated by the Q4 and Q14 queries have no overlapping parts, merging does not reduce the disk I/O cost of reading and writing intermediate data, and from then on the disk I/O patterns of the two systems tend to be consistent. After merging, the intermediate data generated by the two queries are sorted together in a single sorting operation, which costs more than sorting the two queries separately, so in the SSO system the CPU usage in the Shuffle stage is higher than in the native Spark system. For data-intensive applications such as SQL queries, disk I/O is the most consumed resource, while other resources in the cluster tend to be under-used. The SSO system provided by the invention reduces the disk I/O cost by making better use of memory and CPU resources, and the experiments prove that the method can effectively reduce the execution time of SQL query applications.
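As a worked check of the merge criterion against the M2 numbers quoted above (2.9 seconds of disk I/O saved on the shared part table versus 18.2 seconds of additional sorting; any extra read cost only widens the gap), reusing the merge_if_beneficial sketch:

```python
# Earn = 2.9 - 18.2 - extra_read_cost < 0 for any extra_read_cost >= 0,
# so the SSO system correctly declines to merge Q2 and Q17 in the M2 task.
print(merge_if_beneficial(saved_read_cost=2.9, extra_sort_cost=18.2, extra_read_cost=0.0))  # False
```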
It should be noted that the above embodiments are only examples given for clarity of illustration and are not limiting; it is not necessary to exhaustively list all embodiments here. All components not specified in this embodiment can be realized with the prior art. Various modifications and adaptations that are apparent to those skilled in the art and do not depart from the principles of the invention are intended to fall within the scope of the invention.

Claims (1)

1. A method for optimizing Spark SQL execution workflow, which is characterized by comprising the following steps:
step S1: a parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree; each node in the logical execution plan tree corresponds to one task in Spark, and the specific operation in the query corresponding to each task can be located precisely by analyzing the logical execution plan tree, so that the execution cost of the task can be calculated; a cost model for Spark task execution is constructed, wherein the cost model comprises the cost of reading input data, the cost of sorting intermediate data and the cost of writing output data, and these three costs are summed to obtain the total cost of task execution, namely:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
in formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading the input data;
C_sort(Stage) is the cost of sorting the intermediate data;
C_write(Stage) is the cost of writing the output data;
since C_read(Stage) and C_write(Stage) are both I/O costs, each such term of formula (1) is expressed as C_I/O = C_0·T + C_1·x, where C_I/O is the sum of the cost of reading input data and the cost of writing output data, C_0 is the seek time plus rotational delay, T is the number of I/O operations, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written;
the cost C_read(Stage) of reading input data is given by formula (2):
[formula (2): image in original publication]
in the formula: t_r is the time required to read 1 MB of data locally, α is the fraction of non-local data in the total data, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the Spark task buffer size;
the cost C_write(Stage) of writing output data is calculated as shown in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
in the formula: t_w is the time required to write 1 MB of data locally;
the cost C_sort(Stage) of sorting the intermediate data is calculated as shown in formula (4):
[formula (4): image in original publication]
wherein m is the number of tasks contained in each Stage;
the total cost C(Stage) of task execution is calculated as shown in formula (5):
[formula (5): image in original publication]
wherein P is the number of sort passes, defined by
[formula for P: image in original publication]
step S2: whether tasks with an input correlation are merged is decided by using the cost model to compare the additional sorting cost and the additional read cost of subsequent tasks incurred by merging against the cost saved by reading the input data only once; the sum of the costs of executing the two tasks separately and the cost of executing them after they are merged into one task are calculated, the latter as shown in formula (6):
[formula (6): image in original publication]
in the formula, Stage_i and Stage_j are two tasks with an input-data correlation, C(Stage_{i+j}) is the cost of executing Stage_i and Stage_j after they are merged into one task, and |D_out_{i+j}| is the sum of the sizes of the output data of Stage_i and Stage_j.
CN201810536078.3A 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow Active CN108763489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Publications (2)

Publication Number Publication Date
CN108763489A CN108763489A (en) 2018-11-06
CN108763489B (en) 2022-02-15

Family

ID=64003941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810536078.3A Active CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Country Status (1)

Country Link
CN (1) CN108763489B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347122A (en) * 2020-11-10 2021-02-09 西安宇视信息科技有限公司 SQL workflow processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740249A (en) * 2014-12-08 2016-07-06 TCL Corporation Processing method and system during big data operation parallel scheduling process
CN107025273A (en) * 2017-03-17 2017-08-08 China Southern Power Grid Science Research Institute Co., Ltd. The optimization method and device of a kind of data query

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286748A1 (en) * 2014-04-08 2015-10-08 RedPoint Global Inc. Data Transformation System and Method
US20170024432A1 (en) * 2015-07-24 2017-01-26 International Business Machines Corporation Generating sql queries from declarative queries for semi-structured data

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105740249A (en) * 2014-12-08 2016-07-06 TCL Corporation Processing method and system during big data operation parallel scheduling process
CN107025273A (en) * 2017-03-17 2017-08-08 China Southern Power Grid Science Research Institute Co., Ltd. The optimization method and device of a kind of data query

Also Published As

Publication number Publication date
CN108763489A (en) 2018-11-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant