CN108763489A - Method for optimizing Spark SQL execution workflow - Google Patents

Method for optimizing Spark SQL execution workflow

Info

Publication number
CN108763489A
CN108763489A (application CN201810536078.3A)
Authority
CN
China
Prior art keywords
cost
stage
data
task
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810536078.3A
Other languages
Chinese (zh)
Other versions
CN108763489B (en)
Inventor
宋爱波
万雨桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201810536078.3A
Publication of CN108763489A
Application granted
Publication of CN108763489B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for optimizing the Spark SQL execution workflow. The method comprises: step S1, building a cost model for Spark task execution that decomposes the cost into the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data, the three being summed to obtain the total cost of task execution; and step S2, applying a cost-based correlation merge algorithm whose idea is, for two tasks with input-data correlation, to compute the sum of the costs of executing them separately and the cost of executing them after merging into a single task, and to decide whether to merge them by comparing the two. By means of the cost-based correlation merge algorithm, the invention solves the problem of repeatedly reading identical input data in Spark SQL queries.

Description

Method for optimizing Spark SQL execution workflow
Technical field:
The present invention relates to a method for optimizing the Spark SQL execution workflow, and belongs to the technical field of computer software.
Background art:
At present, the Catalyst framework is used on the Spark SQL platform to optimize workflows. Catalyst processes a query statement in much the same way as a relational database: the SQL statement first undergoes lexical and syntactic parsing (Parse) to form a logical execution plan tree, and the tree is then parsed, optimized, and bound according to a set of optimization rules, with pattern matching applying different operations to different types of node in the tree. The optimization performed on the logical execution plan tree is mainly algebraic, including rules such as predicate pushdown and column pruning. Predicate pushdown moves the selection and projection operations contained in a query below the join operations, shrinking the relations both horizontally and vertically and thereby reducing the cost incurred by the joins. Column pruning means that if the data table is stored column-wise, Spark SQL reads only the columns the query needs when reading input data, reducing the disk I/O cost to some extent. However, the current Catalyst framework involves only algebraic optimization of the query statement itself; there is no suitable set of optimization rules for the Spark job stream into which an SQL query is translated. If a table appears repeatedly in a single query statement submitted by a user, Spark SQL must read it repeatedly, causing redundant read cost and reducing the execution efficiency of the Spark program to a certain degree. For example, when a user submits the TPC-H Q17 query, whose statement is shown in Fig. 2, the standard Q17 references the lineitem table twice: once in the outer aggregation and once in a correlated subquery computing 0.2 * avg(l_quantity). The job stream executed under Spark is shown in Fig. 3, where it can be seen that Stage_1 and Stage_3 both read the table lineitem and thus have input correlation; the two can therefore be merged, and the query tree after merging is shown in Fig. 4.
By merging them into Stage_{1+3}, the cost of reading the lineitem table is cut in half. However, because the output data of Stage_{1+3} is the sum of the output data of the original Stage_1 and Stage_3, the cost of writing intermediate data after merging is unchanged.
Merging, however, may not save query time. For example, before merging, Stage_2 reads only the output of Stage_1, and Stage_5 reads only the output of Stage_3. After merging, Stage_2 and Stage_5 will both read the intermediate data generated by Stage_{1+3}, causing extra read cost. At the same time, because Spark's internal mechanism sorts the intermediate data generated in the Shuffle stage, the cost of sorting the intermediate data generated by Stage_1 and Stage_3 together is greater than the sum of the costs of sorting the two Stages separately, causing extra sort cost.
Summary of the invention
The object of the present invention is to provide a method for optimizing the Spark SQL execution workflow that, by means of a cost-based correlation merge algorithm, solves the problem of repeatedly reading identical input data in Spark SQL queries.
The above object is achieved through the following technical solution:
A method for optimizing the Spark SQL execution workflow, comprising the following steps:
Step S1: build a cost model for Spark task execution, decomposed into the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data; the three are summed to obtain the total cost of task execution;
Step S2: apply a cost-based correlation merge algorithm. The idea of the algorithm is, for two tasks with input-data correlation, to compute the sum of the costs of executing them separately and the cost of executing them after merging into a single task, and to decide whether to merge them by comparing the two.
In the method for optimizing the Spark SQL execution workflow, the specific way of computing the Spark task execution cost described in step S1 is: the parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree, in which each node corresponds to a task in Spark; by analyzing the tree structure, the specific query operations corresponding to each task can be located precisely, so that the execution cost of the task can be computed.
In the method for optimizing the Spark SQL execution workflow, step S2 uses the cost model to compare the extra sort cost incurred and the extra read cost of follow-up tasks against the saved cost of reading input data, and thereby decides whether to merge the tasks that have input correlation. A sketch of this merge pass is given below.
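For illustration only, the following is a minimal sketch of how such a cost-based merge pass over the stages of a job could look. The Stage record, the merge_pass function, and the merged_cost callback (which stands in for formula (6)) are expository assumptions, not identifiers taken from the patent:

```python
# A minimal sketch of the cost-based correlation merge pass (step S2),
# assuming each stage records which inputs it reads and its separate cost.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Stage:
    name: str
    inputs: frozenset          # tables (or upstream stages) this stage reads
    cost_separate: float       # C(Stage) from the cost model of step S1

def merge_pass(stages, merged_cost):
    """Greedily merge any pair of stages that share input data whenever
    the merged execution cost is below the sum of their separate costs."""
    merged = list(stages)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(merged, 2):
            if not (a.inputs & b.inputs):
                continue                    # no input correlation to exploit
            c_joint = merged_cost(a, b)     # cost of Stage_{i+j}, formula (6)
            if c_joint < a.cost_separate + b.cost_separate:   # i.e. Earn > 0
                merged.remove(a)
                merged.remove(b)
                merged.append(Stage(a.name + "+" + b.name,
                                    a.inputs | b.inputs, c_joint))
                changed = True
                break                       # restart scan over the new set
    return merged
```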
In the method for optimizing the Spark SQL execution workflow, the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data described in step S1 are summed to obtain the total cost of task execution:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
In formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading input data;
C_sort(Stage) is the cost of sorting intermediate data;
C_write(Stage) is the cost of writing output data.
Since C_read(Stage) and C_write(Stage) are both I/O costs, each takes the form C_I/O = C_0·T + C_1·x, where C_I/O denotes the cost of reading input data or writing output data, C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written.
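As a small numeric illustration of this generic I/O cost, the helper below evaluates C_I/O = C_0·T + C_1·x; the io_cost name and its default constants are placeholders, not values given by the patent:

```python
# Generic I/O cost C_I/O = C0*T + C1*x: T I/O operations, each paying the
# seek-plus-rotational latency C0, plus transferring x MB at C1 s/MB.
def io_cost(T, x, C0=0.01, C1=0.1):
    return C0 * T + C1 * x

# e.g. one contiguous scan of a 512 MB source file (T = 1):
print(io_cost(T=1, x=512.0))   # 0.01 s of latency + 51.2 s of transfer
```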
In the method for optimizing the Spark SQL execution workflow, the cost C_read(Stage) of reading input data is computed as in formula (2):
C_read(Stage) = C_0·T + (t_r + α·t_b)·|D_in|    (2)
In formula (2):
C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, t_r is the time required to read 1 MB of data locally, α is the fraction of the data that is non-local, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the size of the Spark task buffer.
In the method for optimizing the Spark SQL execution workflow, the cost C_write(Stage) of writing output data is computed as in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, and t_w is the time required to write 1 MB of data locally.
In the method for optimizing the Spark SQL execution workflow, the cost C_sort(Stage) of sorting intermediate data is computed as in formula (4), where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, m is the number of tasks contained in each Stage, and t_r is the time required to read 1 MB of data locally.
In the method for optimizing the Spark SQL execution workflow, the total cost C(Stage) of task execution is computed as in formula (5), where p is the number of sort passes.
In the method for optimizing the Spark SQL execution workflow, for the two tasks Stage_i and Stage_j with input-data correlation described in step S2, the cost of executing them after merging into a single task is computed as in formula (6). In formula (6), C(Stage_{i+j}) is the cost of executing the two input-correlated tasks Stage_i and Stage_j after they are merged into one task; t_r is the time required to read 1 MB of data locally; t_w is the time required to write 1 MB of data locally; t_b is the time required to transmit 1 MB of data over the network; B is the size of the Spark task buffer; and |D_out_{i+j}| is the sum of the sizes of the output data of Stage_i and Stage_j.
Advantageous effects:
Compared with the existing Spark SQL workflow, the notable advantage of the method of the present invention is that it significantly reduces the disk I/O cost incurred by Spark SQL tasks when reading input data, thereby further improving the execution efficiency of Spark SQL programs.
Description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 shows the TPC-H Q17 query statement;
Fig. 3 is the execution flow chart of query Q17 under Spark;
Fig. 4 is the execution flow chart of query Q17 after merging;
Fig. 5 shows the experimental results of the cost-based correlation merge algorithm;
Fig. 6 is the disk I/O comparison chart;
Fig. 7 is the CPU usage comparison chart.
Detailed description of the embodiments
The present invention is further illustrated below with reference to specific embodiments. It should be understood that the following embodiments are intended only to illustrate the present invention, not to limit its scope.
Spark task execution models:
The cost incurred when a Spark task executes arises mainly in three areas: reading input data, merge-sorting intermediate data, and writing intermediate data. The cost model can therefore be expressed as:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
Since C_read(Stage) and C_write(Stage) are both I/O costs, each is computed as C_I/O = C_0·T + C_1·x. The parameters are defined in Table 1:
Table 1. Parameters of the cost model
Parameter | Meaning
C_0       | seek time plus rotational latency
C_1       | time required to transfer 1 MB of data
T         | number of I/O operations performed
x         | amount of data read or written
t_r       | time required to read 1 MB of data locally
t_w       | time required to write 1 MB of data locally
t_b       | time required to transmit 1 MB of data over the network
α         | fraction of the data that is non-local
|D_in|    | size of the Stage input data
|D_out|   | size of the Stage output data
B         | size of the Spark task buffer
m         | number of tasks contained in each Stage
The cost of the read phase of a Stage, C_read(Stage), is computed as follows:
C_1·x = |D_in|·t_r + α·|D_in|·t_b = (t_r + α·t_b)·|D_in|
|D_in| is determined by the size of the source input data or by the size of the output data of other Stages. α is taken to be 0.3, which follows mainly from HDFS's default three-replica placement policy: one replica is stored locally, one on the same rack, and one on a remote rack. The number of I/O operations T is determined by the specific Stage. For an S-Stage, if it reads source input data, T = 1, since source data is stored contiguously. If it reads the output of another Stage, T is determined by the number of intermediate-data files generated by that previous Stage. When writing intermediate data, a Spark task first writes the data into a buffer, and each time the buffer fills it spills to disk and forms one file; assuming the previous Stage produced data of size |D_out|, it generates ⌈|D_out|/B⌉ intermediate files, where B is the size of the Spark task buffer, so T = ⌈|D_out|/B⌉ I/O operations occur. For a J-Stage, which performs a join on two tables, its inputs are the outputs of two other stages. Assuming the two preceding stages together produced output data of size |D_out|, then according to the two-way merge external-sort join algorithm the data must be scanned a corresponding number of times, which determines the number of I/O operations that occur. In summary, the cost of the Stage read phase is computed as in formula (2) above.
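As a toy illustration of how T varies with the stage type, the following helper mirrors the three cases just described; the stage-kind names are expository, and the join scan count is left as a parameter because the patent gives its exact value only in a formula image:

```python
import math

def io_count(kind, d_out_prev_mb=0.0, buffer_mb=32.0, join_scans=2):
    """Number of I/O operations T for a stage's read phase.
    kind: 'source' (contiguous scan), 'shuffle' (read a previous stage's
    spill files), or 'join' (scan the merged inputs join_scans times)."""
    if kind == "source":
        return 1                                   # contiguous source data
    files = math.ceil(d_out_prev_mb / buffer_mb)   # one spill file per full buffer
    if kind == "shuffle":
        return files
    if kind == "join":
        return join_scans * files                  # repeated scans during merge join
    raise ValueError(kind)
```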
The cost of the write phase of a Stage, C_write(Stage): since intermediate data is always spilled to the local disk, this process involves no network cost, and it is computed as follows:
C_1·x = |D_out|·t_w
|D_out| can be calculated by the method for estimating the Spark Shuffle intermediate-data cache introduced earlier. The number of I/O operations T is determined by how many intermediate files are produced, and the number of intermediate files is computed as ⌈|D_out|/B⌉. The cost of the Stage write phase is then computed as in formula (3) above.
The cost of the sort phase of a Stage, C_sort(Stage): since each task in a Stage sorts and merges all the intermediate-data files it has generated itself, and the Stage as a whole generates ⌈|D_out|/B⌉ intermediate files, each task can be considered to generate ⌈|D_out|/(m·B)⌉ intermediate files, where m is the number of tasks contained in each Stage. C_sort(Stage) is then computed as in formula (4) above.
Let the number of sort passes be p. The execution cost of a Stage in Spark is then given by formula (5) above.
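Purely as a sketch, the stage cost model described above can be assembled as follows. The machine constants are illustrative placeholders, and the two-way-merge pass count in sort_passes is an assumption consistent with the two-way external merge mentioned in the text, not the patent's exact formulas (4) and (5):

```python
# A sketch of the stage cost model assembled from the formulas above.
import math

C0 = 0.01               # seek time + rotational latency per I/O (s), placeholder
T_R, T_W = 0.08, 0.10   # local read / write time per MB (s), placeholders
T_B = 0.12              # network transfer time per MB (s), placeholder
ALPHA = 0.3             # non-local fraction (three-replica HDFS placement)
B = 32.0                # Spark task buffer size (MB), placeholder

def c_read(d_in, n_files=1):
    """Read cost per formula (2): n_files I/O operations plus local and
    non-local transfer of |D_in| MB. n_files is 1 for a contiguous source
    scan, otherwise the upstream intermediate-file count ceil(|D_out|/B)."""
    return C0 * n_files + (T_R + ALPHA * T_B) * d_in

def c_write(d_out):
    """Write cost per formula (3): one spill file per full buffer, plus
    writing |D_out| MB to the local disk."""
    return C0 * math.ceil(d_out / B) + T_W * d_out

def sort_passes(d_out, m):
    """Number of two-way merge passes over the ceil(|D_out|/(m*B)) runs
    produced by each of the m tasks (an assumption, see the lead-in)."""
    runs = math.ceil(d_out / (m * B))
    return math.ceil(math.log2(runs)) if runs > 1 else 0

def sort_pass_cost(d_out):
    """One merge pass re-reads and re-writes the intermediate data once."""
    return c_read(d_out, math.ceil(d_out / B)) + c_write(d_out)

def stage_cost(d_in, d_out, m, n_files=1):
    """C(Stage) = C_read + p * (cost of one sort pass) + C_write."""
    p = sort_passes(d_out, m)
    return c_read(d_in, n_files) + p * sort_pass_cost(d_out) + c_write(d_out)
```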
Cost-based correlation merge algorithm:
Next, the costs before and after merging are computed from the cost model to determine whether to merge according to input correlation. Suppose Stage_i and Stage_j read the same input data F, and suppose the intermediate data they output has no overlapping part, i.e., D_out_i ∩ D_out_j = ∅. Then, from formulas (4) and (5), let the sort costs incurred by Stage_i, Stage_j, and the merged Stage_{i+j} be P_i, P_j, and P_G respectively.
The cost of the merged Stage is then given by formula (6) above.
Let the income produced by merging be the variable Earn. Before merging, Stage_a needs to read only the intermediate data output by Stage_i, and Stage_b only the intermediate data output by Stage_j; after merging, Stage_a and Stage_b will both read the intermediate data generated by Stage_{i+j}, which incurs extra read cost. Therefore, Earn is obtained by taking the saved cost of reading the input data and subtracting the extra sort cost and the extra read cost.
If Earn > 0, the disk read cost saved by merging exceeds the extra sort cost incurred, and the Stages with input correlation can be merged. If Earn < 0, the Stages with input correlation are not merged. Deciding whether to merge according to input correlation therefore only requires checking whether the Earn value is greater than 0.
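A corresponding sketch of the merge decision, reusing the helpers from the previous sketch; the accounting of the extra downstream read (two consumers each rescanning the combined intermediate data, as in the Stage_a/Stage_b example above) is an illustrative assumption:

```python
# Sketch of Earn = saved input read - extra sort cost - extra downstream read.
# (Uses math, B, c_read, sort_passes, sort_pass_cost from the previous sketch.)
def earn(shared_input_mb, d_out_i, d_out_j, m):
    saved_read = c_read(shared_input_mb)      # shared input F read once, not twice
    combined = d_out_i + d_out_j
    extra_sort = (sort_passes(combined, m) * sort_pass_cost(combined)
                  - sort_passes(d_out_i, m) * sort_pass_cost(d_out_i)
                  - sort_passes(d_out_j, m) * sort_pass_cost(d_out_j))
    # After merging, each downstream stage scans the combined intermediate
    # data instead of only its own predecessor's output.
    extra_read = (2 * c_read(combined, math.ceil(combined / B))
                  - c_read(d_out_i, math.ceil(d_out_i / B))
                  - c_read(d_out_j, math.ceil(d_out_j / B)))
    return saved_read - extra_sort - extra_read

def should_merge(shared_input_mb, d_out_i, d_out_j, m):
    return earn(shared_input_mb, d_out_i, d_out_j, m) > 0  # merge iff Earn > 0
```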
To verify the feasibility of the present invention, several of the queries provided by TPC-H were chosen to form three query tasks M1, M2, and M3. Task M1 comprises queries Q4 and Q14, both of which read large input data but generate little intermediate data. Task M2 comprises queries Q2 and Q17, both of which read little input data but generate much intermediate data. Task M3 comprises queries Q5 and Q9, both of which read large input data and also generate much intermediate data. The tasks were programmed and submitted to both the SSO system and the existing Spark SQL system, and the query execution times were recorded; the experimental results are shown in Fig. 5:
In the figure, the horizontal axis is the query execution time in seconds, and the vertical axis records the corresponding query tasks. In each group of test results, the upper bar is the execution time of the SSO system and the lower bar is the execution time of the existing Spark SQL system; a smaller value indicates better performance.
As can be seen from Fig. 5, the optimization effect of the SSO system is most evident for task M1, while for task M2 the execution times of the SSO system and the existing Spark system converge. This is because merging the two queries in task M1 greatly reduces the disk I/O cost of reading input data, and the intermediate data both queries generate is very small, which means that even the extra sort cost introduced by merging is small; the optimization effect is therefore the most apparent. Task M2 is the opposite: query Q2 reads 0.92 GB of input data and outputs 2.3 GB of intermediate data, while query Q17 reads 16.7 GB of input data and outputs 7.1 GB of intermediate data. Calculated with the cost model above, Q2 and Q17 both read the part table, whose size is 232 MB (with TPC-H benchmark data generated at a 10 GB scale); merging would therefore save one read of the part table, a disk I/O cost of 2.9 seconds, while the extra sort cost incurred would be 18.2 seconds. The sort cost generated by the identical input data (the part table) shared by queries Q2 and Q17 is thus even higher than the saved cost of reading input data, so the SSO system does not choose to merge the two queries in task M2. To analyze the performance advantage of the SSO system in detail, task M1 was submitted to both the SSO system and the native Spark SQL system, and the disk I/O and CPU usage on the master node were monitored; the results are shown in Fig. 6 and Fig. 7:
It can be seen that task M1 took 23 seconds to complete on the SSO system versus 77 seconds on the native system, a clear advantage for SSO. Combining this with the disk I/O and CPU usage curves in the figures, the SSO system, by merging the two jobs with input correlation, significantly reduces the disk I/O cost in the early phase of program execution. However, because the intermediate data generated by the Q4 and Q14 queries has no overlapping part, merging does not reduce the disk I/O cost of reading and writing intermediate data, so the disk I/O of the two systems subsequently follows the same pattern. Furthermore, since after merging the intermediate data generated by the two queries is sorted together in a unified sort operation, which costs more than sorting the two queries' data separately, the CPU usage of the SSO system during the Shuffle stage is higher than that of the native Spark system. For a data-intensive application such as SQL querying, disk I/O is the most heavily consumed resource, while the other resources in the cluster often sit at a low utilization rate. The SSO system proposed by the present invention reduces the cost of disk I/O by raising the utilization of memory and CPU resources, and experiments confirm that this method can effectively reduce the execution time of SQL query applications.
It should be pointed out that the above embodiments are intended only to illustrate the invention and not to limit it; it is neither necessary nor possible to exhaust all embodiments here. Any component not specified in the embodiments is realized with the prior art. For those skilled in the art, several improvements and modifications can also be made without departing from the principle of the present invention, and these improvements and modifications should likewise be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A method for optimizing the Spark SQL execution workflow, characterized in that the method comprises the following steps:
Step S1: building a cost model for Spark task execution, decomposed into the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data, the three being summed to obtain the total cost of task execution;
Step S2: applying a cost-based correlation merge algorithm whose idea is, for two tasks with input-data correlation, to compute the sum of the costs of executing them separately and the cost of executing them after merging into a single task, and to decide whether to merge them by comparing the two.
2. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that the specific way of computing the Spark task execution cost in step S1 is: the parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree, in which each node corresponds to a task in Spark; by analyzing the tree structure, the specific query operations corresponding to each task can be located precisely, so that the execution cost of the task can be computed.
3. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that step S2 uses the cost model to compare the extra sort cost incurred and the extra read cost of follow-up tasks against the saved cost of reading input data, and thereby decides whether to merge the tasks having input correlation.
4. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data described in step S1 are summed to obtain the total cost of task execution as:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
In formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading input data;
C_sort(Stage) is the cost of sorting intermediate data;
C_write(Stage) is the cost of writing output data;
since C_read(Stage) and C_write(Stage) are both I/O costs, each takes the form C_I/O = C_0·T + C_1·x, where C_I/O is the cost of reading input data or writing output data, C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written.
5. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the cost C_read(Stage) of reading input data is computed as in formula (2):
C_read(Stage) = C_0·T + (t_r + α·t_b)·|D_in|    (2)
In formula (2):
C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, t_r is the time required to read 1 MB of data locally, α is the fraction of the data that is non-local, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the size of the Spark task buffer.
6. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the cost C_write(Stage) of writing output data is computed as in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, and t_w is the time required to write 1 MB of data locally.
7. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the cost C_sort(Stage) of sorting intermediate data is computed as in formula (4), where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, m is the number of tasks contained in each Stage, and t_r is the time required to read 1 MB of data locally.
8. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the total cost C(Stage) of task execution is computed as in formula (5), where p is the number of sort passes.
9. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that for the two tasks Stage_i and Stage_j with input-data correlation described in step S2, the cost of executing them after merging into a single task is computed as in formula (6), where C(Stage_{i+j}) is the cost of executing the two input-correlated tasks Stage_i and Stage_j after they are merged into one task, t_r is the time required to read 1 MB of data locally, t_w is the time required to write 1 MB of data locally, t_b is the time required to transmit 1 MB of data over the network, B is the size of the Spark task buffer, and |D_out_{i+j}| is the sum of the sizes of the output data of Stage_i and Stage_j.
CN201810536078.3A 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow Active CN108763489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Publications (2)

Publication Number Publication Date
CN108763489A true CN108763489A (en) 2018-11-06
CN108763489B CN108763489B (en) 2022-02-15

Family

ID=64003941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810536078.3A Active CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Country Status (1)

Country Link
CN (1) CN108763489B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347122A (en) * 2020-11-10 2021-02-09 西安宇视信息科技有限公司 SQL workflow processing method and device, electronic equipment and storage medium
CN113868230A (en) * 2021-10-20 2021-12-31 重庆邮电大学 Large table connection optimization method based on Spark calculation framework
CN113868230B (en) * 2021-10-20 2024-06-04 重庆邮电大学 Large-scale connection optimization method based on Spark computing framework

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286748A1 (en) * 2014-04-08 2015-10-08 RedPoint Global Inc. Data Transformation System and Method
CN105740249A (en) * 2014-12-08 2016-07-06 Tcl集团股份有限公司 Processing method and system during big data operation parallel scheduling process
US20170024432A1 (en) * 2015-07-24 2017-01-26 International Business Machines Corporation Generating sql queries from declarative queries for semi-structured data
CN107025273A (en) * 2017-03-17 2017-08-08 南方电网科学研究院有限责任公司 The optimization method and device of a kind of data query


Also Published As

Publication number Publication date
CN108763489B (en) 2022-02-15

Similar Documents

Publication Publication Date Title
Marcu et al. Spark versus flink: Understanding performance in big data analytics frameworks
US10521427B2 (en) Managing data queries
Nehme et al. Automated partitioning design in parallel database systems
Hueske et al. Opening the black boxes in data flow optimization
Nykiel et al. MRShare: sharing across multiple queries in MapReduce
Shi et al. Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs
US5325525A (en) Method of automatically controlling the allocation of resources of a parallel processor computer system by calculating a minimum execution time of a task and scheduling subtasks against resources to execute the task in the minimum time
Bruno et al. Continuous cloud-scale query optimization and processing
US20170255673A1 (en) Batch Data Query Method and Apparatus
CN101021874A (en) Method and apparatus for optimizing request to poll SQL
Perron et al. How I learned to stop worrying and love re-optimization
Graefe et al. Robust query processing (dagstuhl seminar 12321)
JP2017535869A (en) Processing queries involving union-type operations
Tatemura et al. Partiqle: An elastic SQL engine over key-value stores
CN108763489A (en) A method of optimization Spark SQL execute workflow
CN108334565A (en) A kind of data mixing storage organization, data store query method, terminal and medium
CN108710640B (en) Method for improving search efficiency of Spark SQL
Borovica-Gajic et al. Robust performance in database query processing (Dagstuhl seminar 17222)
Galindo-Legaria et al. Optimizing star join queries for data warehousing in microsoft sql server
Ji et al. Query execution optimization in spark SQL
Chacko et al. Improving execution speed of incremental runs of MapReduce using provenance
Rozet et al. Muse: Multi-query event trend aggregation
JP2009529735A (en) Managing statistical views in a database system
JP2780996B2 (en) Query optimization processing method
Wang et al. A new scheme for cache optimization based on cluster computing framework spark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant