CN108763489A - Method for optimizing Spark SQL execution workflow - Google Patents

Method for optimizing Spark SQL execution workflow

Info

Publication number
CN108763489A
CN108763489A (application CN201810536078.3A)
Authority
CN
China
Prior art keywords
cost
stage
data
task
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810536078.3A
Other languages
Chinese (zh)
Other versions
CN108763489B (en)
Inventor
宋爱波
万雨桐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN201810536078.3A
Publication of CN108763489A
Application granted
Publication of CN108763489B
Legal status: Active

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for optimizing the Spark SQL execution workflow. The method comprises: step S1, building a cost model for Spark task execution that decomposes the cost into the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data, the three being summed to obtain the total cost of task execution; and step S2, applying a cost-based correlation merge algorithm whose idea is, for two tasks with input-data correlation, to compute the sum of the costs of executing them separately and the cost of executing them after merging into a single task, and to decide whether to merge them by comparing the two. By means of the cost-based correlation merge algorithm, the invention solves the problem of repeatedly reading identical input data in Spark SQL queries.

Description

Method for optimizing Spark SQL execution workflow
Technical field:
The present invention relates to a method for optimizing the Spark SQL execution workflow, and belongs to the technical field of computer software.
Background art:
At present, the Catalyst framework is used on the Spark SQL platform to optimize workflows. Catalyst processes a query statement in much the same way as a relational database: the SQL statement first undergoes lexical and syntactic parsing (Parse) to form a logical execution plan tree, and the tree is then parsed, optimized, and bound according to a set of optimization rules, with pattern matching applying different operations to different types of node in the tree. The optimization performed on the logical execution plan tree is mainly algebraic, including rules such as predicate pushdown and column pruning. Predicate pushdown moves the selection and projection operations contained in a query below the join operations, shrinking the relations both horizontally and vertically and thereby reducing the cost incurred by the joins. Column pruning means that if the data table is stored column-wise, Spark SQL reads only the columns the query needs when reading input data, reducing the disk I/O cost to some extent. However, the current Catalyst framework involves only algebraic optimization of the query statement itself; there is no suitable set of optimization rules for the Spark job stream into which an SQL query is translated. If a table appears repeatedly in a single query statement submitted by a user, Spark SQL must read it repeatedly, causing redundant read cost and reducing the execution efficiency of the Spark program to a certain degree. For example, when a user submits the TPC-H Q17 query, whose statement is shown in Fig. 2, the standard Q17 references the lineitem table twice: once in the outer aggregation and once in a correlated subquery computing 0.2 * avg(l_quantity). The job stream executed under Spark is shown in Fig. 3, where it can be seen that Stage_1 and Stage_3 both read the table lineitem and thus have input correlation; the two can therefore be merged, and the query tree after merging is shown in Fig. 4.
By merging them into Stage_{1+3}, the cost of reading the lineitem table is cut in half. However, because the output data of Stage_{1+3} is the sum of the output data of the original Stage_1 and Stage_3, the cost of writing intermediate data after merging is unchanged.
Merging, however, may not save query time. For example, before merging, Stage_2 reads only the output of Stage_1, and Stage_5 reads only the output of Stage_3. After merging, Stage_2 and Stage_5 will both read the intermediate data generated by Stage_{1+3}, causing extra read cost. At the same time, because Spark's internal mechanism sorts the intermediate data generated in the Shuffle stage, the cost of sorting the intermediate data generated by Stage_1 and Stage_3 together is greater than the sum of the costs of sorting the two Stages separately, causing extra sort cost.
Summary of the invention
The object of the present invention is to provide a method for optimizing the Spark SQL execution workflow that, by means of a cost-based correlation merge algorithm, solves the problem of repeatedly reading identical input data in Spark SQL queries.
The above object is achieved through the following technical solution:
A method for optimizing the Spark SQL execution workflow, comprising the following steps:
Step S1: build a cost model for Spark task execution, decomposed into the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data; the three are summed to obtain the total cost of task execution;
Step S2: apply a cost-based correlation merge algorithm. The idea of the algorithm is, for two tasks with input-data correlation, to compute the sum of the costs of executing them separately and the cost of executing them after merging into a single task, and to decide whether to merge them by comparing the two.
In the method for optimizing the Spark SQL execution workflow, the specific way of computing the Spark task execution cost described in step S1 is: the parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree, in which each node corresponds to a task in Spark; by analyzing the tree structure, the specific query operations corresponding to each task can be located precisely, so that the execution cost of the task can be computed.
In the method for optimizing the Spark SQL execution workflow, step S2 uses the cost model to compare the extra sort cost incurred and the extra read cost of follow-up tasks against the saved cost of reading input data, and thereby decides whether to merge the tasks that have input correlation. A sketch of this merge pass is given below.
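For illustration only, the following is a minimal sketch of how such a cost-based merge pass over the stages of a job could look. The Stage record, the merge_pass function, and the merged_cost callback (which stands in for formula (6)) are expository assumptions, not identifiers taken from the patent:

```python
# A minimal sketch of the cost-based correlation merge pass (step S2),
# assuming each stage records which inputs it reads and its separate cost.
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Stage:
    name: str
    inputs: frozenset          # tables (or upstream stages) this stage reads
    cost_separate: float       # C(Stage) from the cost model of step S1

def merge_pass(stages, merged_cost):
    """Greedily merge any pair of stages that share input data whenever
    the merged execution cost is below the sum of their separate costs."""
    merged = list(stages)
    changed = True
    while changed:
        changed = False
        for a, b in combinations(merged, 2):
            if not (a.inputs & b.inputs):
                continue                    # no input correlation to exploit
            c_joint = merged_cost(a, b)     # cost of Stage_{i+j}, formula (6)
            if c_joint < a.cost_separate + b.cost_separate:   # i.e. Earn > 0
                merged.remove(a)
                merged.remove(b)
                merged.append(Stage(a.name + "+" + b.name,
                                    a.inputs | b.inputs, c_joint))
                changed = True
                break                       # restart scan over the new set
    return merged
```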
In the method for optimizing the Spark SQL execution workflow, the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data described in step S1 are summed to obtain the total cost of task execution:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
In formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading input data;
C_sort(Stage) is the cost of sorting intermediate data;
C_write(Stage) is the cost of writing output data.
Since C_read(Stage) and C_write(Stage) are both I/O costs, each takes the form C_I/O = C_0·T + C_1·x, where C_I/O denotes the cost of reading input data or writing output data, C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written.
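As a small numeric illustration of this generic I/O cost, the helper below evaluates C_I/O = C_0·T + C_1·x; the io_cost name and its default constants are placeholders, not values given by the patent:

```python
# Generic I/O cost C_I/O = C0*T + C1*x: T I/O operations, each paying the
# seek-plus-rotational latency C0, plus transferring x MB at C1 s/MB.
def io_cost(T, x, C0=0.01, C1=0.1):
    return C0 * T + C1 * x

# e.g. one contiguous scan of a 512 MB source file (T = 1):
print(io_cost(T=1, x=512.0))   # 0.01 s of latency + 51.2 s of transfer
```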
In the method for optimizing the Spark SQL execution workflow, the cost C_read(Stage) of reading input data is computed as in formula (2):
C_read(Stage) = C_0·T + (t_r + α·t_b)·|D_in|    (2)
In formula (2):
C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, t_r is the time required to read 1 MB of data locally, α is the fraction of the data that is non-local, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the size of the Spark task buffer.
In the method for optimizing the Spark SQL execution workflow, the cost C_write(Stage) of writing output data is computed as in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, and t_w is the time required to write 1 MB of data locally.
In the method for optimizing the Spark SQL execution workflow, the cost C_sort(Stage) of sorting intermediate data is computed as in formula (4), where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, m is the number of tasks contained in each Stage, and t_r is the time required to read 1 MB of data locally.
In the method for optimizing the Spark SQL execution workflow, the total cost C(Stage) of task execution is computed as in formula (5), where p is the number of sort passes.
In the method for optimizing the Spark SQL execution workflow, for the two tasks Stage_i and Stage_j with input-data correlation described in step S2, the cost of executing them after merging into a single task is computed as in formula (6). In formula (6), C(Stage_{i+j}) is the cost of executing the two input-correlated tasks Stage_i and Stage_j after they are merged into one task; t_r is the time required to read 1 MB of data locally; t_w is the time required to write 1 MB of data locally; t_b is the time required to transmit 1 MB of data over the network; B is the size of the Spark task buffer; and |D_out_{i+j}| is the sum of the sizes of the output data of Stage_i and Stage_j.
Advantageous effects:
Compared with the existing Spark SQL workflow, the notable advantage of the method of the present invention is that it significantly reduces the disk I/O cost incurred by Spark SQL tasks when reading input data, thereby further improving the execution efficiency of Spark SQL programs.
Description of the drawings
Fig. 1 is the flow chart of the present invention;
Fig. 2 shows the TPC-H Q17 query statement;
Fig. 3 is the execution flow chart of query Q17 under Spark;
Fig. 4 is the execution flow chart of query Q17 after merging;
Fig. 5 shows the experimental results of the cost-based correlation merge algorithm;
Fig. 6 is the disk I/O comparison chart;
Fig. 7 is the CPU usage comparison chart.
Detailed description of the embodiments
The present invention is further illustrated below with reference to specific embodiments. It should be understood that the following embodiments are intended only to illustrate the present invention, not to limit its scope.
Spark task execution models:
The cost incurred when a Spark task executes arises mainly in three areas: reading input data, merge-sorting intermediate data, and writing intermediate data. The cost model can therefore be expressed as:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
Since C_read(Stage) and C_write(Stage) are both I/O costs, each is computed as C_I/O = C_0·T + C_1·x. The parameters are defined in Table 1:
Table 1. Parameters of the cost model
Parameter | Meaning
C_0       | seek time plus rotational latency
C_1       | time required to transfer 1 MB of data
T         | number of I/O operations performed
x         | amount of data read or written
t_r       | time required to read 1 MB of data locally
t_w       | time required to write 1 MB of data locally
t_b       | time required to transmit 1 MB of data over the network
α         | fraction of the data that is non-local
|D_in|    | size of the Stage input data
|D_out|   | size of the Stage output data
B         | size of the Spark task buffer
m         | number of tasks contained in each Stage
The cost of the read phase of a Stage, C_read(Stage), is computed as follows:
C_1·x = |D_in|·t_r + α·|D_in|·t_b = (t_r + α·t_b)·|D_in|
|D_in| is determined by the size of the source input data or by the size of the output data of other Stages. α is taken to be 0.3, which follows mainly from HDFS's default three-replica placement policy: one replica is stored locally, one on the same rack, and one on a remote rack. The number of I/O operations T is determined by the specific Stage. For an S-Stage, if it reads source input data, T = 1, since source data is stored contiguously. If it reads the output of another Stage, T is determined by the number of intermediate-data files generated by that previous Stage. When writing intermediate data, a Spark task first writes the data into a buffer, and each time the buffer fills it spills to disk and forms one file; assuming the previous Stage produced data of size |D_out|, it generates ⌈|D_out|/B⌉ intermediate files, where B is the size of the Spark task buffer, so T = ⌈|D_out|/B⌉ I/O operations occur. For a J-Stage, which performs a join on two tables, its inputs are the outputs of two other stages. Assuming the two preceding stages together produced output data of size |D_out|, then according to the two-way merge external-sort join algorithm the data must be scanned a corresponding number of times, which determines the number of I/O operations that occur. In summary, the cost of the Stage read phase is computed as in formula (2) above.
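As a toy illustration of how T varies with the stage type, the following helper mirrors the three cases just described; the stage-kind names are expository, and the join scan count is left as a parameter because the patent gives its exact value only in a formula image:

```python
import math

def io_count(kind, d_out_prev_mb=0.0, buffer_mb=32.0, join_scans=2):
    """Number of I/O operations T for a stage's read phase.
    kind: 'source' (contiguous scan), 'shuffle' (read a previous stage's
    spill files), or 'join' (scan the merged inputs join_scans times)."""
    if kind == "source":
        return 1                                   # contiguous source data
    files = math.ceil(d_out_prev_mb / buffer_mb)   # one spill file per full buffer
    if kind == "shuffle":
        return files
    if kind == "join":
        return join_scans * files                  # repeated scans during merge join
    raise ValueError(kind)
```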
The cost of the write phase of a Stage, C_write(Stage): since intermediate data is always spilled to the local disk, this process involves no network cost, and it is computed as follows:
C_1·x = |D_out|·t_w
|D_out| can be calculated by the method for estimating the Spark Shuffle intermediate-data cache introduced earlier. The number of I/O operations T is determined by how many intermediate files are produced, and the number of intermediate files is computed as ⌈|D_out|/B⌉. The cost of the Stage write phase is then computed as in formula (3) above.
The cost of the sort phase of a Stage, C_sort(Stage): since each task in a Stage sorts and merges all the intermediate-data files it has generated itself, and the Stage as a whole generates ⌈|D_out|/B⌉ intermediate files, each task can be considered to generate ⌈|D_out|/(m·B)⌉ intermediate files, where m is the number of tasks contained in each Stage. C_sort(Stage) is then computed as in formula (4) above.
Let the number of sort passes be p. The execution cost of a Stage in Spark is then given by formula (5) above.
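Purely as a sketch, the stage cost model described above can be assembled as follows. The machine constants are illustrative placeholders, and the two-way-merge pass count in sort_passes is an assumption consistent with the two-way external merge mentioned in the text, not the patent's exact formulas (4) and (5):

```python
# A sketch of the stage cost model assembled from the formulas above.
import math

C0 = 0.01               # seek time + rotational latency per I/O (s), placeholder
T_R, T_W = 0.08, 0.10   # local read / write time per MB (s), placeholders
T_B = 0.12              # network transfer time per MB (s), placeholder
ALPHA = 0.3             # non-local fraction (three-replica HDFS placement)
B = 32.0                # Spark task buffer size (MB), placeholder

def c_read(d_in, n_files=1):
    """Read cost per formula (2): n_files I/O operations plus local and
    non-local transfer of |D_in| MB. n_files is 1 for a contiguous source
    scan, otherwise the upstream intermediate-file count ceil(|D_out|/B)."""
    return C0 * n_files + (T_R + ALPHA * T_B) * d_in

def c_write(d_out):
    """Write cost per formula (3): one spill file per full buffer, plus
    writing |D_out| MB to the local disk."""
    return C0 * math.ceil(d_out / B) + T_W * d_out

def sort_passes(d_out, m):
    """Number of two-way merge passes over the ceil(|D_out|/(m*B)) runs
    produced by each of the m tasks (an assumption, see the lead-in)."""
    runs = math.ceil(d_out / (m * B))
    return math.ceil(math.log2(runs)) if runs > 1 else 0

def sort_pass_cost(d_out):
    """One merge pass re-reads and re-writes the intermediate data once."""
    return c_read(d_out, math.ceil(d_out / B)) + c_write(d_out)

def stage_cost(d_in, d_out, m, n_files=1):
    """C(Stage) = C_read + p * (cost of one sort pass) + C_write."""
    p = sort_passes(d_out, m)
    return c_read(d_in, n_files) + p * sort_pass_cost(d_out) + c_write(d_out)
```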
Cost-based correlation merge algorithm:
Next, the costs before and after merging are computed from the cost model to determine whether to merge according to input correlation. Suppose Stage_i and Stage_j read the same input data F, and suppose the intermediate data they output has no overlapping part, i.e., D_out_i ∩ D_out_j = ∅. Then, from formulas (4) and (5), let the sort costs incurred by Stage_i, Stage_j, and the merged Stage_{i+j} be P_i, P_j, and P_G respectively.
The cost of the merged Stage is then given by formula (6) above.
Let the income produced by merging be the variable Earn. Before merging, Stage_a needs to read only the intermediate data output by Stage_i, and Stage_b only the intermediate data output by Stage_j; after merging, Stage_a and Stage_b will both read the intermediate data generated by Stage_{i+j}, which incurs extra read cost. Therefore, Earn is obtained by taking the saved cost of reading the input data and subtracting the extra sort cost and the extra read cost.
If Earn > 0, the disk read cost saved by merging exceeds the extra sort cost incurred, and the Stages with input correlation can be merged. If Earn < 0, the Stages with input correlation are not merged. Deciding whether to merge according to input correlation therefore only requires checking whether the Earn value is greater than 0.
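A corresponding sketch of the merge decision, reusing the helpers from the previous sketch; the accounting of the extra downstream read (two consumers each rescanning the combined intermediate data, as in the Stage_a/Stage_b example above) is an illustrative assumption:

```python
# Sketch of Earn = saved input read - extra sort cost - extra downstream read.
# (Uses math, B, c_read, sort_passes, sort_pass_cost from the previous sketch.)
def earn(shared_input_mb, d_out_i, d_out_j, m):
    saved_read = c_read(shared_input_mb)      # shared input F read once, not twice
    combined = d_out_i + d_out_j
    extra_sort = (sort_passes(combined, m) * sort_pass_cost(combined)
                  - sort_passes(d_out_i, m) * sort_pass_cost(d_out_i)
                  - sort_passes(d_out_j, m) * sort_pass_cost(d_out_j))
    # After merging, each downstream stage scans the combined intermediate
    # data instead of only its own predecessor's output.
    extra_read = (2 * c_read(combined, math.ceil(combined / B))
                  - c_read(d_out_i, math.ceil(d_out_i / B))
                  - c_read(d_out_j, math.ceil(d_out_j / B)))
    return saved_read - extra_sort - extra_read

def should_merge(shared_input_mb, d_out_i, d_out_j, m):
    return earn(shared_input_mb, d_out_i, d_out_j, m) > 0  # merge iff Earn > 0
```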
To verify the feasibility of the present invention, several of the queries provided by TPC-H were chosen to form three query tasks M1, M2, and M3. Task M1 comprises queries Q4 and Q14, both of which read large input data but generate little intermediate data. Task M2 comprises queries Q2 and Q17, both of which read little input data but generate much intermediate data. Task M3 comprises queries Q5 and Q9, both of which read large input data and also generate much intermediate data. The tasks were programmed and submitted to both the SSO system and the existing Spark SQL system, and the query execution times were recorded; the experimental results are shown in Fig. 5:
In the figure, the horizontal axis is the query execution time in seconds, and the vertical axis records the corresponding query tasks. In each group of test results, the upper bar is the execution time of the SSO system and the lower bar is the execution time of the existing Spark SQL system; a smaller value indicates better performance.
As can be seen from Fig. 5, the optimization effect of the SSO system is most evident for task M1, while for task M2 the execution times of the SSO system and the existing Spark system converge. This is because merging the two queries in task M1 greatly reduces the disk I/O cost of reading input data, and the intermediate data both queries generate is very small, which means that even the extra sort cost introduced by merging is small; the optimization effect is therefore the most apparent. Task M2 is the opposite: query Q2 reads 0.92 GB of input data and outputs 2.3 GB of intermediate data, while query Q17 reads 16.7 GB of input data and outputs 7.1 GB of intermediate data. Calculated with the cost model above, Q2 and Q17 both read the part table, whose size is 232 MB (with TPC-H benchmark data generated at a 10 GB scale); merging would therefore save one read of the part table, a disk I/O cost of 2.9 seconds, while the extra sort cost incurred would be 18.2 seconds. The sort cost generated by the identical input data (the part table) shared by queries Q2 and Q17 is thus even higher than the saved cost of reading input data, so the SSO system does not choose to merge the two queries in task M2. To analyze the performance advantage of the SSO system in detail, task M1 was submitted to both the SSO system and the native Spark SQL system, and the disk I/O and CPU usage on the master node were monitored; the results are shown in Fig. 6 and Fig. 7:
It can be seen that task M1 took 23 seconds to complete on the SSO system versus 77 seconds on the native system, a clear advantage for SSO. Combining this with the disk I/O and CPU usage curves in the figures, the SSO system, by merging the two jobs with input correlation, significantly reduces the disk I/O cost in the early phase of program execution. However, because the intermediate data generated by the Q4 and Q14 queries has no overlapping part, merging does not reduce the disk I/O cost of reading and writing intermediate data, so the disk I/O of the two systems subsequently follows the same pattern. Furthermore, since after merging the intermediate data generated by the two queries is sorted together in a unified sort operation, which costs more than sorting the two queries' data separately, the CPU usage of the SSO system during the Shuffle stage is higher than that of the native Spark system. For a data-intensive application such as SQL querying, disk I/O is the most heavily consumed resource, while the other resources in the cluster often sit at a low utilization rate. The SSO system proposed by the present invention reduces the cost of disk I/O by raising the utilization of memory and CPU resources, and experiments confirm that this method can effectively reduce the execution time of SQL query applications.
It should be pointed out that the above embodiments are intended only to illustrate the invention and not to limit it; it is neither necessary nor possible to exhaust all embodiments here. Any component not specified in the embodiments is realized with the prior art. For those skilled in the art, several improvements and modifications can also be made without departing from the principle of the present invention, and these improvements and modifications should likewise be regarded as falling within the protection scope of the present invention.

Claims (9)

1. A method for optimizing the Spark SQL execution workflow, characterized in that the method comprises the following steps:
Step S1: building a cost model for Spark task execution, decomposed into the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data, the three being summed to obtain the total cost of task execution;
Step S2: applying a cost-based correlation merge algorithm whose idea is, for two tasks with input-data correlation, to compute the sum of the costs of executing them separately and the cost of executing them after merging into a single task, and to decide whether to merge them by comparing the two.
2. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that the specific way of computing the Spark task execution cost in step S1 is: the parser in the Catalyst framework parses the query submitted by the user to form a logical execution plan tree, in which each node corresponds to a task in Spark; by analyzing the tree structure, the specific query operations corresponding to each task can be located precisely, so that the execution cost of the task can be computed.
3. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that step S2 uses the cost model to compare the extra sort cost incurred and the extra read cost of follow-up tasks against the saved cost of reading input data, and thereby decides whether to merge the tasks having input correlation.
4. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that the cost of reading input data, the cost of sorting intermediate data, and the cost of writing output data described in step S1 are summed to obtain the total cost of task execution as:
C(Stage) = C_read(Stage) + C_sort(Stage) + C_write(Stage)    (1)
In formula (1):
C(Stage) is the total cost of task execution;
C_read(Stage) is the cost of reading input data;
C_sort(Stage) is the cost of sorting intermediate data;
C_write(Stage) is the cost of writing output data;
since C_read(Stage) and C_write(Stage) are both I/O costs, each takes the form C_I/O = C_0·T + C_1·x, where C_I/O is the cost of reading input data or writing output data, C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, C_1 is the time required to transfer 1 MB of data, and x is the amount of data read or written.
5. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the cost C_read(Stage) of reading input data is computed as in formula (2):
C_read(Stage) = C_0·T + (t_r + α·t_b)·|D_in|    (2)
In formula (2):
C_0 is the seek time plus rotational latency, T is the number of I/O operations performed, t_r is the time required to read 1 MB of data locally, α is the fraction of the data that is non-local, t_b is the time required to transmit 1 MB of data over the network, |D_in| is the size of the Spark Stage input data, |D_out| is the size of the Spark Stage output data, and B is the size of the Spark task buffer.
6. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the cost C_write(Stage) of writing output data is computed as in formula (3):
C_write(Stage) = C_0·⌈|D_out|/B⌉ + t_w·|D_out|    (3)
where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, and t_w is the time required to write 1 MB of data locally.
7. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the cost C_sort(Stage) of sorting intermediate data is computed as in formula (4), where C_0 is the seek time plus rotational latency, B is the size of the Spark task buffer, |D_out| is the size of the Spark Stage output data, m is the number of tasks contained in each Stage, and t_r is the time required to read 1 MB of data locally.
8. The method for optimizing the Spark SQL execution workflow according to claim 4, characterized in that the total cost C(Stage) of task execution is computed as in formula (5), where p is the number of sort passes.
9. The method for optimizing the Spark SQL execution workflow according to claim 1, characterized in that for the two tasks Stage_i and Stage_j with input-data correlation described in step S2, the cost of executing them after merging into a single task is computed as in formula (6), where C(Stage_{i+j}) is the cost of executing the two input-correlated tasks Stage_i and Stage_j after they are merged into one task, t_r is the time required to read 1 MB of data locally, t_w is the time required to write 1 MB of data locally, t_b is the time required to transmit 1 MB of data over the network, B is the size of the Spark task buffer, and |D_out_{i+j}| is the sum of the sizes of the output data of Stage_i and Stage_j.
CN201810536078.3A 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow Active CN108763489B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810536078.3A CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Publications (2)

Publication Number Publication Date
CN108763489A true CN108763489A (en) 2018-11-06
CN108763489B CN108763489B (en) 2022-02-15

Family

ID=64003941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810536078.3A Active CN108763489B (en) 2018-05-28 2018-05-28 Method for optimizing Spark SQL execution workflow

Country Status (1)

Country Link
CN (1) CN108763489B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112347122A (en) * 2020-11-10 2021-02-09 西安宇视信息科技有限公司 SQL workflow processing method and device, electronic equipment and storage medium
CN113868230A (en) * 2021-10-20 2021-12-31 重庆邮电大学 Large table connection optimization method based on Spark calculation framework
CN113868230B (en) * 2021-10-20 2024-06-04 重庆邮电大学 Large-scale connection optimization method based on Spark computing framework

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286748A1 (en) * 2014-04-08 2015-10-08 RedPoint Global Inc. Data Transformation System and Method
CN105740249A (en) * 2014-12-08 2016-07-06 Tcl集团股份有限公司 Processing method and system during big data operation parallel scheduling process
US20170024432A1 (en) * 2015-07-24 2017-01-26 International Business Machines Corporation Generating sql queries from declarative queries for semi-structured data
CN107025273A (en) * 2017-03-17 2017-08-08 南方电网科学研究院有限责任公司 The optimization method and device of a kind of data query


Also Published As

Publication number Publication date
CN108763489B (en) 2022-02-15

Similar Documents

Publication Publication Date Title
Marcu et al. Spark versus flink: Understanding performance in big data analytics frameworks
US10521427B2 (en) Managing data queries
Nehme et al. Automated partitioning design in parallel database systems
Hueske et al. Opening the black boxes in data flow optimization
Nykiel et al. MRShare: sharing across multiple queries in MapReduce
Shi et al. Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs
US5325525A (en) Method of automatically controlling the allocation of resources of a parallel processor computer system by calculating a minimum execution time of a task and scheduling subtasks against resources to execute the task in the minimum time
Bruno et al. Continuous cloud-scale query optimization and processing
US20170255673A1 (en) Batch Data Query Method and Apparatus
CN101021874A (en) Method and apparatus for optimizing request to poll SQL
Perron et al. How I learned to stop worrying and love re-optimization
Graefe et al. Robust query processing (dagstuhl seminar 12321)
JP2017535869A (en) Processing queries involving union-type operations
Tatemura et al. Partiqle: An elastic SQL engine over key-value stores
CN108763489A (en) A method of optimization Spark SQL execute workflow
CN108334565A (en) A kind of data mixing storage organization, data store query method, terminal and medium
CN108710640B (en) Method for improving search efficiency of Spark SQL
Borovica-Gajic et al. Robust performance in database query processing (Dagstuhl seminar 17222)
Galindo-Legaria et al. Optimizing star join queries for data warehousing in microsoft sql server
Ji et al. Query execution optimization in spark SQL
Chacko et al. Improving execution speed of incremental runs of MapReduce using provenance
Rozet et al. Muse: Multi-query event trend aggregation
JP2009529735A (en) Managing statistical views in a database system
JP2780996B2 (en) Query optimization processing method
Wang et al. A new scheme for cache optimization based on cluster computing framework spark

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant