CN102323946B - Implementation method for operator reuse in parallel database - Google Patents

Implementation method for operator reuse in parallel database

Info

Publication number
CN102323946B
CN102323946B (application CN201110259524A)
Authority
CN
China
Prior art keywords
operator
plan
thread
materialization
reusable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201110259524
Other languages
Chinese (zh)
Other versions
CN102323946A (en)
Inventor
李阳
何清法
顾云苏
冯柯
蒋志勇
徐岩
饶路
李晓鹏
刘荣
赵婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Original Assignee
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN SHENZHOU GENERAL DATA CO Ltd filed Critical TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority to CN 201110259524 priority Critical patent/CN102323946B/en
Publication of CN102323946A publication Critical patent/CN102323946A/en
Application granted granted Critical
Publication of CN102323946B publication Critical patent/CN102323946B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an implementation method for operator reuse in a parallel database, comprising the following steps: step 1, generating a serial query plan for the query with an ordinary query-planning method, the query plan being a binary tree; step 2, scanning the query plan from top to bottom, searching for reusable materialization operators, changing the query plan structure, and converting thread-level materialization operators into globally reusable materialization operators; step 3, parallelizing the query plan changed in step 2 to generate a plan forest for parallel execution by multiple threads; step 4, merging the globally reusable operators in the plan forest generated in step 3 to produce a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel; step 5, each thread executing its own part of the directed graph in parallel, wherein the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and its sub-plan while the other threads wait; step 6, the main thread unlocking the globally reusable operator after execution, whereupon the other threads start reading data from it and continue executing their own plan trees; and step 7, the main thread releasing the operator's materialized data after all plans have read it.

Description

Implementation method for operator reuse in a parallel database
Technical field
The present invention relates to database systems, and in particular to a method for implementing operator reuse in a parallel database.
Background technology
With the development and spread of information technology, data volumes are growing exponentially, and processing massive data has become a major challenge in computing. Research topics that have emerged in the database field, such as OLAP, DSS, and data mining, are in essence all research on massive data processing.
The mainstream techniques for massive data processing today are parallel query processing and clustering. Parallel query processing has long been a research focus in the database field, and academia has proposed several parallel query architectures: Share-Everything (fully shared), Share-Memory (shared memory), Share-Disk (shared disk), and Share-Nothing. In the Share-Memory and Share-Everything architectures, processes or threads share main memory and can exchange data through it. However, current parallel database systems built on these architectures use shared memory only for communication and data exchange; they do not exploit operator reuse across concurrent processes or threads. Under a partition-based parallel architecture, multiple threads or processes execute their tasks independently; within a single query, the concurrently running processes or threads often execute statements with almost identical structure, differing only in which table partitions they touch.
In this case, each process or thread executes the whole query on its own, and ignoring operator reuse wastes significant resources.
Summary of the invention
To address this problem, the invention provides a method that improves resource utilization and system performance in a parallel database with a shared-memory (SM) architecture by implementing an operator-reuse technique.
The present invention adopts the following technical scheme:
Step 1: generate a serial query plan for the query using an ordinary query-planning method; the query plan is a binary tree.
Step 2: scan the query plan from top to bottom, searching for reusable materialization operators; change the plan structure, converting thread-level materialization operators into globally reusable materialization operators.
Step 3: parallelize the query plan changed in step 2, generating a plan forest for parallel execution by multiple threads.
Step 4: merge the globally reusable operators in the plan forest generated in step 3, producing a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel.
Step 5: each thread executes its own plan tree in the directed graph in parallel; the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and the plan tree beneath it while the other threads wait.
Step 6: after execution, the main thread unlocks the operator, and the other threads start reading data from it and continue executing their own plan trees.
Step 7: the main thread waits until all plan trees have read the data of the globally reusable operator, then releases the operator's materialized data. The criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it contain no partitioned table, the operator can be reused.
The present invention optimizes the query-execution flow of a parallel database with an SM architecture; its key feature is sharing identical materialization operators across multiple threads. Compared with an ordinary parallel query execution flow, it saves CPU and memory resources and performs fewer I/O reads, at no extra cost.
Description of drawings
The present invention is further illustrated below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the plan tree generated in step 1;
Fig. 2 is a schematic diagram of the plan tree after step 2;
Fig. 3 is a schematic diagram of the plan forest of step 3;
Fig. 4 is a schematic diagram of the directed-graph plan of step 4;
Fig. 5 is the data-flow diagram of the plan-execution stage.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
During the query-optimization stage, the database scans the multi-threaded parallel plan to find operators that can be reused, rewrites them as globally shared operators, and changes the plan structure: the plan tree becomes a plan forest, which is further rewritten into a directed graph. Executing this directed-graph plan in parallel reuses the intermediate results of the materialization operators during execution.
The method mainly comprises the following steps:
The plan generation phase:
Step 1:
Use an ordinary query-planning method to generate a serial query plan; the plan is a binary tree. The query involves a partitioned table, so some leaf nodes are scans of that table. As shown in Fig. 1, consider the query select * from A, B, P where A.a=B.b and A.a=P.p, where A and B are not partitioned and P is a partitioned table with sub-partitions P1 and P2. The plan tree shows that this query first builds a hash table on B, joins A with B via a HashJoin, then builds a hash table on the A-B join result, and joins it with P via another HashJoin.
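As a minimal illustration, the serial plan of step 1 can be modeled as a binary tree of operator nodes. The sketch below is hypothetical (the `PlanNode` type and `build_serial_plan` helper are assumptions, not from the patent); it builds the Fig. 1 plan for the example query:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanNode:
    op: str                        # e.g. "HashJoin", "Hash", "Scan"
    table: Optional[str] = None    # leaf scans name a table
    left: Optional["PlanNode"] = None
    right: Optional["PlanNode"] = None

def build_serial_plan() -> PlanNode:
    """Serial binary plan for: select * from A, B, P where A.a=B.b and A.a=P.p."""
    scan_a = PlanNode("Scan", table="A")
    hash_b = PlanNode("Hash", left=PlanNode("Scan", table="B"))  # hash table on B
    join_ab = PlanNode("HashJoin", left=scan_a, right=hash_b)    # join A with B
    hash_ab = PlanNode("Hash", left=join_ab)                     # hash table on the A-B result
    scan_p = PlanNode("Scan", table="P")                         # P is the partitioned table
    return PlanNode("HashJoin", left=scan_p, right=hash_ab)      # join with P
```

Which side of the top join P lands on is an arbitrary choice of this sketch; the patent's figure fixes only the operator order.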
Step 2:
Scan the execution plan from top to bottom for reusable materialization operators and change the plan, converting thread-level materialization operators into globally reusable materialization operators. Once such an operator is found, stop scanning its subtree. The criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it contain no partitioned table, the operator can be reused. In this example, as shown in Fig. 2, the top-down scan first finds that the hash table built on the A-B join result is a reusable materialization operator, so that Hash operator is rewritten as a GlobalHash operator. Continuing the scan downward would reveal that the hash table on B is also a reusable materialization operator, but since it lies inside the subtree of the hash table on the A-B join result, and the scan stops after finding a reusable operator, the hash table on B is not separately reused.
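The step-2 scan can be sketched as a recursive top-down walk that marks the first qualifying materialization operator and stops descending. Everything below is an illustrative assumption (dict-based nodes, the `PARTITIONED` and `MATERIALIZING` sets), not the patent's implementation:

```python
PARTITIONED = {"P", "P1", "P2"}   # assumed partitioned-table names
MATERIALIZING = {"Hash"}          # assumed materialization operators

def node(op, table=None, left=None, right=None):
    return {"op": op, "table": table, "left": left, "right": right}

def has_partitioned_scan(n):
    """True if any leaf scan under n reads a partitioned table."""
    if n is None:
        return False
    if n["op"] == "Scan" and n["table"] in PARTITIONED:
        return True
    return has_partitioned_scan(n["left"]) or has_partitioned_scan(n["right"])

def mark_global(n):
    """Top-down: rewrite the first qualifying Hash to GlobalHash, then stop."""
    if n is None:
        return
    if n["op"] in MATERIALIZING and not has_partitioned_scan(n):
        n["op"] = "Global" + n["op"]
        return                     # do not continue into this subtree
    mark_global(n["left"])
    mark_global(n["right"])

# Plan of Fig. 1: HashJoin(Scan P, Hash(HashJoin(Scan A, Hash(Scan B))))
plan = node("HashJoin",
            left=node("Scan", "P"),
            right=node("Hash",
                       left=node("HashJoin",
                                 left=node("Scan", "A"),
                                 right=node("Hash", left=node("Scan", "B")))))
mark_global(plan)
```

After the walk, only the outer Hash on the A-B join result becomes GlobalHash; the inner Hash on B is left untouched, matching the early-stop rule described above.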
Step 3:
Parallelize the query plan generated in step 2 to produce plans for multiple threads. Concretely: scan the plan for partitioned tables; if all partitioned tables involved use the same partitioning scheme, copy the plan as many times as there are partitions, and in each copy replace the partition master table with one sub-partition, forming a plan forest. As shown in Fig. 3, the partitioned table in this example is P with two partitions, so the plan from step 2 is copied twice, and the P table in the two plan trees is replaced with P1 and P2 respectively.
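A minimal sketch of step 3, under the same assumed dict-node representation as above (names are illustrative, not from the patent): deep-copy the plan once per partition and rewrite the scan of the partition master table to a scan of one sub-partition.

```python
import copy

def node(op, table=None, left=None, right=None):
    return {"op": op, "table": table, "left": left, "right": right}

def replace_scan(n, master, part):
    """Rewrite every Scan of the master table into a Scan of one sub-partition."""
    if n is None:
        return
    if n["op"] == "Scan" and n["table"] == master:
        n["table"] = part
    replace_scan(n["left"], master, part)
    replace_scan(n["right"], master, part)

def parallelize(plan, master, partitions):
    """One deep copy of the plan per partition: P becomes P1, P2, ..."""
    forest = []
    for part in partitions:
        tree = copy.deepcopy(plan)
        replace_scan(tree, master, part)
        forest.append(tree)
    return forest

# Plan after step 2 (GlobalHash already marked)
plan = node("HashJoin", left=node("Scan", "P"),
            right=node("GlobalHash",
                       left=node("HashJoin",
                                 left=node("Scan", "A"),
                                 right=node("Hash", left=node("Scan", "B")))))
forest = parallelize(plan, "P", ["P1", "P2"])
```

The deep copy matters: each thread's tree must be independently executable, so only the later merge step (step 4) reintroduces sharing, and only at the global operators.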
Step 4:
Merge the globally reusable operators in the plan forest generated in step 3, producing a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel. Concretely: scan each plan tree in the forest, and whenever a global materialization operator is encountered, merge the global materialization operators at the same position in all trees into a single one. As shown in Fig. 4, the plan forest in this example is merged into a directed-graph plan.
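The merge of step 4 can be sketched as re-pointing edges: the global operator at the same tree position in every copy is replaced by one canonical shared node, so the forest becomes a directed acyclic graph. This is an assumed implementation (position keyed by root-to-node path), not the patent's:

```python
def node(op, table=None, left=None, right=None):
    return {"op": op, "table": table, "left": left, "right": right}

def merge_globals(forest):
    """Share one instance of each global operator across all trees (step 4 sketch)."""
    shared = {}                            # position path -> canonical global node
    def visit(parent, side, n, path):
        if n is None:
            return
        if n["op"].startswith("Global"):
            canonical = shared.setdefault(path, n)
            if canonical is not n and parent is not None:
                parent[side] = canonical   # re-point the edge: forest becomes a DAG
            return                         # nothing below a global node to merge
        visit(n, "left", n["left"], path + "L")
        visit(n, "right", n["right"], path + "R")
    for tree in forest:
        visit(None, None, tree, "")
    return forest

def subplan(part):
    """One per-partition tree as produced by step 3 (illustrative)."""
    return node("HashJoin", left=node("Scan", part),
                right=node("GlobalHash",
                           left=node("HashJoin",
                                     left=node("Scan", "A"),
                                     right=node("Hash", left=node("Scan", "B")))))

forest = merge_globals([subplan("P1"), subplan("P2")])
```

After merging, the two trees literally reference the same GlobalHash object, which is what lets one thread materialize it once for everyone in the execution stage.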
The plan execute phase:
Executing a parallel plan is in fact the concurrent execution of multiple threads that exchange data with one another. With the materialization-operator reuse technique, only one thread actually executes a reused operator and the plan beneath it; the other threads execute nothing there and simply reuse its result. Concretely: each thread executes its own part of the directed graph in parallel; the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and the plan beneath it while the other threads wait. After execution, the main thread unlocks the operator, and the other threads start reading data from it and continue executing their own plan trees. The main thread waits until all plans have read the data of the globally reusable operator and then releases the materialized data. The data flow in this example is shown in Fig. 5: two threads execute their HashJoins in parallel; the main thread completes the three-way join of P1, A, and B, while the other thread completes the join of P2 with the same A and B. Only the main thread executes the HashJoin of A and B and builds the hash table on the A-B join result; the other thread joins P2 against data read directly from the GlobalHash operator. Through this operator reuse, the hash table on B and the hash table on the A-B join result are each built only once, saving memory and CPU; the join of A and B is performed only once, saving CPU; and the data of A and B are read only once, saving I/O.
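The locking protocol of steps 5-7 can be sketched as a one-shot shared operator: the first thread to take the lock becomes the main thread and builds the materialized result, the others wait, and the last reader releases the data. The class below is an illustrative assumption (the name `GlobalOperator` and the `threading.Event`-based signaling are not from the patent):

```python
import threading

class GlobalOperator:
    """Sketch of steps 5-7: build once, share across threads, free after last read."""
    def __init__(self, n_readers):
        self._build_lock = threading.Lock()   # won by the main thread only
        self._count_lock = threading.Lock()   # guards the reader countdown
        self._ready = threading.Event()
        self._remaining = n_readers
        self.result = None
        self.released = False

    def get(self, build):
        # Step 5: the first thread to take the lock is the "main thread" and
        # really executes the operator's sub-plan; the other threads wait.
        if self._build_lock.acquire(blocking=False):
            try:
                self.result = build()
            finally:
                self._ready.set()             # step 6: unlock, wake the waiters
        else:
            self._ready.wait()
        data = self.result                    # every plan tree reads the data
        # Step 7: once all plan trees have read the data, release it.
        with self._count_lock:
            self._remaining -= 1
            if self._remaining == 0:
                self.result = None
                self.released = True
        return data
```

Each thread reads `self.result` before decrementing the counter, so the materialized data is freed only after every participating plan tree has consumed it, mirroring the release condition of step 7.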
As measured in the TPC-H 100 GB benchmark, the operator-reuse technique reduces CPU utilization by 16%, memory usage by 32%, and I/O volume by 35%.
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art can make various corresponding changes and variations according to the present invention, and all such changes and variations shall fall within the protection scope of the appended claims.

Claims (5)

1. An implementation method for operator reuse in a parallel database, comprising the steps of:
step 1, generating a serial query plan for the query using an ordinary query-planning method, the query plan being a binary tree;
step 2, scanning the query plan from top to bottom, searching for reusable materialization operators, changing the plan structure, and converting thread-level materialization operators into globally reusable materialization operators;
step 3, parallelizing the query plan changed in step 2 to generate a plan forest for parallel execution by multiple threads;
step 4, merging the globally reusable operators in the plan forest generated in step 3 to produce a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel;
step 5, each thread executing its own plan tree in the directed graph in parallel, wherein the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and the plan tree beneath it while the other threads wait;
step 6, the main thread unlocking the operator after execution, whereupon the other threads start reading data from it and continue executing their own plan trees;
step 7, the main thread waiting until all plan trees have read the data of the globally reusable operator and then releasing the operator's materialized data; wherein the criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it contain no partitioned table, the operator can be reused.
2. The implementation method for operator reuse in a parallel database of claim 1, wherein step 1 specifically comprises: the query involves a partitioned table, and some leaf nodes are scans of the partitioned table.
3. The implementation method for operator reuse in a parallel database of claim 1, wherein step 2 specifically comprises: after a reusable materialization operator is found, its subtree is not scanned further.
4. The implementation method for operator reuse in a parallel database of claim 1, wherein step 3 specifically comprises: scanning the plan for partitioned tables; if all partitioned tables involved use the same partitioning scheme, copying the plan as many times as there are partitions and replacing the partition master table in each copy with a sub-partition, forming a plan forest.
5. The implementation method for operator reuse in a parallel database of claim 1, wherein step 4 specifically comprises: scanning each plan tree in the plan forest and, whenever a global materialization operator is encountered, merging the global materialization operators at the same position in all plan trees into one.
CN 201110259524 2011-09-05 2011-09-05 Implementation method for operator reuse in parallel database Active CN102323946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110259524 CN102323946B (en) 2011-09-05 2011-09-05 Implementation method for operator reuse in parallel database


Publications (2)

Publication Number Publication Date
CN102323946A CN102323946A (en) 2012-01-18
CN102323946B true CN102323946B (en) 2013-03-27

Family

ID=45451689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110259524 Active CN102323946B (en) 2011-09-05 2011-09-05 Implementation method for operator reuse in parallel database

Country Status (1)

Country Link
CN (1) CN102323946B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014015492A1 (en) * 2012-07-26 2014-01-30 华为技术有限公司 Data distribution method, device, and system
CN103678368B (en) * 2012-09-14 2017-02-08 华为技术有限公司 query processing method and device
CN103678619B (en) * 2013-12-17 2017-06-30 北京国双科技有限公司 Database index treating method and apparatus
CN105630789B (en) * 2014-10-28 2019-07-12 华为技术有限公司 A kind of inquiry plan method for transformation and device
US10339137B2 (en) * 2015-12-07 2019-07-02 Futurewei Technologies, Inc. System and method for caching and parameterizing IR
US10671607B2 (en) * 2016-09-23 2020-06-02 Futurewei Technologies, Inc. Pipeline dependent tree query optimizer and scheduler
US20180173753A1 (en) * 2016-12-16 2018-06-21 Futurewei Technologies, Inc. Database system and method for compiling serial and parallel database query execution plans
CN108829735B (en) * 2018-05-21 2021-06-29 上海达梦数据库有限公司 Synchronization method, device, server and storage medium for parallel execution plan
CN110909023B (en) * 2018-09-17 2021-11-19 华为技术有限公司 Query plan acquisition method, data query method and data query device
CN112270412B (en) * 2020-10-15 2023-10-27 北京百度网讯科技有限公司 Network operator processing method and device, electronic equipment and storage medium
CN112270413B (en) * 2020-10-22 2024-02-27 北京百度网讯科技有限公司 Operator merging method, device, electronic equipment and storage medium
CN116644090B (en) * 2023-07-27 2023-11-10 天津神舟通用数据技术有限公司 Data query method, device, equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002029643A1 (en) * 2000-10-06 2002-04-11 Whamtech, L.P. Enhanced boolean processor with parallel input
US7818349B2 (en) * 2004-02-21 2010-10-19 Datallegro, Inc. Ultra-shared-nothing parallel database
CN101187937A (en) * 2007-10-30 2008-05-28 北京航空航天大学 Mode multiplexing isomerous database access and integration method under gridding environment

Also Published As

Publication number Publication date
CN102323946A (en) 2012-01-18


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant