CN102323946A - Implementation method for operator reuse in parallel database

Info

Publication number: CN102323946A
Application number: CN201110259524A
Authority: CN
Inventors: 李阳; 何清法; 顾云苏; 冯柯; 蒋志勇; 徐岩; 饶路; 李晓鹏; 刘荣; 赵婧
Original assignee: TIANJIN SHENZHOU GENERAL DATA CO Ltd
Current assignee: TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority date: 2011-09-05
Filing date: 2011-09-05
Publication date: 2012-01-18
Anticipated expiration: 2031-09-05
Also published as: CN102323946B

Abstract

The invention discloses an implementation method for operator reuse in a parallel database, comprising the following steps: step 1, generating a serial query plan for the query through an ordinary query planning method, wherein the query plan is a binary tree structure; step 2, scanning the query plan from top to bottom, searching for reusable materialization operators, changing the query plan structure, and changing thread-level materialization operators into global reusable materialization operators; step 3, parallelizing the query plan changed in step 2, and generating a plan forest for parallel execution by a plurality of threads; step 4, merging the global reusable operators in the plan forest generated in step 3, and generating a directed graph plan in which the materialization operators can be reused and which can be executed by the plurality of threads in parallel; step 5, each thread executing its own part of the plan in the directed graph in parallel, wherein the thread that first reaches a global reusable operator is called the main thread, the main thread locks the global reusable operator and actually executes the operator and the plan beneath it, and the other threads wait; step 6, the main thread unlocking the global reusable operator after execution, whereupon the other threads start to read data from the global reusable operator and continue to execute their own plan trees; and step 7, the main thread releasing the materialized data of the operator after all the plans have finished reading the data of the global reusable operator.

Description

Implementation method for operator reuse in a parallel database

Technical field

The present invention relates to database systems, and in particular to an implementation method for operator reuse in a parallel database.

Background technology

With the development and spread of information technology, data volumes are growing exponentially, and processing massive data has increasingly become a major problem facing the computing field. The research that has emerged in the database field on OLAP, DSS, data mining, and similar topics is in essence research on massive data processing.

The prevailing techniques for the massive data processing problem today are parallel query and clustering. Parallel query has long been a research focus of the database field, and academia has proposed several parallel query architectures: Share-Everything, Share-Memory, Share-Disk, and Share-Nothing. The Share-Memory and Share-Everything architectures both share main memory, so processes or threads can exchange data through memory. However, the currently popular parallel database systems based on the Share-Memory and Share-Everything architectures use shared memory only for communication and data exchange, and do not exploit operator reuse among concurrent processes or threads. Under a partition-based parallel architecture, multiple threads or processes carry out their own tasks independently; in the parallel query for a single query statement, the processes or threads often execute statements of almost identical structure that differ only in the partition sub-tables involved.

In this situation, having each process or thread run the whole query on its own, without considering operator reuse, is a significant waste of resources.

Summary of the invention

To solve the above problem, the invention provides a method that improves the resource utilization and performance of the system by implementing operator reuse in a parallel database with an SM (Share-Memory) architecture.

The invention adopts the following technical solution:

Step 1: use an ordinary query planning method to generate a serial query plan for the query, the query plan being a binary tree structure;

Step 2: scan the query plan from top to bottom, search for reusable materialization operators, and change the query plan structure so that thread-level materialization operators become global reusable materialization operators;

Step 3: parallelize the changed query plan produced by step 2, generating a plan forest for parallel execution by a plurality of threads;

Step 4: merge the global reusable operators in the plan forest produced by step 3, generating a directed graph plan in which the materialization operators can be reused and which can be executed by the plurality of threads in parallel;

Step 5: each thread executes its own part of the plan in the directed graph in parallel; the first thread to reach a global reusable operator is called the main thread; the main thread locks this global reusable operator and actually executes the operator and the plan beneath it, while the other threads wait;

Step 6: the main thread unlocks the operator after executing it, and the other threads start to read data from the global reusable operator and continue with their own plan trees;

Step 7: the main thread waits until all plans have finished reading the data of the global reusable operator and then releases the data materialized by the operator.

The invention optimizes the query execution flow of a parallel database with an SM architecture; its key feature is that identical materialization operators in multiple threads are shared. Compared with the ordinary parallel query execution flow, it saves CPU and memory resources and also reduces IO reads, without adding any cost.

Description of drawings

The invention is further described below with reference to the drawings and an embodiment.

Fig. 1 is a schematic diagram of the plan tree structure generated in step 1;

Fig. 2 is a schematic diagram of the plan tree structure of step 2;

Fig. 3 is a schematic diagram of the plan forest structure of step 3;

Fig. 4 is a schematic diagram of the directed graph plan structure of step 4;

Fig. 5 is a data flow diagram of the plan execution stage.

Embodiment

The invention is described in further detail below with reference to the drawings and a specific embodiment:

In the query optimization stage of the database, the multi-threaded parallel plan is scanned to find operators that can be reused; these operators are changed into globally shared operators, and the plan structure is changed: the plan tree becomes a plan forest, which is further rewritten into a directed graph. By executing this directed graph plan in parallel, the intermediate results of the materialization operators are reused during plan execution.

The method mainly comprises the following steps:

Plan generation stage:

Step 1:

Use an ordinary query planning method to generate a serial query plan for the query; this query plan is a binary tree structure. The query involves a partitioned table, so some of the leaf nodes are scans of the partitioned table. As shown in Fig. 1, the example query is select * from A, B, P where A.a=B.b and A.a=P.p; A and B are not partitioned tables, and P is a partitioned table whose child partitions are P1 and P2. The plan tree for this query first builds a hash table on table B and joins A with B by HashJoin, then builds a hash table on the result of joining A and B, and finally joins that result with table P by HashJoin.
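The plan tree described above can be sketched as a small Python data structure. This is only an illustration: the `PlanNode` class, the `SeqScan` operator name, and the choice of left child as probe side and right child as build side are assumptions made here, since the figure itself is not reproduced in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlanNode:
    """One node of a binary query plan tree (hypothetical structure)."""
    op: str                             # operator name, e.g. "HashJoin", "Hash", "SeqScan"
    table: Optional[str] = None         # table name, set only on scan leaves
    left: Optional["PlanNode"] = None   # probe side of a join, or the only child
    right: Optional["PlanNode"] = None  # build side of a join


def build_example_plan() -> PlanNode:
    """Serial plan for: select * from A, B, P where A.a=B.b and A.a=P.p"""
    # Build a hash table on B, then join A with it.
    join_ab = PlanNode("HashJoin",
                       left=PlanNode("SeqScan", table="A"),
                       right=PlanNode("Hash", left=PlanNode("SeqScan", table="B")))
    # Build a hash table on the A-B join result, then join P with it.
    return PlanNode("HashJoin",
                    left=PlanNode("SeqScan", table="P"),
                    right=PlanNode("Hash", left=join_ab))


def render(node: Optional[PlanNode], depth: int = 0) -> str:
    """Return an indented text rendering of the plan tree."""
    if node is None:
        return ""
    label = node.op + (f"({node.table})" if node.table else "")
    return ("  " * depth + label + "\n"
            + render(node.left, depth + 1)
            + render(node.right, depth + 1))


plan = build_example_plan()
print(render(plan))
```

Only the scan of P touches a partitioned table; everything under the upper Hash node depends on A and B alone, which is what the later steps exploit.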

Step 2:

Scan the execution plan from top to bottom to find reusable materialization operators, and change the plan structure so that thread-level materialization operators become global reusable materialization operators. Once one is found, the subtree beneath it is not scanned further. The criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it do not involve any partitioned table, the operator can be reused. For the plan in this example, as shown in Fig. 2, a top-down scan first finds that the operator building a hash table on the join result of A and B is a reusable materialization operator, so this Hash operator is changed into a GlobalHash operator. Scanning further down would show that building a hash table on table B is also a materialization operator that could be reused; however, because it lies in the subtree of the hash build on the A-B join result, and scanning stops once a reusable operator has been found, the hash build on table B is not itself marked for reuse.
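A minimal sketch of this top-down marking pass follows, in Python. The `PlanNode` class, the `MATERIALIZING` set (limited here to `Hash`, the only materialization operator named in the example), and the function names are assumptions for illustration, not the patent's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Assumption: "Hash" is the only materialization operator in this example.
MATERIALIZING = {"Hash"}


@dataclass
class PlanNode:
    op: str
    table: Optional[str] = None
    left: Optional["PlanNode"] = None
    right: Optional["PlanNode"] = None


def touches_partitioned(node: Optional[PlanNode], partitioned: Set[str]) -> bool:
    """True if the subtree rooted at node scans any partitioned table."""
    if node is None:
        return False
    if node.table in partitioned:
        return True
    return (touches_partitioned(node.left, partitioned)
            or touches_partitioned(node.right, partitioned))


def mark_global(node: Optional[PlanNode], partitioned: Set[str]) -> None:
    """Top-down scan: promote the first reusable operator on each path."""
    if node is None:
        return
    if node.op in MATERIALIZING and not touches_partitioned(node, partitioned):
        node.op = "Global" + node.op  # e.g. Hash -> GlobalHash
        return                        # do not scan the subtree any further
    mark_global(node.left, partitioned)
    mark_global(node.right, partitioned)


# Example plan from step 1: P HashJoin Hash(A HashJoin Hash(B)).
hash_b = PlanNode("Hash", left=PlanNode("SeqScan", table="B"))
join_ab = PlanNode("HashJoin", left=PlanNode("SeqScan", table="A"), right=hash_b)
hash_ab = PlanNode("Hash", left=join_ab)
plan = PlanNode("HashJoin", left=PlanNode("SeqScan", table="P"), right=hash_ab)

mark_global(plan, partitioned={"P"})
```

After the call, the upper hash build carries the `GlobalHash` name while the hash build on B keeps its original name, matching the early stop described in the text.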

Step 3:

Parallelize the query plan produced by step 2, generating a plan for parallel execution by a plurality of threads. Specifically, the partitioned tables in the plan are scanned; if all the partitioned tables involved are partitioned in the same way, the plan is duplicated into as many copies as there are partitions, and in each copy the parent partitioned table is replaced by one of its partition sub-tables, forming a plan forest. As shown in Fig. 3, the partitioned table in this example is P, and the number of partitions is 2. The plan from step 2 is therefore copied into 2 copies, and the table P in the two plan trees is changed to P1 and P2 respectively.
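The duplication step can be sketched as below. This is a hedged illustration: `PlanNode`, `substitute`, and `parallelize` are hypothetical names, and "partitioned in the same way" is simplified here to "same number of partitions", which is weaker than a full check of the partitioning scheme.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class PlanNode:
    op: str
    table: Optional[str] = None
    left: Optional["PlanNode"] = None
    right: Optional["PlanNode"] = None


def substitute(node: Optional[PlanNode], mapping: Dict[str, str]) -> None:
    """Replace parent partitioned tables with one sub-table each, in place."""
    if node is None:
        return
    if node.table in mapping:
        node.table = mapping[node.table]
    substitute(node.left, mapping)
    substitute(node.right, mapping)


def parallelize(plan: PlanNode, partitions: Dict[str, List[str]]) -> List[PlanNode]:
    """Copy the plan once per partition and rewrite the scans of each copy."""
    counts = {len(children) for children in partitions.values()}
    if len(counts) != 1:
        # Simplification: only the partition count is compared here.
        raise ValueError("partitioned tables must be partitioned the same way")
    forest = []
    for i in range(counts.pop()):
        tree = copy.deepcopy(plan)
        substitute(tree, {parent: kids[i] for parent, kids in partitions.items()})
        forest.append(tree)
    return forest


# Plan after step 2: P HashJoin GlobalHash(A HashJoin Hash(B)).
plan = PlanNode(
    "HashJoin",
    left=PlanNode("SeqScan", table="P"),
    right=PlanNode("GlobalHash", left=PlanNode(
        "HashJoin",
        left=PlanNode("SeqScan", table="A"),
        right=PlanNode("Hash", left=PlanNode("SeqScan", table="B")))))

forest = parallelize(plan, {"P": ["P1", "P2"]})
```

At this point each tree still owns a private copy of the GlobalHash subtree; sharing it is the job of step 4.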

Step 4:

Merge the global reusable operators in the plan forest produced by step 3, generating a directed graph plan in which the materialization operators can be reused and which can be executed by a plurality of threads in parallel. Specifically, each plan tree in the plan forest is scanned, and whenever a global materialization operator is encountered, the global materialization operators at the same position in all the plan trees are merged into one. As shown in Fig. 4, the plan forest in this example is merged into a directed graph plan.

Plan execution stage:
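One way to picture the merge is to walk every tree in lockstep with the first tree and redirect the parent pointers so that all trees reference a single global operator node. The sketch below assumes the trees have identical shape (as they do after step 3) and that a global operator is never the root; all names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PlanNode:
    op: str
    table: Optional[str] = None
    left: Optional["PlanNode"] = None
    right: Optional["PlanNode"] = None


def _merge(ref: PlanNode, node: PlanNode) -> None:
    """Walk two same-shaped trees together and share global operators."""
    for side in ("left", "right"):
        ref_child, child = getattr(ref, side), getattr(node, side)
        if ref_child is None or child is None:
            continue
        if ref_child.op.startswith("Global"):
            setattr(node, side, ref_child)  # both parents now point at one node
        else:
            _merge(ref_child, child)


def merge_global(forest: List[PlanNode]) -> List[PlanNode]:
    """Turn a plan forest into a directed graph with shared global operators."""
    for other in forest[1:]:
        _merge(forest[0], other)
    return forest


def tree_for(partition: str) -> PlanNode:
    """One tree of the step 3 forest: <partition> HashJoin GlobalHash(A join B)."""
    join_ab = PlanNode("HashJoin",
                       left=PlanNode("SeqScan", table="A"),
                       right=PlanNode("Hash", left=PlanNode("SeqScan", table="B")))
    return PlanNode("HashJoin",
                    left=PlanNode("SeqScan", table=partition),
                    right=PlanNode("GlobalHash", left=join_ab))


graph = merge_global([tree_for("P1"), tree_for("P2")])
```

The two roots remain distinct, each with its own scan of P1 or P2, but their build sides are now the same object, so the structure is a directed graph rather than a forest.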

Executing a parallel plan is in practice the concurrent execution of a plurality of threads that also pass data to one another. What materialization operator reuse does during plan execution is ensure that only one thread actually executes the operator and the plan beneath it; the other threads do not execute it and simply reuse its result. Specifically: each thread executes its own part of the plan in the directed graph in parallel; the first thread to reach a global reusable operator is called the main thread; the main thread locks this global reusable operator and actually executes the operator and the plan beneath it, while the other threads wait. The main thread unlocks the operator after executing it, and the other threads start to read data from the global reusable operator and continue with their own plan trees. The main thread waits until all plans have finished reading the data of the global reusable operator and then releases the data materialized by the operator.

The data flow in this example is shown in Fig. 5. As the figure shows, two threads execute their own HashJoin in parallel: the main thread completes the join of the three tables P1, A, and B, and the other thread completes the join of P2 with the same tables A and B. However, only the main thread executes the HashJoin of A and B and builds the hash table on the A-B join result; the other thread reads data directly from the GlobalHash operator and performs its HashJoin with table P2. Through operator reuse, building the hash table on table B and building the hash table on the A-B join result are each done only once, which saves memory and CPU resources; the join of A and B is done only once, which saves CPU resources; and the data of tables A and B is read only once, which saves IO resources.
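The locking protocol of steps 5 to 7 can be sketched with Python's `threading` module. This is an illustrative model under stated assumptions, not the patent's code: the `GlobalHash` class, its `acquire` and `finish` methods, and the toy single-column tables are all hypothetical, and a `threading.Condition` stands in for whatever lock the real system uses.

```python
import threading
from typing import Callable, Dict, List


class GlobalHash:
    """Shared materialization operator: built once, read by every thread."""

    def __init__(self, build_fn: Callable[[], Dict[int, int]], n_threads: int):
        self._build_fn = build_fn
        self._cond = threading.Condition()
        self._table = None
        self._main = None          # ident of the thread that built the table
        self._pending = n_threads  # threads that have not finished reading
        self.build_count = 0

    def acquire(self) -> Dict[int, int]:
        # Step 5: the first thread to arrive holds the lock and builds;
        # later threads block on the lock until the build is complete.
        with self._cond:
            if self._main is None:
                self._main = threading.get_ident()
                self._table = self._build_fn()
                self.build_count += 1
            return self._table     # step 6: everyone reads the shared result

    def finish(self) -> None:
        # Step 7: the main thread waits for all readers, then frees the data.
        with self._cond:
            self._pending -= 1
            self._cond.notify_all()
            if threading.get_ident() == self._main:
                self._cond.wait_for(lambda: self._pending == 0)
                self._table = None


def build_ab() -> Dict[int, int]:
    """Toy stand-in for: hash(B), A HashJoin B, hash(join result)."""
    a_rows, b_rows = [1, 2, 3], [2, 3, 4]
    hash_b = set(b_rows)
    return {a: a for a in a_rows if a in hash_b}


def worker(partition: List[int], ghash: GlobalHash, out: List[int]) -> None:
    table = ghash.acquire()
    out.extend(p for p in partition if p in table)  # probe with P1 or P2 rows
    ghash.finish()


def run():
    ghash = GlobalHash(build_ab, n_threads=2)
    parts = {"P1": [1, 2], "P2": [3, 5]}
    results: Dict[str, List[int]] = {name: [] for name in parts}
    threads = [threading.Thread(target=worker, args=(rows, ghash, results[name]))
               for name, rows in parts.items()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ghash, results


ghash, results = run()
```

Whichever thread reaches `acquire` first becomes the main thread, so the build runs exactly once regardless of scheduling; the other thread only probes the shared table with its own partition.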

The invention optimizes the query execution flow of a parallel database with an SM architecture; its key feature is that identical materialization operators in multiple threads are shared. Compared with the ordinary parallel query execution flow, it saves CPU and memory resources and also reduces IO reads, without adding any cost. In testing with the TPC-H 100G benchmark, operator reuse reduced CPU utilization by 16%, memory usage by 32%, and the IO volume by 35%.

Of course, the invention may also have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art may make various corresponding changes and modifications according to the invention, but all such changes and modifications shall fall within the protection scope of the appended claims.

Claims

1. An implementation method for operator reuse in a parallel database, comprising the following steps:

2. The implementation method for operator reuse in a parallel database as claimed in claim 1, wherein step 1 specifically comprises: the query involves a partitioned table, and some of the leaf nodes are scans of the partitioned table.

3. The implementation method for operator reuse in a parallel database as claimed in claim 1, wherein step 2 specifically comprises: after a reusable materialization operator is found, the subtree beneath it is not scanned further; the criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it do not involve any partitioned table, the materialization operator can be reused.

4. The implementation method for operator reuse in a parallel database as claimed in claim 1, wherein step 3 specifically comprises: scanning the partitioned tables in the plan; if all the partitioned tables involved are partitioned in the same way, duplicating the plan into as many copies as there are partitions, and replacing the parent partitioned table in each copy with a partition sub-table, forming a plan forest.

5. The implementation method for operator reuse in a parallel database as claimed in claim 1, wherein step 4 specifically comprises: scanning each plan tree in the plan forest, and whenever a global materialization operator is encountered, merging the global materialization operators at the same position in all the plan trees into one.