CN102323946B - Implementation method for operator reuse in parallel database - Google Patents

Implementation method for operator reuse in parallel database

Info

Publication number
CN102323946B
CN102323946B (application CN201110259524A)
Authority
CN
China
Prior art keywords
operator
plan
thread
materialization
reusable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201110259524
Other languages
Chinese (zh)
Other versions
CN102323946A (en)
Inventor
李阳
何清法
顾云苏
冯柯
蒋志勇
徐岩
饶路
李晓鹏
刘荣
赵婧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Original Assignee
TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TIANJIN SHENZHOU GENERAL DATA CO Ltd filed Critical TIANJIN SHENZHOU GENERAL DATA CO Ltd
Priority to CN 201110259524 priority Critical patent/CN102323946B/en
Publication of CN102323946A publication Critical patent/CN102323946A/en
Application granted granted Critical
Publication of CN102323946B publication Critical patent/CN102323946B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses an implementation method for operator reuse in a parallel database, comprising the following steps: step 1, generating a serial query plan for the query with an ordinary query-planning method, the query plan being a binary tree; step 2, scanning the query plan from top to bottom, searching for reusable materialization operators, changing the query plan structure, and converting thread-level materialization operators into globally reusable materialization operators; step 3, parallelizing the query plan changed in step 2 to generate a plan forest for parallel execution by multiple threads; step 4, merging the globally reusable operators in the plan forest generated in step 3 to produce a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel; step 5, each thread executing its own part of the directed graph in parallel, wherein the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and its sub-plan while the other threads wait; step 6, the main thread unlocking the globally reusable operator after execution, whereupon the other threads start reading data from it and continue executing their own plan trees; and step 7, the main thread releasing the operator's materialized data after all plans have read it.

Description

Implementation method for operator reuse in a parallel database
Technical field
The present invention relates to database systems, and in particular to a method for implementing operator reuse in a parallel database.
Background technology
With the development and spread of information technology, data volumes are growing exponentially, and processing massive data has become a major challenge in computing. Research topics that have emerged in the database field, such as OLAP, DSS, and data mining, are in essence all research on massive data processing.
The mainstream techniques for massive data processing today are parallel query processing and clustering. Parallel query processing has long been a research focus in the database field, and academia has proposed several parallel query architectures: Share-Everything (fully shared), Share-Memory (shared memory), Share-Disk (shared disk), and Share-Nothing. In the Share-Memory and Share-Everything architectures, processes or threads share main memory and can exchange data through it. However, current parallel database systems built on these architectures use shared memory only for communication and data exchange; they do not exploit operator reuse across concurrent processes or threads. Under a partition-based parallel architecture, multiple threads or processes execute their tasks independently; within a single query, the concurrently running processes or threads often execute statements with almost identical structure, differing only in which table partitions they touch.
In this case, each process or thread executes the whole query on its own, and ignoring operator reuse wastes significant resources.
Summary of the invention
To address this problem, the invention provides a method that improves resource utilization and system performance in a parallel database with a shared-memory (SM) architecture by implementing an operator-reuse technique.
The present invention adopts the following technical scheme:
Step 1: generate a serial query plan for the query using an ordinary query-planning method; the query plan is a binary tree.
Step 2: scan the query plan from top to bottom, searching for reusable materialization operators; change the plan structure, converting thread-level materialization operators into globally reusable materialization operators.
Step 3: parallelize the query plan changed in step 2, generating a plan forest for parallel execution by multiple threads.
Step 4: merge the globally reusable operators in the plan forest generated in step 3, producing a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel.
Step 5: each thread executes its own plan tree in the directed graph in parallel; the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and the plan tree beneath it while the other threads wait.
Step 6: after execution, the main thread unlocks the operator, and the other threads start reading data from it and continue executing their own plan trees.
Step 7: the main thread waits until all plan trees have read the data of the globally reusable operator, then releases the operator's materialized data. The criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it contain no partitioned table, the operator can be reused.
The present invention optimizes the query-execution flow of a parallel database with an SM architecture; its key feature is sharing identical materialization operators across multiple threads. Compared with an ordinary parallel query execution flow, it saves CPU and memory resources and performs fewer I/O reads, at no extra cost.
Description of drawings
The present invention is further illustrated below with reference to the drawings and specific embodiments.
Fig. 1 is a schematic diagram of the plan tree generated in step 1;
Fig. 2 is a schematic diagram of the plan tree after step 2;
Fig. 3 is a schematic diagram of the plan forest of step 3;
Fig. 4 is a schematic diagram of the directed-graph plan of step 4;
Fig. 5 is the data-flow diagram of the plan-execution stage.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
During the query-optimization stage, the database scans the multi-threaded parallel plan to find operators that can be reused, rewrites them as globally shared operators, and changes the plan structure: the plan tree becomes a plan forest, which is further rewritten into a directed graph. Executing this directed-graph plan in parallel reuses the intermediate results of the materialization operators during execution.
The method mainly comprises the following steps:
The plan generation phase:
Step 1:
Use an ordinary query-planning method to generate a serial query plan; the plan is a binary tree. The query involves a partitioned table, so some leaf nodes are scans of that table. As shown in Fig. 1, consider the query select * from A, B, P where A.a=B.b and A.a=P.p, where A and B are not partitioned and P is a partitioned table with sub-partitions P1 and P2. The plan tree shows that this query first builds a hash table on B, joins A with B via a HashJoin, then builds a hash table on the A-B join result, and joins it with P via another HashJoin.
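As a minimal illustration, the serial plan of step 1 can be modeled as a binary tree of operator nodes. The sketch below is hypothetical (the `PlanNode` type and `build_serial_plan` helper are assumptions, not from the patent); it builds the Fig. 1 plan for the example query:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PlanNode:
    op: str                        # e.g. "HashJoin", "Hash", "Scan"
    table: Optional[str] = None    # leaf scans name a table
    left: Optional["PlanNode"] = None
    right: Optional["PlanNode"] = None

def build_serial_plan() -> PlanNode:
    """Serial binary plan for: select * from A, B, P where A.a=B.b and A.a=P.p."""
    scan_a = PlanNode("Scan", table="A")
    hash_b = PlanNode("Hash", left=PlanNode("Scan", table="B"))  # hash table on B
    join_ab = PlanNode("HashJoin", left=scan_a, right=hash_b)    # join A with B
    hash_ab = PlanNode("Hash", left=join_ab)                     # hash table on the A-B result
    scan_p = PlanNode("Scan", table="P")                         # P is the partitioned table
    return PlanNode("HashJoin", left=scan_p, right=hash_ab)      # join with P
```

Which side of the top join P lands on is an arbitrary choice of this sketch; the patent's figure fixes only the operator order.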
Step 2:
Scan the execution plan from top to bottom for reusable materialization operators and change the plan, converting thread-level materialization operators into globally reusable materialization operators. Once such an operator is found, stop scanning its subtree. The criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it contain no partitioned table, the operator can be reused. In this example, as shown in Fig. 2, the top-down scan first finds that the hash table built on the A-B join result is a reusable materialization operator, so that Hash operator is rewritten as a GlobalHash operator. Continuing the scan downward would reveal that the hash table on B is also a reusable materialization operator, but since it lies inside the subtree of the hash table on the A-B join result, and the scan stops after finding a reusable operator, the hash table on B is not separately reused.
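The step-2 scan can be sketched as a recursive top-down walk that marks the first qualifying materialization operator and stops descending. Everything below is an illustrative assumption (dict-based nodes, the `PARTITIONED` and `MATERIALIZING` sets), not the patent's implementation:

```python
PARTITIONED = {"P", "P1", "P2"}   # assumed partitioned-table names
MATERIALIZING = {"Hash"}          # assumed materialization operators

def node(op, table=None, left=None, right=None):
    return {"op": op, "table": table, "left": left, "right": right}

def has_partitioned_scan(n):
    """True if any leaf scan under n reads a partitioned table."""
    if n is None:
        return False
    if n["op"] == "Scan" and n["table"] in PARTITIONED:
        return True
    return has_partitioned_scan(n["left"]) or has_partitioned_scan(n["right"])

def mark_global(n):
    """Top-down: rewrite the first qualifying Hash to GlobalHash, then stop."""
    if n is None:
        return
    if n["op"] in MATERIALIZING and not has_partitioned_scan(n):
        n["op"] = "Global" + n["op"]
        return                     # do not continue into this subtree
    mark_global(n["left"])
    mark_global(n["right"])

# Plan of Fig. 1: HashJoin(Scan P, Hash(HashJoin(Scan A, Hash(Scan B))))
plan = node("HashJoin",
            left=node("Scan", "P"),
            right=node("Hash",
                       left=node("HashJoin",
                                 left=node("Scan", "A"),
                                 right=node("Hash", left=node("Scan", "B")))))
mark_global(plan)
```

After the walk, only the outer Hash on the A-B join result becomes GlobalHash; the inner Hash on B is left untouched, matching the early-stop rule described above.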
Step 3:
Parallelize the query plan generated in step 2 to produce plans for multiple threads. Concretely: scan the plan for partitioned tables; if all partitioned tables involved use the same partitioning scheme, copy the plan as many times as there are partitions, and in each copy replace the partition master table with one sub-partition, forming a plan forest. As shown in Fig. 3, the partitioned table in this example is P with two partitions, so the plan from step 2 is copied twice, and the P table in the two plan trees is replaced with P1 and P2 respectively.
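A minimal sketch of step 3, under the same assumed dict-node representation as above (names are illustrative, not from the patent): deep-copy the plan once per partition and rewrite the scan of the partition master table to a scan of one sub-partition.

```python
import copy

def node(op, table=None, left=None, right=None):
    return {"op": op, "table": table, "left": left, "right": right}

def replace_scan(n, master, part):
    """Rewrite every Scan of the master table into a Scan of one sub-partition."""
    if n is None:
        return
    if n["op"] == "Scan" and n["table"] == master:
        n["table"] = part
    replace_scan(n["left"], master, part)
    replace_scan(n["right"], master, part)

def parallelize(plan, master, partitions):
    """One deep copy of the plan per partition: P becomes P1, P2, ..."""
    forest = []
    for part in partitions:
        tree = copy.deepcopy(plan)
        replace_scan(tree, master, part)
        forest.append(tree)
    return forest

# Plan after step 2 (GlobalHash already marked)
plan = node("HashJoin", left=node("Scan", "P"),
            right=node("GlobalHash",
                       left=node("HashJoin",
                                 left=node("Scan", "A"),
                                 right=node("Hash", left=node("Scan", "B")))))
forest = parallelize(plan, "P", ["P1", "P2"])
```

The deep copy matters: each thread's tree must be independently executable, so only the later merge step (step 4) reintroduces sharing, and only at the global operators.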
Step 4:
Merge the globally reusable operators in the plan forest generated in step 3, producing a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel. Concretely: scan each plan tree in the forest, and whenever a global materialization operator is encountered, merge the global materialization operators at the same position in all trees into a single one. As shown in Fig. 4, the plan forest in this example is merged into a directed-graph plan.
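The merge of step 4 can be sketched as re-pointing edges: the global operator at the same tree position in every copy is replaced by one canonical shared node, so the forest becomes a directed acyclic graph. This is an assumed implementation (position keyed by root-to-node path), not the patent's:

```python
def node(op, table=None, left=None, right=None):
    return {"op": op, "table": table, "left": left, "right": right}

def merge_globals(forest):
    """Share one instance of each global operator across all trees (step 4 sketch)."""
    shared = {}                            # position path -> canonical global node
    def visit(parent, side, n, path):
        if n is None:
            return
        if n["op"].startswith("Global"):
            canonical = shared.setdefault(path, n)
            if canonical is not n and parent is not None:
                parent[side] = canonical   # re-point the edge: forest becomes a DAG
            return                         # nothing below a global node to merge
        visit(n, "left", n["left"], path + "L")
        visit(n, "right", n["right"], path + "R")
    for tree in forest:
        visit(None, None, tree, "")
    return forest

def subplan(part):
    """One per-partition tree as produced by step 3 (illustrative)."""
    return node("HashJoin", left=node("Scan", part),
                right=node("GlobalHash",
                           left=node("HashJoin",
                                     left=node("Scan", "A"),
                                     right=node("Hash", left=node("Scan", "B")))))

forest = merge_globals([subplan("P1"), subplan("P2")])
```

After merging, the two trees literally reference the same GlobalHash object, which is what lets one thread materialize it once for everyone in the execution stage.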
The plan execute phase:
Executing a parallel plan is in fact the concurrent execution of multiple threads that exchange data with one another. With the materialization-operator reuse technique, only one thread actually executes a reused operator and the plan beneath it; the other threads execute nothing there and simply reuse its result. Concretely: each thread executes its own part of the directed graph in parallel; the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and the plan beneath it while the other threads wait. After execution, the main thread unlocks the operator, and the other threads start reading data from it and continue executing their own plan trees. The main thread waits until all plans have read the data of the globally reusable operator and then releases the materialized data. The data flow in this example is shown in Fig. 5: two threads execute their HashJoins in parallel; the main thread completes the three-way join of P1, A, and B, while the other thread completes the join of P2 with the same A and B. Only the main thread executes the HashJoin of A and B and builds the hash table on the A-B join result; the other thread joins P2 against data read directly from the GlobalHash operator. Through this operator reuse, the hash table on B and the hash table on the A-B join result are each built only once, saving memory and CPU; the join of A and B is performed only once, saving CPU; and the data of A and B are read only once, saving I/O.
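The locking protocol of steps 5-7 can be sketched as a one-shot shared operator: the first thread to take the lock becomes the main thread and builds the materialized result, the others wait, and the last reader releases the data. The class below is an illustrative assumption (the name `GlobalOperator` and the `threading.Event`-based signaling are not from the patent):

```python
import threading

class GlobalOperator:
    """Sketch of steps 5-7: build once, share across threads, free after last read."""
    def __init__(self, n_readers):
        self._build_lock = threading.Lock()   # won by the main thread only
        self._count_lock = threading.Lock()   # guards the reader countdown
        self._ready = threading.Event()
        self._remaining = n_readers
        self.result = None
        self.released = False

    def get(self, build):
        # Step 5: the first thread to take the lock is the "main thread" and
        # really executes the operator's sub-plan; the other threads wait.
        if self._build_lock.acquire(blocking=False):
            try:
                self.result = build()
            finally:
                self._ready.set()             # step 6: unlock, wake the waiters
        else:
            self._ready.wait()
        data = self.result                    # every plan tree reads the data
        # Step 7: once all plan trees have read the data, release it.
        with self._count_lock:
            self._remaining -= 1
            if self._remaining == 0:
                self.result = None
                self.released = True
        return data
```

Each thread reads `self.result` before decrementing the counter, so the materialized data is freed only after every participating plan tree has consumed it, mirroring the release condition of step 7.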
As measured in the TPC-H 100 GB benchmark, the operator-reuse technique reduces CPU utilization by 16%, memory usage by 32%, and I/O volume by 35%.
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the invention, those of ordinary skill in the art can make various corresponding changes and variations according to the present invention, and all such changes and variations shall fall within the protection scope of the appended claims.

Claims (5)

1. An implementation method for operator reuse in a parallel database, comprising the steps of:
step 1, generating a serial query plan for the query using an ordinary query-planning method, the query plan being a binary tree;
step 2, scanning the query plan from top to bottom, searching for reusable materialization operators, changing the plan structure, and converting thread-level materialization operators into globally reusable materialization operators;
step 3, parallelizing the query plan changed in step 2 to generate a plan forest for parallel execution by multiple threads;
step 4, merging the globally reusable operators in the plan forest generated in step 3 to produce a directed-graph plan whose reusable materialization operators can be executed by multiple threads in parallel;
step 5, each thread executing its own plan tree in the directed graph in parallel, wherein the first thread to reach a globally reusable operator is called the main thread, which locks the operator and actually executes it and the plan tree beneath it while the other threads wait;
step 6, the main thread unlocking the operator after execution, whereupon the other threads start reading data from it and continue executing their own plan trees;
step 7, the main thread waiting until all plan trees have read the data of the globally reusable operator and then releasing the operator's materialized data; wherein the criterion for a reusable materialization operator is: if a materialization operator and the plan tree beneath it contain no partitioned table, the operator can be reused.
2. The implementation method for operator reuse in a parallel database of claim 1, wherein step 1 specifically comprises: the query involves a partitioned table, and some leaf nodes are scans of the partitioned table.
3. The implementation method for operator reuse in a parallel database of claim 1, wherein step 2 specifically comprises: after a reusable materialization operator is found, its subtree is not scanned further.
4. The implementation method for operator reuse in a parallel database of claim 1, wherein step 3 specifically comprises: scanning the plan for partitioned tables; if all partitioned tables involved use the same partitioning scheme, copying the plan as many times as there are partitions and replacing the partition master table in each copy with a sub-partition, forming a plan forest.
5. The implementation method for operator reuse in a parallel database of claim 1, wherein step 4 specifically comprises: scanning each plan tree in the plan forest and, whenever a global materialization operator is encountered, merging the global materialization operators at the same position in all plan trees into one.
CN 201110259524 2011-09-05 2011-09-05 Implementation method for operator reuse in parallel database Active CN102323946B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110259524 CN102323946B (en) 2011-09-05 2011-09-05 Implementation method for operator reuse in parallel database


Publications (2)

Publication Number Publication Date
CN102323946A CN102323946A (en) 2012-01-18
CN102323946B true CN102323946B (en) 2013-03-27

Family

ID=45451689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110259524 Active CN102323946B (en) 2011-09-05 2011-09-05 Implementation method for operator reuse in parallel database

Country Status (1)

Country Link
CN (1) CN102323946B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014015492A1 (en) * 2012-07-26 2014-01-30 华为技术有限公司 Data distribution method, device, and system
CN103678368B (en) * 2012-09-14 2017-02-08 华为技术有限公司 query processing method and device
CN103678619B (en) * 2013-12-17 2017-06-30 北京国双科技有限公司 Database index treating method and apparatus
CN105630789B (en) * 2014-10-28 2019-07-12 华为技术有限公司 A kind of inquiry plan method for transformation and device
US10339137B2 (en) * 2015-12-07 2019-07-02 Futurewei Technologies, Inc. System and method for caching and parameterizing IR
US10671607B2 (en) * 2016-09-23 2020-06-02 Futurewei Technologies, Inc. Pipeline dependent tree query optimizer and scheduler
US20180173753A1 (en) * 2016-12-16 2018-06-21 Futurewei Technologies, Inc. Database system and method for compiling serial and parallel database query execution plans
CN108829735B (en) * 2018-05-21 2021-06-29 上海达梦数据库有限公司 Synchronization method, device, server and storage medium for parallel execution plan
CN110909023B (en) * 2018-09-17 2021-11-19 华为技术有限公司 Query plan acquisition method, data query method and data query device
CN112270412B (en) * 2020-10-15 2023-10-27 北京百度网讯科技有限公司 Network operator processing method and device, electronic equipment and storage medium
CN112270413B (en) * 2020-10-22 2024-02-27 北京百度网讯科技有限公司 Operator merging method, device, electronic equipment and storage medium
CN116644090B (en) * 2023-07-27 2023-11-10 天津神舟通用数据技术有限公司 Data query method, device, equipment and medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002029643A1 (en) * 2000-10-06 2002-04-11 Whamtech, L.P. Enhanced boolean processor with parallel input
US7818349B2 (en) * 2004-02-21 2010-10-19 Datallegro, Inc. Ultra-shared-nothing parallel database
CN101187937A (en) * 2007-10-30 2008-05-28 北京航空航天大学 Mode multiplexing isomerous database access and integration method under gridding environment

Also Published As

Publication number Publication date
CN102323946A (en) 2012-01-18


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant