CN112035523A - Method, device and equipment for determining parallelism and storage medium - Google Patents

Method, device and equipment for determining parallelism and storage medium Download PDF

Info

Publication number
CN112035523A
CN112035523A
Authority
CN
China
Prior art keywords
plan
sub
parallelism
output data
current sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010865389.1A
Other languages
Chinese (zh)
Inventor
宋鑫
韩朱忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Dameng Database Co Ltd
Original Assignee
Shanghai Dameng Database Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Dameng Database Co Ltd filed Critical Shanghai Dameng Database Co Ltd
Priority to CN202010865389.1A priority Critical patent/CN112035523A/en
Publication of CN112035523A publication Critical patent/CN112035523A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24532Query optimisation of parallel queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation
    • G06F16/24534Query rewriting; Transformation
    • G06F16/24542Plan optimisation

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for determining parallelism. The method comprises the following steps: determining the initial parallelism of each sub-plan contained in an execution plan tree; scheduling the current sub-plan according to the initial parallelism of the current sub-plan; when the current sub-plan has been scheduled, if the actual number of output data rows of the current sub-plan meets a preset condition, adjusting the initial parallelism of the sub-plans in the execution plan tree that are associated with the current sub-plan and have not yet been scheduled according to the actual number of output data rows of the current sub-plan; and continuing to schedule the next sub-plan until all sub-plans contained in the execution plan tree have been scheduled. During scheduling and execution, the scheme dynamically adjusts the parallelism of the unscheduled sub-plans associated with a scheduled sub-plan according to the actual number of output data rows of that scheduled sub-plan, which effectively solves the problem that the parallelism cannot be flexibly adjusted in the prior art, makes effective use of system resources, and improves execution efficiency.

Description

Method, device and equipment for determining parallelism and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a method, a device, equipment and a storage medium for determining parallelism.
Background
With the advent of the big data era, the volume of data processed by systems keeps growing, and how to obtain the required data from massive data and process it in a timely manner has become a difficult problem.
In database management systems, parallel execution is an effective means of solving this problem. Parallel execution means starting multiple threads or processes simultaneously to complete a task together. In a database management system, the parallelism determines the number of threads or processes started at the same time, and it is a key factor affecting the efficiency and stability of parallel execution. For example, an excessively large parallelism means that each task processes only a small amount of data, so the capability of each execution unit cannot be fully exploited, and drawbacks such as resource occupation and high scheduling overhead arise; an excessively small parallelism makes a single task too heavy, puts too much CPU pressure on the system, leaves hardware resources under-utilized and makes the system response time too long. How to determine parallelism effectively is therefore an important problem that database management systems need to address in analytical application scenarios.
The parallelism obtained by conventional methods for determining parallelism is usually fixed and offers poor flexibility.
Disclosure of Invention
The embodiment of the invention provides a method, a device and equipment for determining parallelism and a storage medium, which improve the flexibility of parallelism scheduling.
In a first aspect, an embodiment of the present invention provides a method for determining a parallelism, including:
determining the initial parallelism of each sub-plan contained in an execution plan tree, wherein the execution plan tree is generated by analyzing a query statement input by a user;
scheduling the current sub-plan according to the initial parallelism of the current sub-plan;
when the current sub-plan is scheduled, if the actual output data line number of the current sub-plan meets a preset condition, adjusting the initial parallelism of the sub-plan which is associated with the current sub-plan and is not scheduled in the execution plan tree according to the actual output data line number of the current sub-plan;
and continuing to schedule the next sub-plan until the sub-plans contained in the execution plan tree are scheduled.
In a second aspect, an embodiment of the present invention further provides a device for determining parallelism, including:
a parallelism determination module, which is used for determining the initial parallelism of each sub-plan contained in an execution plan tree, wherein the execution plan tree is generated by analyzing a query statement input by a user;
the scheduling module is used for scheduling the current sub-plan according to the initial parallelism of the current sub-plan;
the parallelism adjusting module is used for adjusting the initial parallelism of the sub-plan which is related to the current sub-plan and is not scheduled in the execution plan tree according to the actual output data line number of the current sub-plan if the actual output data line number of the current sub-plan meets a preset condition when the current sub-plan is scheduled;
and the scheduling module is used for continuously scheduling the next sub-plan until the sub-plan included in the execution plan tree is scheduled.
In a third aspect, an embodiment of the present invention further provides a computer device, including:
one or more processors;
a memory for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method for determining parallelism as described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for determining parallelism according to the first aspect.
The embodiment of the invention provides a method, a device, equipment and a storage medium for determining parallelism. The initial parallelism of each sub-plan contained in an execution plan tree is determined, the execution plan tree being generated by parsing a query statement input by a user; the current sub-plan is scheduled according to the initial parallelism of the current sub-plan; when the current sub-plan has been scheduled, if the actual number of output data rows of the current sub-plan meets a preset condition, the initial parallelism of the sub-plans in the execution plan tree that are associated with the current sub-plan and have not yet been scheduled is adjusted according to the actual number of output data rows of the current sub-plan; and the next sub-plan continues to be scheduled until all sub-plans contained in the execution plan tree have been scheduled. During scheduling and execution, the scheme dynamically adjusts the parallelism of the unscheduled sub-plans associated with a scheduled sub-plan according to the actual number of output data rows of that scheduled sub-plan, which effectively solves the problem that the parallelism cannot be flexibly adjusted in the prior art, makes effective use of system resources, and improves execution efficiency.
Drawings
Fig. 1 is a flowchart of a method for determining parallelism according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an execution plan tree according to an embodiment of the present invention;
fig. 3 is a flowchart of a method for determining parallelism according to a second embodiment of the present invention;
fig. 4 is a structural diagram of a parallelism determining apparatus according to a third embodiment of the present invention;
fig. 5 is a structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like. In addition, the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
Example one
Fig. 1 is a flowchart of a method for determining parallelism according to an embodiment of the present invention, where the embodiment is applicable to a case of querying data in parallel, and the method may be executed by a parallelism determining apparatus, which may be implemented by software and/or hardware, and may be integrated in a computer device with a data processing function. Referring to fig. 1, the method may include the steps of:
and S110, determining the initial parallelism of each sub-plan contained in an execution plan tree, wherein the execution plan tree is generated by analyzing a query statement input by a user.
The query statement is used to query data in the database; this embodiment does not limit the form of the query statement or its query condition. It may be, for example, SELECT * FROM T1, T2, T3 WHERE T1.c1 = T2.d1 AND T2.d2 = T3.e2 AND T1.c2 <> 10, which queries data from the three data tables T1, T2 and T3 under the query condition T1.c1 = T2.d1 AND T2.d2 = T3.e2 AND T1.c2 <> 10. The execution plan tree indicates the execution order of the query statement; it can be generated by performing lexical, syntactic and semantic analysis on the query statement. In general, one query statement may correspond to several feasible execution plan trees, and this embodiment selects the least expensive of them as the basis for execution.
For example, referring to fig. 2, fig. 2 is a schematic diagram of an execution plan tree according to an embodiment of the present invention. This execution plan tree is the minimum-cost execution plan tree obtained by analyzing the query statement. TABLE SCAN is a table scan operator for scanning the data of a data table, e.g., TABLE SCAN(T1) represents scanning the data in data table T1. FILTER is a filter operator, SEND represents a send data operator, RECV represents a receive data operator, JOIN represents a join operator, and HASH JOIN represents a hash join operator.
A sub-plan is a part of the execution process obtained by splitting the execution plan tree at its receive data operators and send data operators; each sub-plan contains one send data operator and one or more receive data operators. The execution plan tree shown in fig. 2, for example, comprises 5 sub-plans. Considering that a table scan operator fetches data from data storage pages while a receive data operator receives data sent by other operators, this embodiment treats the table scan operator as a special receive data operator. According to the execution plan tree shown in fig. 2, sub-plan 3 and sub-plan 2 may be executed first, then the output results of sub-plan 3 and sub-plan 2 are hash-joined and the result is sent to sub-plan 4, and finally sub-plan 4 hash-joins the output results of sub-plan 1 and sub-plan 5 to produce the query result. The output of each sub-plan is the output of the send data operator SEND in that sub-plan; for example, the output of sub-plan 2 is the output of send data operator SEND1.
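To make the splitting described above easier to follow, the following sketch (not part of the patent text; all class, field and function names are illustrative assumptions) shows one possible in-memory representation of the execution plan tree of fig. 2, with sub-plans delimited by their send/receive data operators and the table scan operator treated as a special receive data operator.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Operator:
    kind: str                       # e.g. "TABLE SCAN", "FILTER", "HASH JOIN", "SEND", "RECV"
    children: List["Operator"] = field(default_factory=list)
    est_rows: Optional[int] = None  # estimated number of output data rows
    actual_rows: Optional[int] = None

@dataclass
class SubPlan:
    plan_id: int
    root: Operator                  # the root node operator, i.e. the sub-plan's SEND operator
    parallelism: int = 1            # initial parallelism; may be adjusted before scheduling
    scheduled: bool = False

def receive_like_operators(sub_plan: SubPlan) -> List[Operator]:
    """Collect the receive data operators of a sub-plan; the table scan operator
    is treated as a special receive data operator, as described above."""
    found: List[Operator] = []
    def walk(op: Operator) -> None:
        if op.kind in ("RECV", "TABLE SCAN"):
            found.append(op)
        for child in op.children:
            walk(child)
    walk(sub_plan.root)
    return found

# For example, sub-plan 2 of fig. 2 could be built as:
sub_plan_2 = SubPlan(
    plan_id=2,
    root=Operator("SEND", [Operator("FILTER", [Operator("TABLE SCAN")])]),
)
```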
Parallelism is the number of threads or processes started simultaneously to execute a sub-plan, and each sub-plan has an independent degree of parallelism. Many factors affect the parallelism, such as the cost of the sub-plan, the number of output data rows of the receive data operators in the sub-plan, the resource information currently available to the system, and the resource restriction information for the user. In one example, the parallelism may be determined from the number of output data rows of the receive data operators contained in the sub-plan, in combination with the resource information currently available to the system and the resource restriction information for the user. The resource information currently available to the system may be a percentage of the total available resources of the system, and the resource restriction information for the user may be determined based on the user's rights when the user logs in to the system. The initial parallelism is determined based on the estimated number of output data rows of the receive data operators, in combination with the resource information currently available to the system and the resource restriction information for the user. Illustratively, as shown in fig. 2, the initial parallelism of sub-plan 2 is 2, the initial parallelism of sub-plans 3 and 5 is 3, the initial parallelism of sub-plan 1 is 4, and the initial parallelism of sub-plan 4 is 6.
And S120, scheduling the current sub-plan according to the initial parallelism of the current sub-plan.
The scheduling order of the sub-plans in the execution plan tree shown in fig. 2 may be as follows: sub-plan 1 and sub-plan 2 are scheduled first and form a data pipeline; after sub-plan 2 finishes executing, sub-plan 3 is scheduled, and sub-plan 1 and sub-plan 3 then form a data pipeline; when sub-plan 1 outputs its result for the first time, sub-plan 4 is scheduled; when sub-plan 3 finishes executing, sub-plan 1 also finishes, and sub-plan 5 can then be scheduled, with sub-plan 4 and sub-plan 5 forming a new data pipeline. Specifically, a group of threads may be started according to the scheduling order and the initial parallelism of the current sub-plan, thereby scheduling the current sub-plan.
It should be noted that scheduling always ensures that a producer sub-plan and its consumer sub-plan are started simultaneously. In this embodiment, of two associated sub-plans, the one located at the lower layer of the execution plan tree is referred to as the producer sub-plan, and the one located at the upper layer is referred to as the consumer sub-plan. The roles of producer and consumer are relative: in an execution plan tree, a producer sub-plan can also act as a consumer sub-plan, and a consumer sub-plan can also act as a producer sub-plan. For example, sub-plan 2 and sub-plan 3 of fig. 2 are producer sub-plans relative to sub-plan 1, and sub-plan 1 is a consumer sub-plan relative to sub-plan 2 and sub-plan 3. As another example, sub-plan 1 is a producer sub-plan relative to sub-plan 4, and sub-plan 4 is a consumer sub-plan relative to sub-plan 1.
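As a rough illustration of the scheduling just described, the sketch below (continuing the illustrative SubPlan type above; run_worker is a hypothetical placeholder, not part of the patent) starts a group of threads equal to a sub-plan's parallelism and starts a producer sub-plan together with its consumer so that the two form a data pipeline.

```python
import threading
from typing import List

def run_worker(sub_plan: "SubPlan", worker_id: int) -> None:
    # Placeholder for executing one parallel slice of the sub-plan's operators.
    pass

def schedule_sub_plan(sub_plan: "SubPlan") -> List[threading.Thread]:
    # Start a group of threads according to the sub-plan's current parallelism.
    threads = [
        threading.Thread(target=run_worker, args=(sub_plan, i), daemon=True)
        for i in range(sub_plan.parallelism)
    ]
    for t in threads:
        t.start()
    sub_plan.scheduled = True
    return threads

def schedule_pipeline(producer: "SubPlan", consumer: "SubPlan") -> List[threading.Thread]:
    # A producer sub-plan and its consumer sub-plan are started together,
    # so data produced by the former can flow to the latter immediately.
    return schedule_sub_plan(producer) + schedule_sub_plan(consumer)
```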
S130, when the current sub-plan is scheduled, if the actual output data line number of the current sub-plan meets a preset condition, adjusting the initial parallelism of the sub-plan which is associated with the current sub-plan and is not scheduled in the execution plan tree according to the actual output data line number of the current sub-plan.
The parallelism of an upper-layer sub-plan is affected by the number of output data rows of its producer sub-plans; e.g., the parallelism of sub-plan 1 in fig. 2 is affected by the number of output data rows of sub-plan 2, and the parallelism of sub-plan 4 is affected by the number of output data rows of sub-plan 1. After the current sub-plan has been scheduled, its actual number of output data rows may be obtained; this is the actual number of output data rows of the send data operator in the current sub-plan. For example, the actual number of output data rows of sub-plan 3 is the actual number of output data rows of send data operator SEND2 in sub-plan 3.
The preset condition may be set according to actual needs, for example (C' - C)/C > k, where C' is the actual number of output data rows of the current sub-plan, C is its estimated number of output data rows, and k is a set threshold. Specifically, when the actual number of output data rows of the current sub-plan meets this condition, the parallelism of each sub-plan that is associated with the current sub-plan and has not yet been scheduled is re-determined and used as the adjusted initial parallelism. The process of re-determining a sub-plan's parallelism is similar to the process of determining the initial parallelism described above. As shown in fig. 2, assuming that the actual number of output data rows of sub-plan 2 after it finishes executing satisfies the above condition, the estimated number of output data rows of each operator in sub-plan 1 and sub-plan 4 is corrected; since sub-plan 1 has already been scheduled and sub-plan 4 has not, the parallelism of sub-plan 4 is re-determined according to the corrected estimated number of output data rows of receive data operator RECV1 in sub-plan 4. Dynamic adjustment of the parallelism of the unscheduled sub-plan is thus realized, and system resources are fully utilized.
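A minimal sketch of the preset condition (C' - C)/C > k follows; the default threshold value is an arbitrary assumption, since the patent only requires a configurable threshold.

```python
def deviation_exceeds_threshold(actual_rows: int, estimated_rows: int, k: float = 0.5) -> bool:
    # Preset condition: (C' - C) / C > k, with C' the actual and C the estimated
    # number of output data rows of the scheduled sub-plan.
    if estimated_rows <= 0:
        return True  # the estimate is unusable, so treat the deviation as significant
    return (actual_rows - estimated_rows) / estimated_rows > k
```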
And S140, continuing to schedule the next sub-plan until the sub-plans contained in the execution plan tree are scheduled.
The embodiment of the invention provides a method for determining parallelism, which comprises: determining the initial parallelism of each sub-plan contained in an execution plan tree, the execution plan tree being generated by parsing a query statement input by a user; scheduling the current sub-plan according to the initial parallelism of the current sub-plan; when the current sub-plan has been scheduled, if the actual number of output data rows of the current sub-plan meets a preset condition, adjusting the initial parallelism of the sub-plans in the execution plan tree that are associated with the current sub-plan and have not yet been scheduled according to the actual number of output data rows of the current sub-plan; and continuing to schedule the next sub-plan until all sub-plans contained in the execution plan tree have been scheduled. During scheduling and execution, the method dynamically adjusts the parallelism of the unscheduled sub-plans associated with a scheduled sub-plan according to the actual number of output data rows of that scheduled sub-plan, which effectively solves the problem that the parallelism cannot be flexibly adjusted in the prior art, makes effective use of system resources, and improves execution efficiency.
Example two
Fig. 3 is a flowchart of a method for determining parallelism according to a second embodiment of the present invention, where the present embodiment is optimized based on the foregoing embodiments, and referring to fig. 3, the method may include the following steps:
S210, determining the estimated number of output data rows of each operator according to the output row number estimation formula corresponding to that operator in the execution plan tree.
The initial parallelism is determined based on the estimated numbers of output data rows of the operators included in the sub-plan, so the estimated number of output data rows of each operator in the execution plan tree needs to be determined before the initial parallelism is determined. The operators included in the execution plan tree may be determined in advance, and this embodiment does not limit how they are determined. Each sub-plan included in the execution plan tree contains at least the data exchange operators, i.e. a send data operator and at least one receive data operator. Besides the data exchange operators, each sub-plan may contain other operators as needed; e.g., sub-plan 2 in fig. 2 includes a filter operator in addition to the data exchange operators, and sub-plan 1 and sub-plan 4 include hash join operators.
The output row number estimation formula is used to estimate the number of output rows of each operator in the execution plan; it is an important function of the database management system, and different operators may correspond to different output row number estimation formulas. For example, projection, sorting, window function, and send and receive data operators do not change the number of rows they receive, so their number of output data rows may be taken to be equal to their number of input data rows. For a filter operator, the number of output data rows may be the product of the number of input data rows and a selectivity, which may be estimated from statistical information or empirical values for the filter condition. The number of output data rows of a join operator may be calculated from the numbers of output data rows of its child operators, in combination with the join condition and the statistical information of the columns involved in the join condition.
Since different database management systems estimate the number of output data rows in different ways, this embodiment does not limit the specific implementation of the output row number estimation formula. In practical applications, a suitable output row number estimation formula can be selected according to the operators contained in each sub-plan of the execution plan tree and the database management system being used, and the number of output data rows of each operator is then estimated from the bottom of the execution plan tree upward according to the selected formula, yielding the estimated number of output data rows. For example, referring to fig. 2, the number of output data rows may be estimated starting from the table scan operator TABLE SCAN of sub-plan 2, then for the filter operator FILTER, and finally for the send data operator SEND1; similarly, sub-plan 3 also estimates the number of output data rows starting from its table scan operator TABLE SCAN, finally obtaining the estimated number of output data rows of send data operator SEND2; the number of output data rows of each operator in sub-plan 1 may then be estimated, and so on until the numbers of output data rows of all operators in the entire execution plan tree have been estimated.
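The per-operator rules above can be pictured with the following sketch, which estimates rows bottom-up over the illustrative Operator type introduced earlier. The selectivity, base table cardinality and join estimate are crude placeholder assumptions, not the estimation formulas of any particular database management system.

```python
def estimate_rows(op: "Operator", base_table_rows: int = 100_000,
                  filter_selectivity: float = 0.1) -> int:
    # Recurse bottom-up: estimate the children first, then this operator.
    child_rows = [estimate_rows(c, base_table_rows, filter_selectivity)
                  for c in op.children]
    if op.kind == "TABLE SCAN":
        op.est_rows = base_table_rows                       # from table statistics in practice
    elif op.kind == "FILTER":
        op.est_rows = int(child_rows[0] * filter_selectivity)
    elif op.kind in ("SEND", "RECV", "PROJECT", "SORT", "WINDOW"):
        # Row-preserving operators: output rows equal input rows. A childless RECV
        # keeps whatever estimate was pushed into it from its producer sub-plan.
        op.est_rows = child_rows[0] if child_rows else (op.est_rows or 0)
    elif op.kind == "HASH JOIN":
        op.est_rows = max(child_rows) if child_rows else 0  # crude join placeholder
    else:
        op.est_rows = sum(child_rows)
    return op.est_rows
```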
S220, determining the initial parallelism of each sub-plan according to the estimated output data row number of the received data operator contained in each sub-plan, the currently available resource information of the system and the resource limit information of the user.
Optionally, the initial parallelism of each sub-plan may be determined according to a parallelism calculation formula, the estimated output data line number of the received data operator included in each sub-plan, currently available resource information of the system, and resource limitation information of the user;
the parallelism calculation formula is as follows:
PARALLEL(X) = f_parallel(C1, C2, ..., Cn, R1, R2)
wherein X represents a sub-plan, PARALLEL(X) represents the parallelism of sub-plan X, Ci (i = 1, ..., n) is the estimated number of output data rows of the i-th received data operator, n is the number of received data operators, R1 is the resource information currently available to the system, and R2 is the resource restriction information for the user. Of course, other parallelism calculation formulas may also be used; this embodiment is not limited in this respect. For example, referring to fig. 2, the parallelism of sub-plan 2 may be expressed as PARALLEL(sub-plan 2) = f_parallel(C1, R1, R2), where C1 is the estimated number of output data rows of table scan operator TABLE SCAN(T1) in sub-plan 2. The parallelism of sub-plan 1 may be expressed as PARALLEL(sub-plan 1) = f_parallel(C1, C2, R1, R2), where C1 and C2 are the estimated numbers of output data rows of received data operator RECV1 and received data operator RECV2, respectively, in sub-plan 1.
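f_parallel itself is left abstract in the text, so the following sketch (reusing receive_like_operators from the earlier sketch) shows just one plausible shape for it under stated assumptions: scale with the largest estimated receive-operator input, then clamp by the system's currently available resources (R1) and the user's resource limit (R2). The rows_per_worker constant and the clamping policy are assumptions, not the patent's formula.

```python
from typing import List

def f_parallel(recv_row_estimates: List[int], r1_available_workers: int,
               r2_user_limit: int, rows_per_worker: int = 50_000) -> int:
    # Demand grows with the largest estimated input of the receive data operators.
    demand = max(1, max(recv_row_estimates, default=1) // rows_per_worker)
    # Clamp by currently available system resources (R1) and the user's limit (R2).
    return max(1, min(demand, r1_available_workers, r2_user_limit))

def initial_parallelism(sub_plan: "SubPlan", r1: int, r2: int) -> int:
    estimates = [op.est_rows or 0 for op in receive_like_operators(sub_plan)]
    sub_plan.parallelism = f_parallel(estimates, r1, r2)
    return sub_plan.parallelism
```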
And S230, scheduling the current sub-plan according to the initial parallelism of the current sub-plan.
And S240, when the current sub-plan is scheduled, judging whether the actual output data line number of the current sub-plan meets a preset condition, if so, executing S250, and otherwise, executing S260.
The preset condition in this embodiment is that the ratio of the difference between the actual number of output data rows of the current sub-plan and its estimated number of output data rows to the estimated number of output data rows is greater than a set threshold, whose size can be chosen as required. After the current sub-plan has been scheduled, if its actual number of output data rows meets the preset condition, the initial parallelism of the unscheduled sub-plans is adjusted dynamically; otherwise the next sub-plan continues to be scheduled.
And S250, adjusting the initial parallelism of the sub-plan which is associated with the current sub-plan and is not scheduled in the execution plan tree according to the actual output data line number of the current sub-plan.
In the embodiment, the parallelism of the sub-plan which is not scheduled is dynamically adjusted according to the actual output result of the scheduled sub-plan in the execution process, so that the system resources can be more effectively utilized, the execution efficiency is improved, and the execution of the whole execution plan tree is not influenced. In one example, the initial parallelism of the sub-plan associated with the current sub-plan and not scheduled may be adjusted by:
traversing the execution plan tree upward from a root node operator of the current sub-plan, determining a sub-plan of the execution plan tree that is associated with the current sub-plan;
correcting the number of estimated output data lines of the sub-plan associated with the current sub-plan according to the number of actual output data lines of the current sub-plan;
and determining the parallelism of the unscheduled sub-plan associated with the current sub-plan according to the corrected estimated output data line number, the currently available resource information of the system and the resource limit information of the user, and taking the parallelism as the adjusted initial parallelism.
Specifically, when the actual number of output data rows of the current sub-plan meets the preset condition, the execution plan tree is traversed upward from the root node operator of the current sub-plan, and the upper-layer sub-plans associated with the current sub-plan are determined; the root node operator is the uppermost operator in the current sub-plan. For example, if the current sub-plan is sub-plan 2, its root node operator is send data operator SEND1, and the execution plan tree is traversed upward starting from send data operator SEND1, yielding the sub-plans that are associated with and above the current sub-plan, namely sub-plan 1 and sub-plan 4, of which sub-plan 4 has not yet been scheduled. The estimated number of output data rows of each operator in the traversed sub-plans is corrected according to the actual number of output data rows of the current sub-plan; the correction process is similar to the process of determining the estimated number of output data rows. If a traversed sub-plan has not been scheduled, the above parallelism calculation formula is invoked and its initial parallelism is adjusted according to the corrected estimated number of output data rows; the adjustment process is similar to the process of determining the initial parallelism.
For example, assuming that the actual number of output data rows of sub-plan 2 after it finishes executing satisfies the preset condition, the corrected estimated numbers of output data rows of the operators in sub-plan 1 and sub-plan 4 are recursively calculated. Since sub-plan 1 has already been scheduled, its parallelism does not change; sub-plan 4 has not yet been scheduled, so its parallelism is recalculated according to the parallelism calculation formula as PARALLEL(sub-plan 4) = f_parallel(C1', C2, R1', R2'), where C1' is the corrected estimated number of output data rows of receive data operator RECV1 in sub-plan 4, that is, the estimated number of output data rows of sub-plan 1, C2 is the previously estimated number of output data rows of sub-plan 5, R1' is the updated resource information currently available to the system, and R2' is the updated resource restriction information for the user.
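Putting S240 and S250 together, a sketch of the dynamic adjustment (reusing the illustrative helpers above) might look as follows. The consumers_of mapping, the way corrected estimates are pushed into the receive operators, and the propagation policy are all simplifying assumptions; the patent only requires traversing upward from the current sub-plan's root node operator, correcting the estimates, and re-running the parallelism formula for sub-plans that have not yet been scheduled.

```python
from typing import Dict, List

def adjust_unscheduled_parallelism(current: "SubPlan",
                                   consumers_of: Dict[int, List["SubPlan"]],
                                   r1: int, r2: int, k: float = 0.5) -> None:
    actual = current.root.actual_rows or 0
    estimated = current.root.est_rows or 0
    if not deviation_exceeds_threshold(actual, estimated, k):
        return  # the estimate was close enough; keep the initial parallelism
    # Traverse the execution plan tree upward from the current sub-plan's root.
    frontier = list(consumers_of.get(current.plan_id, []))
    while frontier:
        sub_plan = frontier.pop()
        # Correct the estimates of this sub-plan's receive data operators
        # (simplified: every RECV gets the producer's actual row count; a real
        # system would only correct the operator actually fed by `current`).
        for op in receive_like_operators(sub_plan):
            if op.kind == "RECV":
                op.est_rows = actual
        if not sub_plan.scheduled:
            # Re-run the parallelism formula; the result is the adjusted parallelism.
            initial_parallelism(sub_plan, r1, r2)
        frontier.extend(consumers_of.get(sub_plan.plan_id, []))
```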
And S260, continuously scheduling the next sub-plan until the sub-plans contained in the execution plan tree are scheduled.
On the basis of the above embodiment, the parallelism of the sub-plan which is associated with the scheduled sub-plan and is not scheduled is dynamically adjusted according to the actual output result of the scheduled sub-plan, so that the problem that the parallelism is unchanged in the execution process and cannot be dynamically adjusted according to the actual situation in the prior art is effectively solved, system resources can be more effectively utilized, the execution efficiency is improved, and the execution of the whole execution plan tree is not influenced.
In this embodiment, the specific scheduling management of each sub-plan may be handled by a respective sub-scheduling manager (SQC). The SQC starts a group of threads according to the specified parallelism and, after all threads of the group have finished, reports to the general scheduling manager (QC), so that the QC can dynamically adjust the parallelism of the sub-plans that have not yet been scheduled according to the actual number of output data rows of the current sub-plan fed back by the SQC.
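The QC/SQC split described above could be organized roughly as in the sketch below; the class names, the synchronous reporting, and the purely sequential scheduling loop are simplifications of my own (in the scheme itself, producer and consumer sub-plans run concurrently as a pipeline).

```python
from typing import Dict, List

class SubScheduler:                      # plays the role of an SQC
    def __init__(self, sub_plan: "SubPlan"):
        self.sub_plan = sub_plan

    def run(self) -> int:
        threads = schedule_sub_plan(self.sub_plan)
        for t in threads:
            t.join()                     # wait for the whole thread group to finish
        # Report the actual number of output data rows back to the QC.
        return self.sub_plan.root.actual_rows or 0

class GeneralScheduler:                  # plays the role of the QC
    def __init__(self, sub_plans: List["SubPlan"],
                 consumers_of: Dict[int, List["SubPlan"]], r1: int, r2: int):
        self.sub_plans = sub_plans       # assumed to already be in scheduling order
        self.consumers_of = consumers_of
        self.r1, self.r2 = r1, r2

    def execute(self) -> None:
        for sub_plan in self.sub_plans:
            SubScheduler(sub_plan).run()
            # After the SQC reports back, adjust the unscheduled sub-plans.
            adjust_unscheduled_parallelism(sub_plan, self.consumers_of, self.r1, self.r2)
```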
EXAMPLE III
Fig. 4 is a structural diagram of a parallelism determining apparatus according to a third embodiment of the present invention, which can execute the parallelism determining method according to the third embodiment, and with reference to fig. 4, the apparatus includes:
a parallelism determining module 31, configured to determine an initial parallelism of each sub-plan included in an execution plan tree, where the execution plan tree is generated by parsing a query statement input by a user;
a scheduling module 32, configured to schedule the current sub-plan according to the initial parallelism of the current sub-plan;
and a parallelism adjusting module 33, configured to, when the current sub-plan is scheduled, adjust the initial parallelism of the sub-plan which is associated with the current sub-plan and is not scheduled in the execution plan tree according to the number of actual output data lines of the current sub-plan if the number of actual output data lines of the current sub-plan meets a preset condition.
The scheduling module 32 is further configured to continue scheduling the next sub-plan until the sub-plan included in the execution plan tree is scheduled.
The third embodiment of the present invention provides a parallelism determining apparatus, which determines the initial parallelism of each sub-plan included in an execution plan tree, the execution plan tree being generated by parsing a query statement input by a user; schedules the current sub-plan according to the initial parallelism of the current sub-plan; when the current sub-plan has been scheduled, if the actual number of output data rows of the current sub-plan meets a preset condition, adjusts the initial parallelism of the sub-plans in the execution plan tree that are associated with the current sub-plan and have not yet been scheduled according to the actual number of output data rows of the current sub-plan; and continues to schedule the next sub-plan until all sub-plans contained in the execution plan tree have been scheduled. During scheduling and execution, the apparatus dynamically adjusts the parallelism of the unscheduled sub-plans associated with a scheduled sub-plan according to the actual number of output data rows of that scheduled sub-plan, which effectively solves the problem that the parallelism cannot be flexibly adjusted in the prior art, makes effective use of system resources, and improves execution efficiency.
On the basis of the above embodiment, the parallelism determining module 31 includes:
a row number determining unit, configured to determine an estimated output data row number of each operator according to an output row number estimation formula corresponding to each operator in the execution plan tree;
and the parallelism determining unit is used for determining the initial parallelism of each sub-plan according to the estimated output data row number of the received data operator contained in each sub-plan, the currently available resource information of the system and the resource limit information of the user.
On the basis of the foregoing embodiment, the parallelism adjusting module 33 is specifically configured to:
if the ratio of the difference value of the actual output data line number of the current sub-plan and the estimated output data line number of the current sub-plan to the estimated output data line number is larger than a set threshold, traversing the execution plan tree from the root node operator of the current sub-plan upwards, and determining the sub-plan in the execution plan tree associated with the current sub-plan;
correcting the number of estimated output data lines of the sub-plan associated with the current sub-plan according to the number of actual output data lines of the current sub-plan;
and determining the parallelism of the unscheduled sub-plan associated with the current sub-plan according to the corrected estimated output data line number, the currently available resource information of the system and the resource limit information of the user, and taking the parallelism as the adjusted initial parallelism.
On the basis of the foregoing embodiment, the parallelism determining unit is specifically configured to:
determining the initial parallelism of each sub-plan according to a parallelism calculation formula, the estimated output data row number of the received data operator contained in each sub-plan, the currently available resource information of the system and the resource limit information of the user;
the parallelism calculation formula is as follows:
PARALLEL(X) = f_parallel(C1, C2, ..., Cn, R1, R2)
wherein X represents a sub-plan, PARALLEL(X) represents the parallelism of sub-plan X, Ci (i = 1, ..., n) is the estimated number of output data rows of the i-th received data operator, n is the number of received data operators, R1 is the resource information currently available to the system, and R2 is the resource restriction information for said user.
The parallelism determining device provided by the embodiment of the invention can be used for executing the parallelism determining method provided by the embodiment, and has corresponding functions and beneficial effects.
Example four
Fig. 5 is a structural diagram of a computer device according to a fourth embodiment of the present invention, and referring to fig. 5, the computer device includes a processor 41, a memory 42, an input device 43, and an output device 44, the number of the processors 41 in the computer device may be one or more, in fig. 5, taking one processor 41 as an example, the processor 41, the memory 42, the input device 43, and the output device 44 in the computer device may be connected by a bus or in another manner, and in fig. 5, the connection by the bus is taken as an example.
The memory 42 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the parallelism determination method in the embodiments of the present invention. By executing the software programs, instructions and modules stored in the memory 42, the processor 41 performs the various functional applications and data processing of the computer device, that is, implements the parallelism determination method of the above-described embodiments.
The memory 42 mainly includes a program storage area and a data storage area, wherein the program storage area can store an operating system and an application program required by at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 42 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 42 may further include memory located remotely from processor 41, which may be connected to a computer device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 43 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function controls of the computer apparatus. The output device 44 may include a display device such as a display screen, and an audio device such as a speaker and a buzzer.
The computer device provided by the embodiment of the present invention belongs to the same inventive concept as the method for determining the parallelism provided by the above embodiment, and the technical details that are not described in detail in the embodiment can be referred to the above embodiment, and the embodiment has the same advantageous effects as the method for determining the parallelism.
EXAMPLE five
An embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is used, when executed by a processor, to perform a method for determining parallelism, and the method includes:
determining the initial parallelism of each sub-plan contained in an execution plan tree, wherein the execution plan tree is generated by analyzing a query statement input by a user;
scheduling the current sub-plan according to the initial parallelism of the current sub-plan;
when the current sub-plan is scheduled, if the actual output data line number of the current sub-plan meets a preset condition, adjusting the initial parallelism of the sub-plan which is associated with the current sub-plan and is not scheduled in the execution plan tree according to the actual output data line number of the current sub-plan;
and continuing to schedule the next sub-plan until the sub-plans contained in the execution plan tree are scheduled.
Storage media for embodiments of the present invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a flash Memory, an optical fiber, a portable CD-ROM, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. A computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take a variety of forms, including, but not limited to: an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A method for determining parallelism, comprising:
determining the initial parallelism of each sub-plan contained in an execution plan tree, wherein the execution plan tree is generated by analyzing a query statement input by a user;
scheduling the current sub-plan according to the initial parallelism of the current sub-plan;
when the current sub-plan is scheduled, if the actual output data line number of the current sub-plan meets a preset condition, adjusting the initial parallelism of the sub-plan which is associated with the current sub-plan and is not scheduled in the execution plan tree according to the actual output data line number of the current sub-plan;
and continuing to schedule the next sub-plan until the sub-plans contained in the execution plan tree are scheduled.
2. The method of claim 1, wherein determining an initial degree of parallelism for each sub-plan included in the execution plan tree comprises:
determining the estimated output data row number of each operator according to the output row number estimation formula corresponding to that operator in the execution plan tree;
and determining the initial parallelism of each sub-plan according to the estimated output data row number of the received data operator contained in each sub-plan, the currently available resource information of the system and the resource limit information of the user.
3. The method of claim 1, wherein if the number of actual output data rows of the current sub-plan satisfies a predetermined condition, adjusting the initial parallelism of the non-scheduled sub-plans associated with the current sub-plan in the execution plan tree according to the number of actual output data rows of the current sub-plan comprises:
if the ratio of the difference value of the actual output data line number of the current sub-plan and the estimated output data line number of the current sub-plan to the estimated output data line number is larger than a set threshold, traversing the execution plan tree from the root node operator of the current sub-plan upwards, and determining the sub-plan in the execution plan tree associated with the current sub-plan;
correcting the number of estimated output data lines of the sub-plan associated with the current sub-plan according to the number of actual output data lines of the current sub-plan;
and determining the parallelism of the unscheduled sub-plan associated with the current sub-plan according to the corrected estimated output data line number, the currently available resource information of the system and the resource limit information of the user, and taking the parallelism as the adjusted initial parallelism.
4. The method of claim 2, wherein determining the initial parallelism for each sub-plan based on the estimated number of rows of output data for the received data operators included in each sub-plan, the information on resources currently available to the system, and the information on resource constraints for the user comprises:
determining the initial parallelism of each sub-plan according to a parallelism calculation formula, the estimated output data row number of the received data operator contained in each sub-plan, the currently available resource information of the system and the resource limit information of the user;
the parallelism calculation formula is as follows:
PARALLEL(X) = f_parallel(C1, C2, ..., Cn, R1, R2)
wherein X represents the sub-plan, PARALLEL(X) represents the parallelism of the sub-plan X, Ci (i = 1, ..., n) is the estimated output data row number of the i-th received data operator, n is the number of received data operators, R1 is resource information currently available to the system, and R2 is resource restriction information for the user.
5. A parallelism determination apparatus, comprising:
a parallelism determination module, which is used for determining the initial parallelism of each sub-plan contained in an execution plan tree, wherein the execution plan tree is generated by analyzing a query statement input by a user;
the scheduling module is used for scheduling the current sub-plan according to the initial parallelism of the current sub-plan;
the parallelism adjusting module is used for adjusting the initial parallelism of the sub-plan which is related to the current sub-plan and is not scheduled in the execution plan tree according to the actual output data line number of the current sub-plan if the actual output data line number of the current sub-plan meets a preset condition when the current sub-plan is scheduled;
and the scheduling module is further used for continuing to schedule the next sub-plan until the sub-plan contained in the execution plan tree is scheduled.
6. The apparatus of claim 5, wherein the parallelism determination module comprises:
a row number determining unit, configured to determine an estimated output data row number of each operator according to an output row number estimation formula corresponding to each operator in the execution plan tree;
and the parallelism determining unit is used for determining the initial parallelism of each sub-plan according to the estimated output data row number of the received data operator contained in each sub-plan, the currently available resource information of the system and the resource limit information of the user.
7. The apparatus of claim 5, wherein the parallelism adjustment module is specifically configured to:
if the ratio of the difference value of the actual output data line number of the current sub-plan and the estimated output data line number of the current sub-plan to the estimated output data line number is larger than a set threshold, traversing the execution plan tree from the root node operator of the current sub-plan upwards, and determining the sub-plan in the execution plan tree associated with the current sub-plan;
correcting the number of estimated output data lines of the sub-plan associated with the current sub-plan according to the number of actual output data lines of the current sub-plan;
and determining the parallelism of the unscheduled sub-plan associated with the current sub-plan according to the corrected estimated output data line number, the currently available resource information of the system and the resource limit information of the user, and taking the parallelism as the adjusted initial parallelism.
8. The apparatus according to claim 6, wherein the parallelism determination unit is specifically configured to:
determining the initial parallelism of each sub-plan according to a parallelism calculation formula, the estimated output data row number of the received data operator contained in each sub-plan, the currently available resource information of the system and the resource limit information of the user;
the parallelism calculation formula is as follows:
PARALLEL(X) = f_parallel(C1, C2, ..., Cn, R1, R2)
wherein X represents the sub-plan, PARALLEL(X) represents the parallelism of the sub-plan X, Ci (i = 1, ..., n) is the estimated output data row number of the i-th received data operator, n is the number of received data operators, R1 is resource information currently available to the system, and R2 is resource restriction information for the user.
9. A computer device, comprising:
one or more processors;
a memory for storing one or more programs;
when executed by the one or more processors, cause the one or more processors to implement the method of determining parallelism as claimed in any one of claims 1-4.
10. A storage medium on which a computer program is stored, which program, when being executed by a processor, carries out the method of determining parallelism as claimed in any one of claims 1 to 4.
CN202010865389.1A 2020-08-25 2020-08-25 Method, device and equipment for determining parallelism and storage medium Pending CN112035523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010865389.1A CN112035523A (en) 2020-08-25 2020-08-25 Method, device and equipment for determining parallelism and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010865389.1A CN112035523A (en) 2020-08-25 2020-08-25 Method, device and equipment for determining parallelism and storage medium

Publications (1)

Publication Number Publication Date
CN112035523A 2020-12-04

Family

ID=73580068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010865389.1A Pending CN112035523A (en) 2020-08-25 2020-08-25 Method, device and equipment for determining parallelism and storage medium

Country Status (1)

Country Link
CN (1) CN112035523A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030055813A1 (en) * 2001-05-15 2003-03-20 Microsoft Corporation Query optimization by sub-plan memoization
CN107025225A (en) * 2016-01-30 2017-08-08 华为技术有限公司 A kind of parallel execution method and apparatus of terminal database
CN110100241A (en) * 2016-12-16 2019-08-06 华为技术有限公司 It is a kind of for compiling the Database Systems and method of serial and concurrent data base querying executive plan
CN108121792A (en) * 2017-12-20 2018-06-05 第四范式(北京)技术有限公司 Method, apparatus, equipment and the storage medium of task based access control parallel data processing stream
KR20200063962A (en) * 2018-11-28 2020-06-05 서울대학교산학협력단 Distributed processing system and operating method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MALIK T: "A Black-Box Approach to Query Cardinality Estimation", THIRD BIENNIAL CONFERENCE ON INNOVATIVE DATA SYSTEMS RESEARCH, 10 January 2007 (2007-01-10), pages 56 *
许新华; 胡世港; 唐胜群; 刘华东: "History, Current Status and Future of Database Query Optimization Technology" (数据库查询优化技术的历史、现状与未来), Computer Engineering and Applications (计算机工程与应用), no. 18, 21 June 2009 (2009-06-21) *
高锦涛; 李战怀; 刘文洁: "An Efficient and Accurate Cardinality Estimation Strategy Based on Query Results" (一种高效准确的基于查询结果的基数估计策略), Journal of Northwestern Polytechnical University (西北工业大学学报), vol. 36, no. 4, 15 August 2018 (2018-08-15), pages 769 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024078080A1 (en) * 2022-10-14 2024-04-18 华为技术有限公司 Database query method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
US10320623B2 (en) Techniques for tracking resource usage statistics per transaction across multiple layers of protocols
EP2822236B1 (en) Network bandwidth distribution method and terminal
US20070079021A1 (en) Selective I/O prioritization by system process/thread and foreground window identification
CN111435354A (en) Data export method and device, storage medium and electronic equipment
US8392577B2 (en) Reduction of message flow between bus-connected consumers and producers
CN114339135A (en) Load balancing method and device, electronic equipment and storage medium
CN108959571B (en) SQL statement operation method and device, terminal equipment and storage medium
CN112035523A (en) Method, device and equipment for determining parallelism and storage medium
WO2019029721A1 (en) Task scheduling method, apparatus and device, and storage medium
CN116248699B (en) Data reading method, device, equipment and storage medium in multi-copy scene
CN111917595A (en) System upgrading method and device, intelligent equipment and storage medium
CN114884893B (en) Forwarding and control definable cooperative traffic scheduling method and system
CN112667368A (en) Task data processing method and device
CN114443293A (en) Deployment system and method for big data platform
CN113391927A (en) Method, device and system for processing business event and storage medium
CN112399470A (en) LoRa communication method, LoRa gateway, LoRa system and computer readable storage medium
CN111625524B (en) Data processing method, device, equipment and storage medium
CN111125161A (en) Real-time data processing method, device, equipment and storage medium
CN111459653A (en) Cluster scheduling method, device and system and electronic equipment
CN110209645A (en) Task processing method, device, electronic equipment and storage medium
CN115987905A (en) Multi-channel flow control method, system, equipment and storage medium
CN105765569A (en) Data distribution method, loader and storage system
CN117221245A (en) Message sending method and device, electronic equipment and storage medium
CN114721829A (en) Coroutine stack resource allocation method, coroutine stack resource allocation device, coroutine stack resource allocation equipment and storage medium
CN116186077A (en) Information transfer method, apparatus, database system, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination