CN109558232B - Method, device, equipment and medium for determining parallelism - Google Patents

Method, device, equipment and medium for determining parallelism

Info

Publication number
CN109558232B
CN109558232B (application CN201811436295.1A)
Authority
CN
China
Prior art keywords
parallelism
node
execution plan
plan tree
initial
Prior art date
Legal status
Active
Application number
CN201811436295.1A
Other languages
Chinese (zh)
Other versions
CN109558232A (en)
Inventor
陈振强
熊仲健
刘汪根
Current Assignee
Transwarp Technology Shanghai Co Ltd
Original Assignee
Star Link Information Technology (shanghai) Co Ltd
Priority date
Filing date
Publication date
Application filed by Star Link Information Technology (shanghai) Co Ltd filed Critical Star Link Information Technology (shanghai) Co Ltd
Priority to CN201811436295.1A
Publication of CN109558232A
Application granted
Publication of CN109558232B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources where the resource is a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038: Allocation of resources considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 9/5061: Partitioning or combining of resources
    • G06F 9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

An embodiment of the invention discloses a method, device, equipment and medium for determining parallelism. The method comprises: obtaining an execution plan tree of a distributed computing task; respectively determining, according to a preset cost model and data statistical information of the distributed computing task, a parallelism influence factor for the operation corresponding to each node in the execution plan tree; determining, according to the preset cost model and the data statistical information of the distributed computing task, the initial parallelism of the table scanning operations in the execution plan tree; and, starting from the initial parallelism of the table scanning operations and following a post-order traversal order, calculating the parallelism of the operation corresponding to each node in the execution plan tree from the parallelism influence factors of those operations. The method avoids the drawbacks of prior-art parallelism control schemes, improves the performance, stability and usability of a distributed computing engine, and achieves adaptive parallelism control.

Description

Method, device, equipment and medium for determining parallelism
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a method, a device, equipment and a medium for determining parallelism.
Background
A common distributed computing engine framework comprises three stages: mapping (map), data redistribution (shuffle) and reduction (reduce). It represents a process in which data is first processed at the map end, then re-aggregated at the shuffle stage according to certain rules, and finally computed at the reduce end. The parallelism determines how many map tasks and how many reduce tasks run simultaneously in the system, and is a key factor affecting the parallel execution efficiency and stability of a distributed system.
A parallelism that is too large or too small affects the execution efficiency and stability of the system. If it is too large, the amount of data processed by a single task is so small that each execution unit cannot be fully utilized, which wastes resources, increases scheduling overhead, and produces too many small files. If it is too small, a single task becomes too heavy and the CPU/memory/IO pressure on the system grows, which degrades performance and may even affect system stability; at the same time, because the tasks are concentrated on a limited number of child nodes, other execution units that could run concurrently cannot be used effectively. Therefore, how to control parallelism is an important issue in distributed computing engine applications.
At present, parallelism schemes for distributed computing engines mainly focus on controlling the reduce end and lack control of the map end. Several common approaches are:
(1) Fixed parallelism control: a fixed, uniform number of reducers is set to determine the default parallelism of the system.
This method is simple and direct, but it cannot adapt to the different data distributions and computation types of different computing tasks, which in turn affects the performance and stability of the system. Moreover, when manual intervention is used to set a fixed parallelism for a specific computing task, usability and extensibility are poor.
(2) Parallelism control based on an absolute value: the number of reducers N is determined from the estimated total shuffle data volume at the map end, totalShuffleSize, and a preset data volume processed by a single reducer, dataSizePerReducer, as N = totalShuffleSize / dataSizePerReducer. The number of reducers is thus calculated from an absolute value; a typical system is Hive.
This method requires presetting the data volume processed by a single reducer and cannot adapt to the different data distributions and computation types of different computing tasks; in addition, estimating the total shuffle data volume as an absolute value is prone to inaccuracy.
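For illustration only, the sketch below shows this prior-art, absolute-value style of calculation; the function name, the cap on the reducer count, and the example numbers are assumptions introduced for the sketch, not details of Hive.

import math

def absolute_value_reducer_count(total_shuffle_size_bytes: int,
                                 data_size_per_reducer_bytes: int,
                                 max_reducers: int = 1009) -> int:
    """Prior-art style: derive the reducer count from absolute data volumes.
    Both inputs are assumed to be estimates; the cap is purely illustrative."""
    n = math.ceil(total_shuffle_size_bytes / data_size_per_reducer_bytes)
    return max(1, min(n, max_reducers))

# e.g. 100 GB of estimated shuffle data at 256 MB per reducer -> 400 reducers
print(absolute_value_reducer_count(100 * 2**30, 256 * 2**20))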
(3) Parallelism control based on reduce-side merging: after the map tasks finish, the data-shard information recorded by the system runtime is used to partition and merge, combining many small tasks into reduce tasks of a preset size and thereby determining the reduce partitioning. Typical systems are Spark with certain extensions.
In this method the reducers are formed by merging map partitions, so problems such as resource occupation and too many small files in the map stage remain unsolved; moreover, the execution plan has to be modified dynamically, which is strongly intrusive to the distributed computing engine framework.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a device, and a medium for determining a parallelism, so as to optimize a parallelism control method in the prior art, implement self-adaptation of parallelism control, and improve performance, stability, and availability of a distributed computing engine.
In a first aspect, an embodiment of the present invention provides a method for determining a parallelism, including:
acquiring an execution plan tree of a distributed computing task, wherein a root node of the execution plan tree corresponds to an output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scanning operation in the execution plan;
respectively determining the parallelism influence factor of the operation corresponding to each node in the execution plan tree according to a preset cost model and data statistical information of the distributed computing task;
determining the initial parallelism of table scanning operation according to a preset cost model and the data statistical information of the distributed computing task;
and respectively calculating the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation, a post-order traversal order, and the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
In a second aspect, an embodiment of the present invention further provides a device for determining parallelism, including:
the execution plan tree obtaining module is used for obtaining an execution plan tree of a distributed computing task, wherein a root node of the execution plan tree corresponds to an output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scanning operation in the execution plan;
the parallelism factor determining module is used for respectively determining the parallelism factor of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task;
the initial parallelism determining module of the table scanning operation is used for determining the initial parallelism of the table scanning operation according to a preset cost model and the data statistical information of the distributed computing task;
and the parallelism determining module is used for respectively calculating the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation, the post-order traversal order, and the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
In a third aspect, an embodiment of the present invention further provides an apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the method for determining the parallelism according to any embodiment of the present invention.
In a fourth aspect, the embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method for determining the parallelism according to any embodiment of the present invention.
In the method, apparatus, device, and medium for determining parallelism according to embodiments of the present invention, an execution plan tree of a distributed computing task is first obtained. Parallelism influence factors, which indicate how the operation corresponding to each node in the execution plan tree affects the parallelism of subsequent operations, are then determined according to a preset cost model and data statistical information of the distributed computing task. Next, the initial parallelism of the table scanning operations in the execution plan tree is determined from the same cost model and statistics, and the parallelism of the operation corresponding to each node in the execution plan tree is determined from the initial parallelism of the table scanning operations and the parallelism influence factors, yielding a parallelism control scheme for the distributed computing task. This method avoids the drawbacks of fixed parallelism and of control schemes based on absolute values or reduce-side merging, improves the performance, stability, and usability of the distributed computing engine, and achieves adaptive parallelism control.
Drawings
Fig. 1 is a flowchart of a method for determining parallelism according to a first embodiment of the present invention;
FIG. 2 is an exemplary diagram of an execution plan tree in accordance with one embodiment of the present invention;
FIG. 3 is a flowchart of a method for determining parallelism according to a second embodiment of the present invention;
FIG. 4 is an exemplary diagram of an execution plan tree with generated parallelism influence factors according to the second embodiment of the present invention;
FIG. 5 is an exemplary diagram of an execution plan tree with generated parallelism according to the second embodiment of the present invention;
fig. 6 is a schematic structural diagram of a parallelism determination apparatus according to a third embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware structure of an apparatus according to a fourth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Embodiment One
Fig. 1 is a flowchart of a parallelism determination method provided in the first embodiment of the present invention. The method is applicable to performing parallelism control in a distributed computing task and may be executed by a parallelism determination apparatus provided in an embodiment of the present invention; the apparatus may be implemented in software and/or hardware and may generally be integrated in a processor. As shown in Fig. 1, the method of this embodiment specifically includes:
s110, obtaining an execution plan tree of the distributed computing task, wherein a root node of the execution plan tree corresponds to output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to table scanning operation in the execution plan.
The execution plan of the distributed computing task is usually generated by an SQL compiler and is an abstract description of each phase of the distributed computing task. Take the following SQL statement as an example:
select i_manager_id, i_size, count(sr_ticket_number)
from store_sales ss join item i on ss.ss_item_sk = i.i_item_sk
join store_returns sr on ss.ss_customer_sk = sr.sr_customer_sk
where ss_store_sk > 10 and i_item_id = 'A'
group by i_manager_id, i_size;
the execution plan tree corresponding to the SQL statement is shown in fig. 2, and represents a process of performing broadcast connection after filtering and projecting the "store _ sales table" and the "item table" respectively, then performing connection with the "store _ return table", and finally performing aggregation. Where there is no shuffle in the broadcast join (mj1) operation, the join (cj1) operation and the aggregate (g1) operation require a shuffle.
That is, the leaf nodes of the execution plan tree correspond to the sweep operations for the "store _ sales table", "item table", and "store _ return table".
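For illustration, such an execution plan tree can be modeled as a small tree data structure. In the sketch below the node names and operation labels follow Fig. 2, while the Python class itself is an illustrative assumption, not part of the patented system.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PlanNode:
    """One node of an execution plan tree: an operation plus its child nodes."""
    name: str                      # e.g. "ts1", "f1", "mj1"
    op_type: str                   # e.g. "table_scan", "filter", "broadcast_join"
    children: List["PlanNode"] = field(default_factory=list)
    pif: Optional[float] = None    # parallelism influence factor, filled in later
    parallelism: Optional[int] = None

# A fragment of the tree in Fig. 2: scan -> filter -> projection for store_sales.
ts1 = PlanNode("ts1", "table_scan")
f1 = PlanNode("f1", "filter", [ts1])
pj1 = PlanNode("pj1", "projection", [f1])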
And S120, respectively determining the parallelism influence factor of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task.
The parallelism influence factor is a scale factor describing how much the current operation affects the parallelism of subsequent operations; in the execution plan tree shown in Fig. 2, for example, the filtering (f1) operation has a parallelism influence factor that scales the parallelism of the operations following it.
In this step, the parallelism influence factor of each operation in the execution plan tree is determined, where the way the factor is determined depends on the specific operation type.
Specifically, a parallelism decision model for generating the parallelism influence factors may be established in advance; this model performs the step, in the parallelism determination method provided by the embodiment of the present invention, of determining the parallelism influence factor of each operation according to the preset cost model and the data statistical information of the distributed computing task.
And S130, determining the initial parallelism of the table scanning operation according to the preset cost model and the data statistical information of the distributed computing task.
Determining the initial parallelism of the table scanning operation is the basis for parallelism control of the entire distributed computing task, because the parallelism of the subsequent computation is determined on top of this initial parallelism.
Most existing parallelism schemes determine the parallelism of the table scanning operation, i.e., of the initial tasks, from the number of file splits, which can lead to too many initial tasks, occupation of parallel computing resources, too many small files, and similar problems.
In the embodiment, the initial parallelism of the table scanning operation is determined by adopting a preset cost model and data statistical information of a distributed computing task.
As an optional implementation of this embodiment, the initial parallelism of the table scanning operation may be determined according to the preset cost model and the data statistical information of the distributed computing task, specifically: the initial parallelism of the table scanning operation is calculated according to the preset cost model and the data volume processed at the mapping end in the distributed computing task.
That is, the initial parallelism of the table scanning operation is estimated from the preset cost model and the data volume processed at the mapping end in the distributed computing task.
Assuming that the output of the cost model is cost and the data volume processed by a single map task is dataSizePerMap, the initial parallelism IP of the table scanning operation can be calculated as IP = cost / dataSizePerMap. The initial parallelism determined in this way is not a fixed value; it varies with the cost model output and with the data volume processed by a single map task, so the initial parallelism of the table scanning operation is determined adaptively.
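A minimal sketch of this calculation, assuming the cost model output is an estimated number of bytes to scan; the function and parameter names are hypothetical.

import math

def initial_table_scan_parallelism(estimated_scan_cost_bytes: int,
                                   data_size_per_map_bytes: int) -> int:
    """IP = cost / dataSizePerMap, rounded up and kept at least 1."""
    ip = math.ceil(estimated_scan_cost_bytes / data_size_per_map_bytes)
    return max(1, ip)

# e.g. a 10 GB table with 512 MB per map task -> 20 initial scan tasks
print(initial_table_scan_parallelism(10 * 2**30, 512 * 2**20))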
Compared with parallelism control schemes based on reduce-side merging, this technical scheme performs initial parallelism control at the map end based on cost estimation, overcomes the lack of map-end control in existing parallelism schemes, avoids drawbacks such as occupation of system resources and too many small files, and makes no intrusive changes to the distributed computing engine.
And S140, respectively calculating the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation, a post-order traversal order, and the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
The execution plan tree is traversed in post-order and the parallelism of the operation on each node is determined. Taking the execution plan tree shown in Fig. 2 as an example, the parallelism of each operation is determined in turn along the arrow direction.
Specifically, a parallelism decision model for determining parallelism may be established in advance; this model performs the step of calculating the parallelism of the operation corresponding to each node in the execution plan tree from the initial parallelism of the table scanning operation, the post-order traversal order, and the parallelism influence factor of the operation corresponding to each node.
The parallelism of a table scanning operation is related to the initial parallelism and to its parallelism influence factor; for any operation other than a table scanning operation, the parallelism is related to the parallelism of the operations on the child nodes of its node together with its own parallelism influence factor, or only to the parallelism of the operations on those child nodes.
It should be noted that the parallelism decision model for generating the parallelism influence factors and the parallelism decision model for determining the parallelism may be the same model. That is, the parallelism influence factor of each operation is first determined from the preset cost model and the data statistical information of the distributed computing task and used as a scale factor measuring its effect on the parallelism of subsequent operations; the parallelism of each operation is then calculated from the initial parallelism of the table scanning operations, following the post-order traversal order, using the parallelism influence factors of the operations corresponding to the nodes in the execution plan tree.
In the parallelism determination method provided by this embodiment of the invention, an execution plan tree of a distributed computing task is first obtained; parallelism influence factors, which indicate how the operation corresponding to each node affects the parallelism of subsequent operations, are determined from a preset cost model and the data statistical information of the distributed computing task; the initial parallelism of the table scanning operations in the execution plan tree is then determined from the same cost model and statistics; and the parallelism of the operation corresponding to each node is determined from the initial parallelism of the table scanning operations and the parallelism influence factors, yielding a parallelism control scheme for the distributed computing task. This method avoids the drawbacks of fixed parallelism and of control schemes based on absolute values or reduce-side merging, improves the performance, stability, and usability of the distributed computing engine, and achieves adaptive parallelism control.
Embodiment Two
Fig. 3 is a flowchart of a parallelism determination method according to a second embodiment of the present invention, which is further refined on the basis of the first embodiment, wherein:
respectively determining the parallelism influence factor of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task is specifically: respectively determining an initial parallelism influence factor of the operation corresponding to each node in the execution plan tree according to the preset cost model and the data statistical information of the distributed computing task; and fitting each initial parallelism influence factor according to a preset mapping rule to obtain the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
Respectively calculating the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation, the post-order traversal order, and the parallelism influence factor of the operation corresponding to each node is specifically: sequentially taking, according to the post-order traversal order, the operation corresponding to one node as the current operation; if the current operation is a table scanning operation, calculating the parallelism of the current operation according to the initial parallelism of the table scanning operation and the parallelism influence factor of the current operation; if the current operation is not a table scanning operation, determining the child nodes of the node corresponding to the current operation and calculating the parallelism of the current operation according to the parallelism of the operations corresponding to those child nodes, or according to that parallelism together with the parallelism influence factor of the current operation; and returning to take, according to the post-order traversal order, the operation corresponding to the next node as the current operation until all operations in the execution plan tree have been processed.
Further, after calculating the parallelism of the operations corresponding to the nodes in the execution plan tree, the method for determining the parallelism further includes:
and adjusting the parallelism of the operation corresponding to each node in the execution plan tree according to the preset parameters of the system to obtain the final parallelism of the operation corresponding to each node in the execution plan tree.
As shown in fig. 3, the method of this embodiment specifically includes:
s310, obtaining an execution plan tree of the distributed computing task, wherein a root node of the execution plan tree corresponds to output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to table scanning operation in the execution plan.
And S320, respectively determining initial parallelism influence factors of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task.
Taking the execution plan tree shown in Fig. 2 as an example, the initial parallelism influence factor of the operation corresponding to each node in the execution plan tree is determined, i.e., the initial parallelism influence factor of the operation on each node, such as the table scan (ts1:ss), table scan (ts2:i), and table scan (ts3:sr) operations.
As an optional implementation manner of this embodiment, determining an initial parallelism influence factor of an operation corresponding to each node in an execution plan tree according to a preset cost model and data statistical information of the distributed computation task includes at least one of the following:
if the operation belongs to the first operation type, calculating an initial parallelism influence factor of the operation according to the cost model and the data statistical information of the distributed calculation task;
if the operation belongs to a second operation type, taking a reference influence factor determined according to the cost model and the data statistical information of the distributed computing task as an initial parallelism influence factor of the operation;
and if the operation belongs to the third operation type, determining a child node of the node corresponding to the operation, and taking the initial parallelism influence factor of the operation corresponding to the child node as the initial parallelism influence factor of the operation.
For the first operation type, the initial parallelism influence factor of the operation is calculated according to the cost model and the data statistical information of the distributed computing task. The first operation type comprises at least one of: a filtering operation and a pre-aggregation operation.
First, basic statistical information about the data involved in the distributed computing task needs to be collected in advance. Typical basic statistics include table type, partition and bucket information, table size, number of rows, column minimum and maximum values, the number of distinct values per column (NDV), and so on.
Denote by IPIF(f1) the initial parallelism influence factor of the filtering (f1) operation. IPIF(f1) can be calculated from the cost model and the data statistics of the distributed computing task, and its value is the selectivity of the filtering (f1) operation, which can be estimated from the pre-collected statistics and the preset cost model. Taking equality filtering on a single column c1 as an example, Selectivity(c1) = 1/NDV(c1), where NDV(c1) is the number of distinct values in column c1.
The initial parallelism influence factor of the pre-aggregation (pg1) operation may be determined from the aggregation rate of the aggregated fields, which is likewise estimated from the statistics and the cost model.
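As a rough illustration, the snippet below estimates these two kinds of influence factors from column statistics; the estimation formulas and names are assumptions of the sketch, not the patent's cost model.

def filter_ipif_equality(ndv: int) -> float:
    """Selectivity of an equality predicate on a single column: 1 / NDV."""
    return 1.0 / max(ndv, 1)

def preaggregation_ipif(group_by_ndv: int, row_count: int) -> float:
    """A simple aggregation-rate estimate: distinct groups divided by input rows."""
    return min(1.0, group_by_ndv / max(row_count, 1))

# e.g. filtering on a column with 1,000 distinct values,
# and grouping 10,000,000 rows into about 50,000 groups
print(filter_ipif_equality(1_000))              # 0.001
print(preaggregation_ipif(50_000, 10_000_000))  # 0.005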
For the second operation type, a reference influence factor determined according to the cost model and the data statistics of the distributed computing task may be used as the initial parallelism influence factor of the operation. The second operation type comprises at least one of: a table scanning operation, a join operation, an aggregation operation, a union operation, an intersection operation, and a difference operation.
That is, the initial parallelism influence factors of operations such as the table scan (ts1:ss), table scan (ts2:i), table scan (ts3:sr), join (cj1), and aggregation (g1) operations in Fig. 2 are set to the reference influence factor. For example, if the reference influence factor determined according to the cost model and the data statistics of the distributed computing task is 1.0, the initial parallelism influence factor of these operations is 1.0.
For the third operation type, the child node of the node corresponding to the operation is determined, and the initial parallelism influence factor of the operation corresponding to that child node is taken as the initial parallelism influence factor of the operation. The third operation type comprises at least one of: a broadcast join operation, a projection operation, a sort operation, a data redistribution operation, and a default operation.
Taking the projection (pj4) operation as an example, the child node of its node is the node of the join (cj1) operation, so IPIF(pj4) = IPIF(cj1). Taking the projection (pj3:sr) operation as an example, the child node of its node is the node of the table scan (ts3:sr) operation, so IPIF(pj3:sr) = IPIF(ts3:sr) = the reference influence factor.
Taking the broadcast join (mj1) operation as an example, its broadcast-table child node is the node of the projection (pj2) operation and its fact-table child node is the node of the projection (pj1) operation; the initial parallelism influence factor of the projection (pj2) operation on the broadcast-table child node is taken as the initial parallelism influence factor of the broadcast join (mj1) operation, i.e., IPIF(mj1) = IPIF(broadcast table) = IPIF(pj2).
TABLE 1. Example of a parallelism decision model for generating IPIF
Specifically, a parallelism decision model for generating initial parallelism influence factors, such as the one shown in Table 1, may be established in advance, and the initial parallelism influence factor of each operation in the execution plan tree is determined according to this model.
In this technical scheme, the initial parallelism influence factor of each operation is calculated by the parallelism decision model from the pre-collected statistics and the preset cost model, and represents the scale by which the operation of the current node affects the parallelism of the operations of subsequent nodes. Because a parallelism influence factor is not an absolute value, the problems of having to preset parameters and of inaccurate estimation that arise in absolute-value-based parallelism control are avoided.
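The sketch below illustrates one way such a decision model could dispatch on operation type to assign each node its IPIF. The grouping into three types and the reference factor of 1.0 follow the description above, while the dict-based node representation, helper names, and traversal order are assumptions of the sketch.

from typing import Any, Callable, Dict

REFERENCE_IPIF = 1.0  # reference influence factor for scans, joins, aggregations, ...

FIRST_TYPE = {"filter", "pre_aggregation"}                       # computed from statistics
SECOND_TYPE = {"table_scan", "join", "aggregation",
               "union", "intersection", "difference"}            # reference factor
THIRD_TYPE = {"broadcast_join", "projection", "sort",
              "shuffle", "default"}                              # inherit from child

def assign_ipif(node: Dict[str, Any],
                estimate_from_stats: Callable[[Dict[str, Any]], float]) -> float:
    """Post-order pass that fills node["ipif"] for every node (nodes are plain dicts)."""
    for child in node.get("children", []):
        assign_ipif(child, estimate_from_stats)
    op = node["op_type"]
    if op in FIRST_TYPE:
        node["ipif"] = estimate_from_stats(node)
    elif op in SECOND_TYPE:
        node["ipif"] = REFERENCE_IPIF
    else:  # THIRD_TYPE and anything unrecognised: take the (first) child's IPIF
        children = node.get("children", [])
        node["ipif"] = children[0]["ipif"] if children else REFERENCE_IPIF
    return node["ipif"]

# tiny example: a filter over a table scan, equality predicate with NDV = 100
scan = {"op_type": "table_scan", "children": []}
filt = {"op_type": "filter", "children": [scan]}
assign_ipif(filt, lambda n: 1.0 / 100)
print(scan["ipif"], filt["ipif"])  # 1.0 0.01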
And S330, fitting each initial parallelism influence factor according to a preset mapping rule to obtain the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
After the initial parallelism influence factors are obtained, they are fitted to generate the parallelism influence factors PIF. This smooths the gap between the numbers of mappers and reducers and prevents the parallelism from being amplified or reduced excessively.
For example, the initial parallelism influence factor IPIF(f2) is the selectivity of the filtering (f2) operation; if filtering (f2) is an equality filter on column c1, then IPIF(f2) = 1/NDV(c1). When NDV(c1) is large, without fitting the number of reducers would shrink to 1/NDV(c1) of the number of maps, resulting in too few reducers.
As an optional implementation of this embodiment, fitting each initial parallelism influence factor according to a preset mapping rule to obtain the parallelism influence factor of the operation corresponding to each node may specifically be: adjusting each initial parallelism influence factor into a preset interval according to the preset mapping rule, and taking each adjusted result as the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
That is, the fitting may be implemented through mapping rules. For example, minIPIF and maxIPIF are set to define the minimum and maximum values of IPIF, the preset interval is [minIPIF, maxIPIF], and the corresponding mapping rules are shown in Table 2. The parallelism influence factor of each operation in the execution plan tree of Fig. 2 is shown in Fig. 4; taking "filter (f2) PIF(0.1)" in Fig. 4 as an example, it means that the parallelism influence factor PIF of the filtering (f2) operation is 0.1.
TABLE 2. Mapping rules for fitting the initial parallelism influence factor

Rule    Condition                     Resulting PIF
R1      IPIF <= minIPIF               minIPIF
R2      minIPIF < IPIF <= maxIPIF     IPIF
R3      IPIF > maxIPIF                maxIPIF
The method for fitting the initial parallelism influence factor is only an optional implementation manner in this embodiment, and other methods may also be used to fit the initial parallelism influence factor, which is not specifically limited in this embodiment.
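A minimal sketch of rules R1 to R3 as a clamping function; the excerpt does not give concrete values for minIPIF and maxIPIF, so the bounds 0.1 and 10.0 below are illustrative assumptions.

def fit_pif(ipif: float, min_ipif: float = 0.1, max_ipif: float = 10.0) -> float:
    """Apply mapping rules R1-R3: clamp IPIF into [minIPIF, maxIPIF]."""
    if ipif <= min_ipif:   # R1
        return min_ipif
    if ipif <= max_ipif:   # R2
        return ipif
    return max_ipif        # R3

print(fit_pif(0.001))  # 0.1, so a very selective filter no longer collapses parallelism
print(fit_pif(1.0))    # 1.0
print(fit_pif(50.0))   # 10.0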
And S340, determining the initial parallelism of the table scanning operation according to the preset cost model and the data statistical information of the distributed computing task.
Specifically, the initial parallelism of the table scanning operation can be estimated from the preset cost model and the data volume processed at the mapping end in the distributed computing task.
Assuming that the output of the cost model is cost and the data volume processed by a single map task is dataSizePerMap, the initial parallelism IP of the table scanning operation can be calculated as IP = cost / dataSizePerMap.
And S350, following the post-order traversal order, sequentially taking the operation corresponding to the next node as the current operation and judging whether it is a table scanning operation; if so, executing S360, and if not, executing S370.
After the initial parallelism of the table scanning operation is determined, the execution plan tree is traversed in post-order and the parallelism of the operation on each node is determined. The parallelism of a table scanning operation is directly related to the initial parallelism, whereas the parallelism of a non-scan operation is not, so the two cases are distinguished.
And S360, calculating the parallelism of the current operation according to the initial parallelism of the table scanning operation and the parallelism influence factor of the current operation, and executing S380.
Specifically, the parallelism of a table scanning operation is the product of the initial parallelism of the table scanning operation and the parallelism influence factor of that table scanning operation.
S370, determining the child nodes of the node corresponding to the current operation, calculating the parallelism of the current operation according to the parallelism of the operations corresponding to the child nodes, or according to that parallelism together with the parallelism influence factor of the current operation, and executing S380.
As an optional implementation manner of this embodiment, the calculating the parallelism of the current operation according to the parallelism of the operation corresponding to the child node, or according to the parallelism of the operation corresponding to the child node and the parallelism influence factor of the current operation specifically includes:
if the current operation belongs to the fourth operation type, taking the product result of the parallelism of the operation corresponding to the child node and the parallelism influence factor of the current operation as the parallelism of the current operation;
if the current operation belongs to the fifth operation type, taking the parallelism of the operation corresponding to the child node as the parallelism of the current operation;
and if the current operation belongs to the sixth operation type, taking the accumulated sum result of the parallelism of all the operations corresponding to all the child nodes or the maximum parallelism value in the parallelism of all the operations corresponding to all the child nodes as the parallelism of the current operation.
For the fourth operation type, the product of the parallelism of the operation corresponding to the child node and the parallelism influence factor of the current operation is taken as the parallelism of the current operation. The fourth operation type comprises at least one of: a filtering operation, a broadcast join operation, and a pre-aggregation operation.
Referring to Fig. 2, if the current operation is the filtering (f1) operation, the operation on the child node of its node is the table scan (ts1:ss) operation, and the product of the parallelism of the table scan (ts1:ss) operation and the parallelism influence factor of the filtering (f1) operation is taken as the parallelism of the filtering (f1) operation. Denoting the parallelism of the filtering (f1) operation by P(f1), P(f1) = P(ts1:ss) × PIF(f1).
If the current operation is the broadcast join (mj1) operation, its broadcast-table child node is the node of the projection (pj2) operation and its fact-table child node is the node of the projection (pj1) operation; the product of the parallelism of the projection (pj1) operation on the fact-table child node and the parallelism influence factor of the broadcast join (mj1) operation is taken as the parallelism of the broadcast join (mj1) operation, i.e., P(mj1) = P(pj1) × PIF(mj1).
For the fifth operation type, the parallelism of the operation corresponding to the child node is taken as the parallelism of the current operation. The fifth operation type comprises at least one of: an aggregation operation, a projection operation, a sort operation, and a data redistribution operation.
For example, if the current operation is the aggregation (g1) operation, the operation on the child node of its node is the data redistribution (rs3) operation, so the parallelism of the aggregation (g1) operation equals that of the data redistribution (rs3) operation: P(g1) = P(rs3).
For the sixth operation type, the sum of the parallelisms of the operations corresponding to all child nodes, or the maximum among those parallelisms, is taken as the parallelism of the current operation. The sixth operation type comprises at least one of: a join operation, an intersection operation, a difference operation, and a default operation.
For example, if the current operation is the join (cj1) operation, the operations on its child nodes are the data redistribution (rs1) and data redistribution (rs2) operations, and the maximum of their parallelisms is taken as the parallelism of the join (cj1) operation: P(cj1) = MAX(P(rs1), P(rs2)).
Specifically, a parallelism decision model for generating the parallelism, such as the one shown in Table 3, may be constructed in advance, and the parallelism of each operation in the execution plan tree is determined according to this model. The parallelism of each operation in the execution plan tree of Fig. 2, computed according to this decision model, is shown in Fig. 5; taking "filter (f2) PIF(0.1) P(1)" in Fig. 5 as an example, it indicates that the parallelism influence factor PIF of the filtering (f2) operation is 0.1 and its parallelism P is 1.
TABLE 3. Example of a parallelism decision model for generating parallelism
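As an illustration of S350 to S370, the sketch below combines the rules for the table scan and the fourth, fifth, and sixth operation types into a single post-order pass. The dict-based nodes, the handling of the broadcast join via its first child, and the rounding are simplifying assumptions of the sketch rather than details given in the patent.

from typing import Any, Dict

FOURTH_TYPE = {"filter", "broadcast_join", "pre_aggregation"}  # child P x own PIF
FIFTH_TYPE = {"aggregation", "projection", "sort", "shuffle"}  # child P
SIXTH_TYPE = {"join", "intersection", "difference"}            # max (or sum) over children

def compute_parallelism(node: Dict[str, Any], initial_scan_parallelism: int) -> int:
    """Post-order pass that fills node["p"]; node["pif"] is assumed to be pre-filled."""
    for child in node.get("children", []):
        compute_parallelism(child, initial_scan_parallelism)
    op = node["op_type"]
    if op == "table_scan":
        p = initial_scan_parallelism * node["pif"]
    elif op in FOURTH_TYPE:
        p = node["children"][0]["p"] * node["pif"]
    elif op in FIFTH_TYPE:
        p = node["children"][0]["p"]
    else:  # SIXTH_TYPE and default: take the maximum over all children
        p = max((child["p"] for child in node["children"]), default=1)
    node["p"] = max(1, round(p))
    return node["p"]

# tiny example: aggregation over a filter over a scan, with IP = 20
scan = {"op_type": "table_scan", "children": [], "pif": 1.0}
filt = {"op_type": "filter", "children": [scan], "pif": 0.1}
agg = {"op_type": "aggregation", "children": [filt], "pif": 1.0}
print(compute_parallelism(agg, 20))  # prints 2 (scan: 20, filter: 2, aggregation: 2)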
And S380, judging whether the processing of all the operations in the execution plan tree is finished or not, if not, returning to execute S350, and if so, executing S390.
The parallelism of each operation in the execution plan tree is determined in sequence according to the above method until all operations have been processed.
And S390, adjusting the parallelism of the operation corresponding to each node in the execution plan tree according to the preset parameters of the system to obtain the final parallelism of the operation corresponding to each node in the execution plan tree.
Further, after the parallelism is determined, the final parallelism of each operation in the execution plan tree can be determined in combination with the system environment and preset parameters. These preset parameters may include a fixed system parallelism for specific situations, a minimum system parallelism (SPMIN), and so on. For example, if the system presets a minimum parallelism SPMIN, the final parallelism should be MAX(P, SPMIN).
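For completeness, a one-function sketch of this final adjustment; modeling the fixed parallelism and SPMIN as keyword arguments is an assumption of the sketch.

from typing import Optional

def finalize_parallelism(p: int, sp_min: int = 1, sp_fixed: Optional[int] = None) -> int:
    """Apply system presets: an optional fixed parallelism takes precedence;
    otherwise the computed parallelism is raised to at least the minimum SPMIN."""
    if sp_fixed is not None:
        return sp_fixed
    return max(p, sp_min)

print(finalize_parallelism(2, sp_min=4))  # 4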
In summary, after the parallelism decision model is built according to the parallelism determination method provided by this embodiment of the invention, an execution plan carrying the optimized parallelism can be produced by taking the pre-collected data statistics of the distributed computing task, the execution plan of the task, and the preset cost model as input.
In this technical scheme, the parallelism influence factor PIF of each kind of task is calculated from data statistics such as selectivity and aggregation rate together with the preset cost model, and used as a scale factor measuring the influence on the parallelism of subsequent operations; the final parallelism P of each task is then calculated from the PIF information. Parallelism controlled in this way avoids the drawbacks of fixed parallelism and of control schemes based on absolute values or reduce-side merging, and improves the performance, stability, and usability of the distributed computing engine.
Embodiment Three
Fig. 6 is a schematic structural diagram of a parallelism determining apparatus according to a third embodiment of the present invention, which is applicable to parallelism control in a distributed computing task, and the apparatus may be implemented in software and/or hardware, and may be generally integrated in a processor.
As shown in Fig. 6, the parallelism determination apparatus specifically includes: an execution plan tree acquisition module 610, a parallelism influence factor determination module 620, an initial parallelism determination module 630 for the table scanning operation, and a parallelism determination module 640. Wherein,
an execution plan tree acquisition module 610, configured to obtain an execution plan tree of a distributed computing task, where a root node of the execution plan tree corresponds to an output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scanning operation in the execution plan;
a parallelism influence factor determining module 620, configured to determine parallelism influence factors of operations corresponding to the nodes in the execution plan tree according to a preset cost model and the data statistics information of the distributed computation task;
an initial parallelism determining module 630 of the table scanning operation, configured to determine an initial parallelism of the table scanning operation according to a preset cost model and data statistics information of the distributed computing task;
and the parallelism determination module 640 is configured to calculate, according to the initial parallelism of the table scanning operation, the post-order traversal order, and the parallelism influence factor of the operation corresponding to each node in the execution plan tree, the parallelism of the operation corresponding to each node in the execution plan tree.
The parallelism determination device of this embodiment first obtains an execution plan tree of a distributed computing task; it then determines the parallelism influence factors, which indicate how the operation corresponding to each node affects the parallelism of subsequent operations, from a preset cost model and the data statistical information of the distributed computing task, determines the initial parallelism of the table scanning operations in the execution plan tree from the same cost model and statistics, and finally determines the parallelism of the operation corresponding to each node from the initial parallelism of the table scanning operations and the parallelism influence factors, thereby obtaining a parallelism control scheme for the distributed computing task. This avoids the drawbacks of fixed parallelism and of control schemes based on absolute values or reduce-side merging, improves the performance, stability, and usability of the distributed computing engine, and achieves adaptive parallelism control.
Further, the parallelism influence factor determining module 620 specifically includes: an initial parallelism influence factor determining unit and an initial parallelism influence factor fitting unit, wherein,
an initial parallelism factor determining unit, configured to determine initial parallelism factors of operations corresponding to the nodes in the execution plan tree according to a preset cost model and the data statistics information of the distributed computation task;
and the initial parallelism influence factor fitting unit is used for fitting each initial parallelism influence factor according to a preset mapping rule to obtain the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
Further, the initial parallelism factor-determining unit includes at least one of:
if the operation belongs to a first operation type, calculating an initial parallelism influence factor of the operation according to the cost model and the data statistical information of the distributed calculation task;
if the operation belongs to a second operation type, taking a reference influence factor determined according to the cost model and the data statistical information of the distributed computing task as an initial parallelism influence factor of the operation;
and if the operation belongs to a third operation type, determining a child node of the node corresponding to the operation, and taking the initial parallelism influence factor of the operation corresponding to the child node as the initial parallelism influence factor of the operation.
Specifically, the first operation type includes at least one of the following: a filtering operation and a pre-aggregation operation;
the second operation type includes at least one of: a table scanning operation, a join operation, an aggregation operation, a union operation, an intersection operation, and a difference operation;
the third operation type includes at least one of: a broadcast join operation, a projection operation, a sort operation, a data redistribution operation, and a default operation.
Further, the initial parallelism influence factor fitting unit is specifically configured to adjust each initial parallelism influence factor to a preset interval range according to a preset mapping rule, and use each adjusted result as the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
Further, the parallelism determining module 640 specifically includes:
the acquisition unit is used for sequentially taking, according to the post-order traversal order, the operation corresponding to one node as the current operation;
a first calculating unit, configured to calculate, if the current operation is a table-scanning operation, a parallelism of the current operation according to an initial parallelism of the table-scanning operation and a parallelism influence factor of the current operation;
a second calculating unit, configured to determine a child node of the node corresponding to the current operation if the current operation is a non-table-scanning operation, and calculate a parallelism of the current operation according to a parallelism of the operation corresponding to the child node, or according to the parallelism of the operation corresponding to the child node and a parallelism influence factor of the current operation;
and the loop unit is used for returning to take, according to the post-order traversal order, the operation corresponding to the next node as the current operation until the processing of all operations in the execution plan tree is completed.
Further, the second calculating unit specifically includes: a second compute first subunit, a second compute second subunit, and a second compute third subunit, wherein,
a second calculating first subunit, configured to, if the current operation belongs to a fourth operation type, take a result of a product of a parallelism of an operation corresponding to the child node and a parallelism influence factor of the current operation as a parallelism of the current operation;
a second calculating second subunit, configured to, if the current operation belongs to a fifth operation type, take a parallelism of operations corresponding to the child nodes as a parallelism of the current operation;
and a second calculation third subunit, configured to, if the current operation belongs to a sixth operation type, take, as the parallelism of the current operation, an accumulated sum result of the parallelism of all the operations corresponding to all the child nodes, or a maximum parallelism value among the parallelism of all the operations corresponding to all the child nodes.
Further, the fourth operation type includes at least one of: a filtering operation, a broadcast join operation, and a pre-aggregation operation;
the fifth operation type includes at least one of: aggregation operation, projection operation, sorting operation and data redistribution operation;
the sixth operation type includes at least one of: a join operation, an intersect operation, a difference operation, and a default operation.
Further, the initial parallelism determining module 630 of the table scanning operation is specifically configured to calculate the initial parallelism of the table scanning operation according to a preset cost model and the data processing amount of the mapping end in the distributed computing task.
Further, the apparatus for determining parallelism may further include: and the parallelism adjusting module is used for adjusting the parallelism of the operation corresponding to each node in the execution plan tree according to a system preset parameter after respectively calculating the parallelism of the operation corresponding to each node in the execution plan tree, so as to obtain the final parallelism of the operation corresponding to each node in the execution plan tree.
The parallelism determining device can execute the parallelism determining method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the parallelism determining method.
Example four
Fig. 7 is a schematic diagram of the hardware structure of an apparatus according to a fourth embodiment of the present invention. As shown in Fig. 7, the apparatus includes:
one or more processors 710, one processor 710 being taken as an example in Fig. 7;
a memory 720;
the apparatus may further include: an input device 730 and an output device 740.
The processor 710, the memory 720, the input device 730 and the output device 740 of the apparatus may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 7.
The memory 720, as a non-transitory computer-readable storage medium, may be used for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the parallelism determination method in the embodiment of the present invention (for example, the execution plan tree acquisition module 610, the parallelism influence factor determination module 620, the table scanning operation initial parallelism determination module 630, and the parallelism determination module 640 shown in Fig. 6). The processor 710 executes the various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 720, that is, implements the parallelism determination method of the above-described method embodiments.
The memory 720 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the computer device, and the like. Further, the memory 720 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 720 may optionally include memory located remotely from the processor 710, and such remote memory may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus. The output device 740 may include a display device such as a display screen.
Example five
An embodiment of the present invention further provides a storage medium containing computer-executable instructions, which when executed by a computer processor, perform a method for determining parallelism, the method including:
acquiring an execution plan tree of a distributed computing task, wherein a root node of the execution plan tree corresponds to an output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scanning operation in the execution plan;
respectively determining parallelism influence factors of the operations corresponding to the nodes in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task;
determining the initial parallelism of the table scanning operation according to the preset cost model and the data statistical information of the distributed computing task;
and respectively calculating, in post-order traversal order, the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation and the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
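Tying these steps together, an end-to-end sketch under the same illustrative assumptions could look as follows; fit_factor stands in for the preset mapping rule that maps each initial influence factor into a preset interval, and initial_scan_parallelism and compute_parallelism are the hypothetical helpers sketched in the earlier embodiments, not functions defined by the patent.

```python
def fit_factor(initial_factor, lower=0.5, upper=2.0):
    """Map an initial parallelism influence factor into a preset interval
    (a simple clamp; the actual mapping rule is not fixed by this text)."""
    return min(max(initial_factor, lower), upper)

def plan_parallelism(root, map_input_bytes):
    """End-to-end sketch: fit each node's influence factor, derive the initial
    table scan parallelism, then fill in per-node parallelism in post-order."""
    stack = [root]
    while stack:                                    # fit factors for every node
        node = stack.pop()
        node.factor = fit_factor(node.factor)
        stack.extend(node.children)
    p0 = initial_scan_parallelism(map_input_bytes)  # initial parallelism of the table scan
    compute_parallelism(root, p0)                   # post-order computation over the plan tree
    return root
```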
Optionally, when executed by a computer processor, the computer-executable instructions may also be used to implement the technical solution of the parallelism determination method provided by any embodiment of the present invention.
From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the apparatus for determining parallelism, the units and modules included in the apparatus are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (11)

1. A method for determining parallelism, comprising:
acquiring an execution plan tree of a distributed computing task, wherein a root node of the execution plan tree corresponds to an output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scanning operation in the execution plan;
respectively determining parallelism influence factors of the operations corresponding to the nodes in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task; the respectively determining parallelism influence factors of the operations corresponding to the nodes in the execution plan tree according to the preset cost model and the data statistical information of the distributed computing task includes:
respectively determining initial parallelism influence factors of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task;
fitting each initial parallelism influence factor according to a preset mapping rule to obtain the parallelism influence factor of the operation corresponding to each node in the execution plan tree;
determining the initial parallelism of the table scanning operation according to a preset cost model and the data statistical information of the distributed computing task; the determining the initial parallelism of the table scanning operation according to the preset cost model and the data statistical information of the distributed computing task specifically includes:
calculating the initial parallelism of the table scanning operation according to the preset cost model and the amount of data to be processed at the map side in the distributed computing task;
and respectively calculating, in post-order traversal order, the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation and the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
2. The method of claim 1, wherein determining an initial parallelism influence factor of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task comprises at least one of:
if the operation belongs to a first operation type, calculating an initial parallelism influence factor of the operation according to the cost model and the data statistical information of the distributed computing task;
if the operation belongs to a second operation type, taking a reference influence factor determined according to the cost model and the data statistical information of the distributed computing task as an initial parallelism influence factor of the operation;
and if the operation belongs to a third operation type, determining a child node of the node corresponding to the operation, and taking the initial parallelism influence factor of the operation corresponding to the child node as the initial parallelism influence factor of the operation.
3. The method of claim 2, wherein the first operation type comprises at least one of: a filtering operation and a pre-aggregation operation;
the second operation type includes at least one of: a table scanning operation, a join operation, an aggregation operation, a union operation, an intersection operation and a difference operation;
the third operation type includes at least one of: broadcast join operations, projection operations, sorting operations, data redistribution operations, and default operations.
4. The method according to claim 1, wherein the fitting each initial parallelism influence factor according to a preset mapping rule to obtain parallelism influence factors of operations corresponding to each node in the execution plan tree comprises:
and adjusting each initial parallelism influence factor to a preset interval range according to a preset mapping rule, and taking each adjusted result as the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
5. The method of claim 1, wherein the respectively calculating, in post-order traversal order, the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation and the parallelism influence factor of the operation corresponding to each node in the execution plan tree comprises:
sequentially acquiring, in post-order traversal order, the operation corresponding to one node as the current operation;
if the current operation is a table scanning operation, calculating the parallelism of the current operation according to the initial parallelism of the table scanning operation and the parallelism influence factor of the current operation;
if the current operation is non-table-scanning operation, determining a child node of a node corresponding to the current operation, and calculating the parallelism of the current operation according to the parallelism of the operation corresponding to the child node or the parallelism of the operation corresponding to the child node and the parallelism influence factor of the current operation;
and returning to the step of sequentially acquiring, in post-order traversal order, the operation corresponding to one node as the current operation, until all the operations in the execution plan tree have been processed.
6. The method according to claim 5, wherein the calculating the parallelism of the current operation according to the parallelism of the operations corresponding to the child nodes, or according to the parallelism of the operations corresponding to the child nodes and the parallelism influence factor of the current operation comprises:
if the current operation belongs to a fourth operation type, taking the product of the parallelism of the operation corresponding to the child node and the parallelism influence factor of the current operation as the parallelism of the current operation;
if the current operation belongs to a fifth operation type, taking the parallelism of the operation corresponding to the child node as the parallelism of the current operation;
and if the current operation belongs to a sixth operation type, taking, as the parallelism of the current operation, either the sum of the parallelisms of the operations corresponding to all the child nodes or the maximum parallelism among them.
7. The method of claim 6, wherein the fourth operation type comprises at least one of: a filtering operation, a broadcast join operation and a pre-aggregation operation;
the fifth operation type includes at least one of: an aggregation operation, a projection operation, a sorting operation and a data redistribution operation;
the sixth operation type includes at least one of: a join operation, an intersection operation, a difference operation, and a default operation.
8. The method according to any of claims 1-7, further comprising, after separately calculating parallelism of operations corresponding to nodes in the execution plan tree:
and adjusting the parallelism of the operation corresponding to each node in the execution plan tree according to a preset parameter of the system to obtain the final parallelism of the operation corresponding to each node in the execution plan tree.
9. A parallelism determination apparatus, comprising:
the execution plan tree obtaining module is used for obtaining an execution plan tree of a distributed computing task, wherein a root node of the execution plan tree corresponds to an output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scanning operation in the execution plan;
the parallelism factor determining module is used for respectively determining the parallelism factor of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task; the parallelism influence factor determining module specifically includes: an initial parallelism influence factor determining unit and an initial parallelism influence factor fitting unit, wherein,
the initial parallelism influence factor determining unit is used for respectively determining initial parallelism influence factors of the operation corresponding to each node in the execution plan tree according to a preset cost model and the data statistical information of the distributed computing task;
the initial parallelism factor fitting unit is used for fitting each initial parallelism factor according to a preset mapping rule to obtain the parallelism factor corresponding to each node in the execution plan tree;
the table scanning operation initial parallelism determining module is used for determining the initial parallelism of the table scanning operation according to a preset cost model and the data statistical information of the distributed computing task; the table scanning operation initial parallelism determining module is specifically used for calculating the initial parallelism of the table scanning operation according to the preset cost model and the amount of data to be processed at the map side in the distributed computing task;
and the parallelism determining module is used for respectively calculating, in post-order traversal order, the parallelism of the operation corresponding to each node in the execution plan tree according to the initial parallelism of the table scanning operation and the parallelism influence factor of the operation corresponding to each node in the execution plan tree.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1-8 when executing the program.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN201811436295.1A 2018-11-28 2018-11-28 Determination method, apparatus, equipment and the medium of degree of parallelism Active CN109558232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811436295.1A CN109558232B (en) 2018-11-28 2018-11-28 Determination method, apparatus, equipment and the medium of degree of parallelism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811436295.1A CN109558232B (en) 2018-11-28 2018-11-28 Determination method, apparatus, equipment and the medium of degree of parallelism

Publications (2)

Publication Number Publication Date
CN109558232A CN109558232A (en) 2019-04-02
CN109558232B true CN109558232B (en) 2019-08-23

Family

ID=65867926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811436295.1A Active CN109558232B (en) 2018-11-28 2018-11-28 Determination method, apparatus, equipment and the medium of degree of parallelism

Country Status (1)

Country Link
CN (1) CN109558232B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112187285B (en) * 2020-09-18 2024-02-27 南京中科晶上通信技术有限公司 Barrel shifter processing method based on DVB-S2 decoder and barrel shifter
CN113535354A (en) * 2021-06-30 2021-10-22 深圳市云网万店电子商务有限公司 Method and device for adjusting parallelism of Flink SQL operator
CN117891834A (en) * 2022-10-14 2024-04-16 华为技术有限公司 Database query method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514229A (en) * 2012-06-29 2014-01-15 国际商业机器公司 Method and device used for processing database data in distributed database system
CN107025273A (en) * 2017-03-17 2017-08-08 南方电网科学研究院有限责任公司 Data query optimization method and device
CN108319722A (en) * 2018-02-27 2018-07-24 北京小度信息科技有限公司 Data access method, device, electronic equipment and computer readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9152466B2 (en) * 2013-03-13 2015-10-06 Barracuda Networks, Inc. Organizing file events by their hierarchical paths for multi-threaded synch and parallel access system, apparatus, and method of operation
US9495204B2 (en) * 2014-01-06 2016-11-15 International Business Machines Corporation Constructing a logical tree topology in a parallel computer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514229A (en) * 2012-06-29 2014-01-15 国际商业机器公司 Method and device used for processing database data in distributed database system
CN107025273A (en) * 2017-03-17 2017-08-08 南方电网科学研究院有限责任公司 Data query optimization method and device
CN108319722A (en) * 2018-02-27 2018-07-24 北京小度信息科技有限公司 Data access method, device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN109558232A (en) 2019-04-02

Similar Documents

Publication Publication Date Title
CN109558232B (en) Determination method, apparatus, equipment and the medium of degree of parallelism
CN103678520B (en) A kind of multi-dimensional interval query method and its system based on cloud computing
US20170255496A1 (en) Method for scheduling data flow task and apparatus
US8898422B2 (en) Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration
US10042732B2 (en) Dynamic data collection pattern for target device
CN106325756B (en) Data storage method, data calculation method and equipment
CN109189572B (en) Resource estimation method and system, electronic equipment and storage medium
CN111143143A (en) Performance test method and device
CN111512283A (en) Radix estimation in a database
CN112699134A (en) Distributed graph database storage and query method based on graph subdivision
WO2016041126A1 (en) Method and device for processing data stream based on gpu
TWI727639B (en) Method and device for tracing block chain transactions
CN114237908A (en) Resource arrangement optimization method and system for edge computing
CN108932241B (en) Log data statistical method, device and node
CN110471935B (en) Data operation execution method, device, equipment and storage medium
CN113766047B (en) Task grouping method and device, computer equipment and storage medium
US9916182B2 (en) Method and apparatus for allocating stream processing unit
WO2021147815A1 (en) Data calculation method and related device
CN104050291A (en) Parallel processing method and system for account balance data
CN110908803B (en) Operation distribution method based on cosine similarity algorithm
CN111858059A (en) Graph calculation method, device, equipment and storage medium
CN114661563B (en) Data processing method and system based on stream processing framework
CN111930299B (en) Method for distributing storage units and related equipment
CN109901931B (en) Reduction function quantity determination method, device and system
CN114546652A (en) Parameter estimation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee after: Star link information technology (Shanghai) Co.,Ltd.

Address before: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee before: TRANSWARP TECHNOLOGY (SHANGHAI) Co.,Ltd.