CN109558232A - Method, apparatus, device and medium for determining degree of parallelism - Google Patents
- Publication number
- CN109558232A, CN201811436295.1A
- Authority
- CN
- China
- Prior art keywords
- parallelism
- degree
- node
- execution plan
- impact factor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Complex Calculations (AREA)
Abstract
Embodiments of the invention disclose a method, apparatus, device and medium for determining a degree of parallelism. The method comprises: obtaining the execution plan tree of a distributed computing task; determining, according to a preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree; determining, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism of the table scan operations in the execution plan tree; and, starting from the initial degree of parallelism of the table scan operations, computing in post-order traversal order the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factors. The method avoids the drawbacks of prior-art parallelism control schemes, improves the performance, stability and availability of the distributed computing engine, and makes parallelism control adaptive.
Description
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a method, apparatus, device and medium for determining a degree of parallelism.
Background art
A common distributed computing engine framework comprises three phases, map, shuffle and reduce: data is first processed on the map side, then regrouped according to certain rules on the shuffle side, and finally computed on the reduce side. The degree of parallelism determines how many map tasks and how many reduce tasks run simultaneously in the system, and is a key factor affecting the parallel execution efficiency and stability of a distributed system.
A degree of parallelism that is too large or too small harms the execution efficiency and stability of the system. If it is too large, each task processes very little data, so the capacity of each execution unit is not fully used, while resources are tied up, scheduling overhead is high and too many small files are produced. If it is too small, individual tasks become too heavy and put the system's CPU, memory and I/O under pressure, which drags down performance and can even affect system stability; moreover, because the tasks are concentrated on a limited number of child nodes, the other concurrent execution units cannot be used effectively. How to control the degree of parallelism is therefore an important issue in applying distributed computing engines.
Current parallelism schemes for distributed computing engines focus mainly on control of the reduce side and lack control of the map side. The common approaches are as follows:
(1) Fixed parallelism control: a fixed, uniform number of reducers is set to determine the system's default degree of parallelism.
This approach is simple and direct, but it cannot adapt to the different data distributions and computation types of different computing tasks, which affects system performance and stability. When a fixed degree of parallelism has to be set manually for a specific computing task, usability and scalability are also poor.
(2) Parallelism control based on absolute values: the number of reducers N is determined from the estimated total shuffle data volume on the map side, totalShuffleSize, and the preset data volume processed by a single reducer, dataSizePerReducer, where N = totalShuffleSize / dataSizePerReducer. In this approach the number of reducers is computed from absolute values; Hive is a typical system.
This approach requires a preset per-reducer data volume parameter and cannot adapt to the different data distributions and computation types of different computing tasks; in addition, the estimate of the total shuffle data volume as an absolute number is often inaccurate.
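The absolute-value rule in approach (2) can be sketched as follows. This is a minimal illustration only: the function name, the rounding up and the lower bound of one reducer are assumptions, not details taken from Hive or from this document.

```python
import math


def reducer_count(total_shuffle_size: int, data_size_per_reducer: int) -> int:
    """Absolute-value rule: N = totalShuffleSize / dataSizePerReducer.

    Rounding up and the minimum of one reducer are assumptions made here
    so that the result is always a usable task count.
    """
    if data_size_per_reducer <= 0:
        raise ValueError("data_size_per_reducer must be positive")
    return max(1, math.ceil(total_shuffle_size / data_size_per_reducer))


# An estimated 10 GiB of shuffle data with a preset 256 MiB per reducer.
n = reducer_count(10 * 1024**3, 256 * 1024**2)
```

Because N scales linearly with the estimate, any error in totalShuffleSize carries over directly into the number of reducers, which is the weakness noted above.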
(3) Parallelism control based on reduce merging: after the map tasks finish, the data shard information recorded by the system's runtime environment is used to merge partitions, combining multiple small tasks into reduce tasks of a preset size and thereby determining the reducers; certain extensions of Spark are typical systems.
In this approach reducers are formed by merging map partitions, which does not solve problems such as resource occupation and excessive small files in the map phase; it also requires modifying the execution plan dynamically, which is highly invasive to the distributed computing engine framework.
Summary of the invention
Embodiments of the present invention provide a method, apparatus, device and medium for determining a degree of parallelism, which optimize the prior-art parallelism control methods, make parallelism control adaptive, and improve the performance, stability and availability of the distributed computing engine.
In a first aspect, an embodiment of the present invention provides a method for determining a degree of parallelism, comprising:
obtaining the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
determining, according to a preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree;
determining, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism of the table scan operation;
computing, starting from the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factor of the operation corresponding to each node.
In a second aspect, an embodiment of the present invention further provides an apparatus for determining a degree of parallelism, comprising:
an execution plan tree obtaining module, configured to obtain the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
a parallelism impact factor determining module, configured to determine, according to a preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree;
a table scan initial parallelism determining module, configured to determine, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism of the table scan operation;
a parallelism determining module, configured to compute, starting from the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factor of the operation corresponding to each node.
In a third aspect, an embodiment of the present invention further provides a device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method for determining a degree of parallelism provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for determining a degree of parallelism provided by any embodiment of the present invention.
In the method, apparatus, device and medium for determining a degree of parallelism provided by embodiments of the present invention, the execution plan tree of the distributed computing task is obtained first. The parallelism impact factor of the operation corresponding to each node, which indicates that operation's influence on the degree of parallelism of subsequent operations, is then determined according to the preset cost model and the data statistics of the distributed computing task. Next, the initial degree of parallelism of the table scan operations in the execution plan tree is determined according to the preset cost model and the data statistics of the distributed computing task. Finally, the degree of parallelism of the operation corresponding to each node is determined from the initial degree of parallelism of the table scan operations and the parallelism impact factors, which yields the parallelism control scheme for the distributed computing task. This method avoids the drawbacks of fixed parallelism and of parallelism control based on absolute values or reduce merging, improves the performance, stability and availability of the distributed computing engine, and makes parallelism control adaptive.
Brief description of the drawings
Fig. 1 is a flow chart of a method for determining a degree of parallelism in Embodiment 1 of the present invention;
Fig. 2 is an example of an execution plan tree in Embodiment 1 of the present invention;
Fig. 3 is a flow chart of a method for determining a degree of parallelism in Embodiment 2 of the present invention;
Fig. 4 is an example of an execution plan tree for which parallelism impact factors have been generated, in Embodiment 2 of the present invention;
Fig. 5 is an example of an execution plan tree for which degrees of parallelism have been generated, in Embodiment 2 of the present invention;
Fig. 6 is a structural diagram of an apparatus for determining a degree of parallelism in Embodiment 3 of the present invention;
Fig. 7 is a hardware structure diagram of a device in Embodiment 4 of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flow charts. Although a flow chart describes the operations (or steps) as sequential processing, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations can be rearranged. The processing may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. The processing may correspond to a method, function, procedure, subroutine, subprogram, and so on.
Embodiment 1
Fig. 1 is a flow chart of a method for determining a degree of parallelism provided by Embodiment 1 of the present invention. The method is applicable to parallelism control in distributed computing tasks and can be executed by the apparatus for determining a degree of parallelism provided by embodiments of the present invention; the apparatus can be implemented in software and/or hardware and can generally be integrated in a processor. As shown in Fig. 1, the method of this embodiment specifically comprises:
S110: Obtain the execution plan tree of the distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan.
The execution plan of a distributed computing task is usually generated by an SQL compiler and is an abstract description of each stage of the distributed computing task. Take the following SQL statement as an example:
select i_manager_id, i_size, count(sr_ticket_number)
from store_sales ss join item i on ss.ss_item_sk = i.i_item_sk
join store_returns sr on ss.ss_customer_sk = sr.sr_customer_sk
where ss_store_sk > 10 and i_item_id = 'A'
group by i_manager_id, i_size;
The execution plan tree corresponding to the above SQL statement is shown in Fig. 2. It represents a process in which the store_sales table and the item table are each filtered and projected and then combined by a broadcast join, the result is joined with the store_returns table, and finally the aggregation is performed. The broadcast join (mj1) operation involves no shuffle, whereas the join (cj1) operation and the aggregation (g1) operation require a shuffle.
That is, the leaf nodes of the execution plan tree correspond to the table scan operations on the store_sales, item and store_returns tables.
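The tree described above can be pictured with a small data structure. The sketch below is illustrative only: Fig. 2 is not reproduced in the text, so the exact shape of the tree and the node names pj1, f2 and out are assumptions chosen to be consistent with the description (the names ts1:ss, ts2:i, ts3:sr, f1, pj2, pj3:sr, pj4, mj1, cj1 and g1 do appear in this document).

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class PlanNode:
    name: str                      # label such as "ts1:ss" or "cj1"
    op: str                        # operation type, e.g. "table_scan"
    children: List["PlanNode"] = field(default_factory=list)


def leaves(node: PlanNode) -> Iterator[PlanNode]:
    """Yield the leaf nodes of the plan tree from left to right."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


def build_example_plan() -> PlanNode:
    """Assumed layout for the example query; not a copy of Fig. 2."""
    ts1 = PlanNode("ts1:ss", "table_scan")
    f1 = PlanNode("f1", "filter", [ts1])
    pj1 = PlanNode("pj1", "project", [f1])          # hypothetical name
    ts2 = PlanNode("ts2:i", "table_scan")
    f2 = PlanNode("f2", "filter", [ts2])            # hypothetical name
    pj2 = PlanNode("pj2", "project", [f2])
    mj1 = PlanNode("mj1", "broadcast_join", [pj1, pj2])
    ts3 = PlanNode("ts3:sr", "table_scan")
    pj3 = PlanNode("pj3:sr", "project", [ts3])
    cj1 = PlanNode("cj1", "join", [mj1, pj3])
    pj4 = PlanNode("pj4", "project", [cj1])
    g1 = PlanNode("g1", "aggregate", [pj4])
    return PlanNode("out", "output", [g1])          # hypothetical name


root = build_example_plan()
scan_tables = [leaf.name for leaf in leaves(root)]
```

In this layout the root is the output operation and every leaf is a table scan, matching the two structural properties stated in S110.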
S120: Determine, according to the preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree.
The parallelism impact factor is the scale factor by which the current operation affects the degree of parallelism of subsequent operations. For example, in the execution plan tree shown in Fig. 2, the parallelism impact factor of the filter (f1) operation is the scale factor by which it affects the degree of parallelism of the operations that follow it.
In this step, the parallelism impact factor of each operation in the execution plan tree is determined separately; how it is determined depends on the specific operation type.
Specifically, a parallelism decision model that generates parallelism impact factors may be established in advance. This model performs the step, in the method provided by embodiments of the present invention, of determining the parallelism impact factor of each operation according to the preset cost model and the data statistics of the distributed computing task.
S130: Determine, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism of the table scan operation.
Determining the initial degree of parallelism of the table scan operation is the basis of parallelism control for the entire distributed computing task, because the degree of parallelism of the subsequent computing tasks is determined on the basis of the initial degree of parallelism of the table scan operation.
Most existing parallelism schemes determine the degree of parallelism of the table scan operation, that is, of the initial task, from the number of file splits. This may lead to problems such as too many initial tasks, occupation of concurrent computing resources, and too many small files.
In this embodiment, the initial degree of parallelism of the table scan operation is instead determined according to the preset cost model and the data statistics of the distributed computing task.
As an optional implementation of this embodiment, determining the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task may specifically be: computing the initial degree of parallelism of the table scan operation according to the preset cost model and the data volume processed on the map side of the distributed computing task.
That is, the initial degree of parallelism of the table scan operation is estimated from the preset cost model and the map-side data processing volume of the distributed computing task.
Assuming that the output of the cost model is cost and the data volume processed by a single map task is dataSizePerMap, the initial degree of parallelism IP of the table scan operation can be computed as IP = cost / dataSizePerMap. The initial degree of parallelism determined in this way is not a fixed value but varies with the cost model and with the data volume processed by a single map task, so the initial degree of parallelism of the table scan operation is determined adaptively.
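The formula IP = cost / dataSizePerMap can be sketched as a small helper. The function name, the rounding up and the lower bound of one task are assumptions added for illustration; the text only gives the ratio itself and treats cost as a quantity comparable with a data volume.

```python
import math


def initial_parallelism(cost: float, data_size_per_map: float) -> int:
    """Initial degree of parallelism of a table scan: IP = cost / dataSizePerMap.

    `cost` is the output of the cost model and `data_size_per_map` is the
    data volume handled by one map task, both in the same unit.
    Rounding up and the minimum of 1 are assumptions of this sketch.
    """
    if data_size_per_map <= 0:
        raise ValueError("data_size_per_map must be positive")
    return max(1, math.ceil(cost / data_size_per_map))


# A scan whose estimated cost is 1000 units, with 64 units per map task.
ip = initial_parallelism(1000.0, 64.0)
```

Unlike a value derived from the number of file splits, the result changes whenever the cost estimate or the per-map data volume changes.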
Compared with parallelism control based on reduce merging, the above technical solution performs cost-based initial parallelism control on the map side. This makes up for the lack of map-side control in existing parallelism schemes, avoids drawbacks such as system resource occupation and excessive small files, and requires no invasive change to the distributed computing engine.
S140: Starting from the initial degree of parallelism of the table scan operation, compute, in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factor of the operation corresponding to each node.
A post-order traversal of the execution plan tree is performed to determine the degree of parallelism of the operation on each node. Taking the execution plan tree shown in Fig. 2 as an example, the degree of parallelism of each operation is determined in turn, following the direction of the arrows.
Specifically, a parallelism decision model that determines degrees of parallelism may be established in advance. This model performs the step, in the method provided by embodiments of the present invention, of computing in post-order traversal order the degree of parallelism of the operation corresponding to each node, starting from the initial degree of parallelism of the table scan operation and using the parallelism impact factor of the operation corresponding to each node.
The degree of parallelism of a table scan operation depends on its initial degree of parallelism and its parallelism impact factor. For any other operation, the degree of parallelism depends either on its parallelism impact factor together with the degree of parallelism of the operations on the child nodes of its node, or only on the degree of parallelism of the operations on those child nodes.
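The post-order pass of S140 can be sketched as below. The text states which quantities each degree of parallelism depends on but not how they are combined, so the concrete rules used here are assumptions: a scan gets initial parallelism times its factor, any other operation gets the largest child parallelism times its own factor, and results are rounded up with a minimum of 1. The names Node and assign_parallelism are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    is_scan: bool = False
    factor: float = 1.0            # parallelism impact factor of this operation
    children: List["Node"] = field(default_factory=list)


def assign_parallelism(root: Node, initial_scan_dop: Dict[str, int]) -> Dict[str, int]:
    """Return {node name: degree of parallelism}, filled in post-order."""
    dop: Dict[str, int] = {}

    def visit(node: Node) -> None:
        for child in node.children:        # children first: post-order
            visit(child)
        if node.is_scan:
            base = initial_scan_dop[node.name]
        else:
            # Assumed combination rule: start from the largest child value.
            base = max(dop[child.name] for child in node.children)
        dop[node.name] = max(1, math.ceil(base * node.factor))

    visit(root)
    return dop


# scan (IP = 100) -> filter (factor 0.25) -> aggregation (factor 1.0)
ts = Node("ts", is_scan=True)
f = Node("f", factor=0.25, children=[ts])
g = Node("g", children=[f])
result = assign_parallelism(g, {"ts": 100})
```

An operation that depends only on its children, as the last sentence above allows, corresponds in this sketch to a factor of 1.0.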
It is worth noting that the parallelism decision model that generates parallelism impact factors and the parallelism decision model that determines degrees of parallelism, both mentioned above, may be one and the same model. That is, the model first determines the parallelism impact factor of each operation according to the preset cost model and the data statistics of the distributed computing task, as the scale factor measuring that operation's effect on the degree of parallelism of subsequent operations. It then computes the degree of parallelism of each operation in post-order traversal order, starting from the initial degree of parallelism of the table scan operation and using the parallelism impact factor of the operation corresponding to each node in the execution plan tree.
In the method for determining a degree of parallelism provided by this embodiment of the present invention, the execution plan tree of the distributed computing task is obtained first. The parallelism impact factor of the operation corresponding to each node, which indicates that operation's influence on the degree of parallelism of subsequent operations, is then determined according to the preset cost model and the data statistics of the distributed computing task. Next, the initial degree of parallelism of the table scan operations in the execution plan tree is determined according to the preset cost model and the data statistics of the distributed computing task. Finally, the degree of parallelism of the operation corresponding to each node is determined from the initial degree of parallelism of the table scan operations and the parallelism impact factors, which yields the parallelism control scheme for the distributed computing task. This method avoids the drawbacks of fixed parallelism and of parallelism control based on absolute values or reduce merging, improves the performance, stability and availability of the distributed computing engine, and makes parallelism control adaptive.
Embodiment 2
Fig. 3 is a flow chart of a method for determining a degree of parallelism provided by Embodiment 2 of the present invention. This embodiment is a more specific implementation based on the above embodiment, wherein:
Determining, according to the preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree is specifically: determining, according to the preset cost model and the data statistics of the distributed computing task, the initial parallelism impact factor of the operation corresponding to each node in the execution plan tree; and fitting each initial parallelism impact factor according to a preset mapping rule to obtain the parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Computing, starting from the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factors is specifically: obtaining, in post-order traversal order, the operation corresponding to one node as the current operation; if the current operation is a table scan operation, computing the degree of parallelism of the current operation from the initial degree of parallelism of the table scan operation and the parallelism impact factor of the current operation; if the current operation is not a table scan operation, determining the child nodes of the node corresponding to the current operation, and computing the degree of parallelism of the current operation either from the degree of parallelism of the operations corresponding to the child nodes, or from the degree of parallelism of the operations corresponding to the child nodes together with the parallelism impact factor of the current operation; and returning to the step of obtaining, in post-order traversal order, the operation corresponding to one node as the current operation, until all operations in the execution plan tree have been processed.
Further, after the degree of parallelism of the operation corresponding to each node in the execution plan tree has been computed, the method provided by this embodiment further comprises:
adjusting the degree of parallelism of the operation corresponding to each node in the execution plan tree according to preset system parameters, to obtain the final degree of parallelism of the operation corresponding to each node in the execution plan tree.
As shown in Fig. 3, the method of this embodiment specifically comprises:
S310: Obtain the execution plan tree of the distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan.
S320: Determine, according to the preset cost model and the data statistics of the distributed computing task, the initial parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Taking the execution plan tree shown in Fig. 2 as an example, determining the initial parallelism impact factor of the operation corresponding to each node means determining the initial parallelism impact factor of the operation on each node of the execution plan tree, for example of the table scan (ts1:ss), table scan (ts2:i) and table scan (ts3:sr) operations.
As an optional implementation of this embodiment, determining the initial parallelism impact factor of the operation corresponding to each node in the execution plan tree according to the preset cost model and the data statistics of the distributed computing task comprises at least one of the following:
if the operation belongs to the first operation type, computing the initial parallelism impact factor of the operation according to the cost model and the data statistics of the distributed computing task;
if the operation belongs to the second operation type, taking the baseline impact factor determined according to the cost model and the data statistics of the distributed computing task as the initial parallelism impact factor of the operation;
if the operation belongs to the third operation type, determining the child node of the node corresponding to the operation, and taking the initial parallelism impact factor of the operation corresponding to the child node as the initial parallelism impact factor of the operation.
For the first operation type, the initial parallelism impact factor of the operation is computed according to the cost model and the data statistics of the distributed computing task. The first operation type includes at least one of the following: the filter operation and the pre-aggregation operation.
First, basic statistics on the data involved in the distributed computing task need to be collected in advance. Typical basic statistics include the table type, partition and bucket information, table size, number of rows in the table, column maximum and minimum values, and the number of distinct values (NDV) of a column.
Let IPIF(f1) denote the initial parallelism impact factor of the filter (f1) operation. IPIF(f1) can be computed from the cost model and the data statistics of the distributed computing task: its value is taken to be the selectivity of the filter (f1) operation, which can be estimated from the pre-collected statistics and the preset cost model. For an equality filter on a single column c1, Selectivity(c1) = 1 / NDV(c1), where NDV(c1) is the number of distinct values in column c1.
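As a worked illustration of Selectivity(c1) = 1 / NDV(c1), the sketch below derives the initial parallelism impact factor of an equality filter from a column's distinct-value count. The function name and the NDV value of 50 for i_item_id are made up for the example; real values would come from the pre-collected statistics.

```python
def equality_selectivity(ndv: int) -> float:
    """Selectivity of an equality filter on one column: 1 / NDV."""
    if ndv <= 0:
        raise ValueError("NDV must be a positive integer")
    return 1.0 / ndv


# Hypothetical statistic: the column i_item_id has 50 distinct values.
column_ndv = {"i_item_id": 50}

# For the filter i_item_id = 'A', the initial parallelism impact factor
# is taken to be the selectivity of the filter.
ipif_filter = equality_selectivity(column_ndv["i_item_id"])
```

With 50 distinct values the filter is expected to keep about one row in fifty, so the operations after it are scaled down accordingly.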
The initial parallelism impact factor of the pre-aggregation (pg1) operation can be determined from the aggregation rate of the aggregation fields, which is estimated from the statistics and the cost model.
For the second operation type, the baseline impact factor determined according to the cost model and the data statistics of the distributed computing task can be used as the initial degree of parallelism impact factor of the operation. The second operation type includes at least one of the following: table scan operations, join operations, aggregation operations, union operations, intersection operations and difference operations.
That is, the initial degree of parallelism impact factors of the table scan (ts1:ss), table scan (ts2:i), table scan (ts3:sr), join (cj1) and aggregation (g1) operations in Fig. 2 are all set to the baseline impact factor. For example, if the baseline impact factor determined according to the cost model and the data statistics of the distributed computing task is 1.0, the initial degree of parallelism impact factor of each of these operations is 1.0.
For the third operation type, the child node of the node corresponding to the operation is determined first, and the initial degree of parallelism impact factor of the operation corresponding to the child node is used as the initial degree of parallelism impact factor of the operation. The third operation type includes at least one of the following: broadcast join operations, projection operations, sort operations, data redistribution operations and default operations.
Take the projection (pj4) operation as an example: the child node of its node is the node of the join (cj1) operation, so IPIF(pj4) = IPIF(cj1). Take the projection (pj3:sr) operation as another example: the child node of its node is the node of the table scan (ts3:sr) operation, so IPIF(pj3:sr) = IPIF(ts3:sr) = baseline impact factor.
Take the broadcast join (mj1) operation as an example: the broadcast-table child node of its node is the node of the projection (pj2) operation, and the fact-table child node of its node is the node of the projection (pj1) operation. The initial degree of parallelism impact factor of the projection (pj2) operation at the broadcast-table child node is used as the initial degree of parallelism impact factor of the broadcast join (mj1) operation, i.e. IPIF(mj1) = IPIF(broadcast table) = IPIF(pj2).
Table 1: Example degree of parallelism decision model for generating the IPIF
Specifically, a degree of parallelism decision model for generating the initial degree of parallelism impact factor, as shown in Table 1, can be established in advance, and the initial degree of parallelism impact factor of each operation in the execution plan tree is determined according to this decision model.
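The three-way rule described above (compute, use the baseline, or inherit from a child) can be sketched as a small Python function. This is a hedged illustration, not the patent's decision model: the `Node` class, the operation-name strings, the `estimated_rate` field and the convention that a broadcast join lists its broadcast-table child first are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BASELINE = 1.0  # baseline impact factor; the text uses 1.0 as its example

# Hypothetical operation-name strings for the first and second operation types.
FIRST_TYPE = {"filter", "pre_aggregate"}
SECOND_TYPE = {"table_scan", "join", "aggregate", "union", "intersect", "difference"}
# Everything else (broadcast_join, project, sort, redistribute, default)
# is treated as the third operation type.


@dataclass
class Node:
    op: str
    children: List["Node"] = field(default_factory=list)
    # Selectivity or aggregation rate estimated by the cost model (assumed input).
    estimated_rate: Optional[float] = None


def initial_pif(node: Node) -> float:
    """Return the IPIF of the operation at `node`."""
    if node.op in FIRST_TYPE:
        if node.estimated_rate is None:
            raise ValueError(f"{node.op} needs an estimated rate")
        return node.estimated_rate
    if node.op in SECOND_TYPE:
        return BASELINE
    if not node.children:
        raise ValueError(f"{node.op} has no child to inherit from")
    # Third type: inherit from the child. For a broadcast join the
    # broadcast-table child is assumed to be children[0].
    return initial_pif(node.children[0])


scan = Node("table_scan")
filt = Node("filter", [scan], estimated_rate=0.05)
broadcast_side = Node("project", [filt])
fact_side = Node("project", [scan])
mj = Node("broadcast_join", [broadcast_side, fact_side])
print(initial_pif(mj))  # 0.05
```

The broadcast join here takes its IPIF from the broadcast side, mirroring IPIF(mj1) = IPIF(pj2) in the text.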
In the above technical solution, the initial degree of parallelism impact factor of each operation is calculated with the degree of parallelism decision model from the statistics collected in advance and the preset cost model. It represents the scale factor by which the operation at the current node influences the degree of parallelism of the operations at subsequent nodes. Because the degree of parallelism impact factor is a ratio rather than an absolute value, it also avoids the need for preset parameters and the inaccurate estimates that arise when the degree of parallelism is controlled on the basis of absolute values.
S330, fit each initial degree of parallelism impact factor according to a preset mapping rule to obtain the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
After the initial degree of parallelism impact factors are obtained, they still need to be fitted to generate the degree of parallelism impact factors (PIF). Fitting smooths the drop in the ratio between the number of mappers and the number of reducers and prevents the degree of parallelism from being amplified or reduced excessively.
For example, the initial degree of parallelism impact factor IPIF(f2) is the selectivity of filter (f2). Assuming that filter (f2) is an equality filter on column c1, IPIF(f2) = Selectivity = 1/NDV(c1). When NDV(c1) is very large and no fitting is applied, the number of reducers shrinks to 1/NDV(c1) of the number of mappers, leaving too few reducers.
As an optional implementation of this embodiment, fitting each initial degree of parallelism impact factor according to the preset mapping rule to obtain the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree is specifically: adjusting each initial degree of parallelism impact factor into a preset interval according to the preset mapping rule, and using each adjusted result as the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
That is, the fitting can be implemented with a mapping rule. For example, minIPIF and maxIPIF are set to define the minimum and maximum values of the IPIF, so that the preset interval is [minIPIF, maxIPIF]; the corresponding mapping rules can then be as shown in Table 2. The degree of parallelism impact factors of the operations in the execution plan tree of Fig. 2 are shown in Fig. 4. Taking "filter (f2) PIF(0.1)" in Fig. 4 as an example, it indicates that the degree of parallelism impact factor PIF of the filter (f2) operation is 0.1.
Table 2: Mapping rules for fitting the initial degree of parallelism impact factor
Mapping rule | IPIF | PIF |
R1 | IPIF ≤ minIPIF | minIPIF |
R2 | minIPIF < IPIF ≤ maxIPIF | IPIF |
R3 | IPIF > maxIPIF | maxIPIF |
The above method of fitting the initial degree of parallelism impact factor is only one optional implementation of this embodiment. Other methods may also be used to fit the initial degree of parallelism impact factor, and this embodiment places no specific limitation on this.
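The mapping rules R1 to R3 amount to clamping the IPIF into [minIPIF, maxIPIF]. A minimal sketch follows; the function name `fit_ipif` and the bounds 0.1 and 1.0 are hypothetical values chosen for illustration, since the text does not fix minIPIF or maxIPIF.

```python
def fit_ipif(ipif: float, min_ipif: float, max_ipif: float) -> float:
    """Map an IPIF to a PIF using rules R1-R3 of Table 2."""
    if min_ipif > max_ipif:
        raise ValueError("min_ipif must not exceed max_ipif")
    if ipif <= min_ipif:   # R1: too small, raise to the lower bound
        return min_ipif
    if ipif <= max_ipif:   # R2: already inside the interval, keep as-is
        return ipif
    return max_ipif        # R3: too large, cap at the upper bound


# Hypothetical case: an equality filter on a column with 1,000,000 distinct values.
raw_ipif = 1.0 / 1_000_000
print(fit_ipif(raw_ipif, 0.1, 1.0))  # 0.1
```

Without the clamp, the reducer count in this example would be one millionth of the mapper count, which is the "too few reducers" situation the fitting step is meant to prevent.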
S340, determine the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task.
Specifically, the initial degree of parallelism of the table scan operation can be estimated according to the preset cost model and the amount of data processed on the map side of the distributed computing task.
Assuming that the output of the cost model is cost and the amount of data processed by a single map task is dataSizePerMap, the initial degree of parallelism IP of the table scan operation can be calculated with the following formula: IP = cost/dataSizePerMap.
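The formula IP = cost/dataSizePerMap can be sketched as follows. The text does not say how a fractional result is handled, so rounding up and enforcing a floor of 1 are assumptions of this example, as are the function name and the byte sizes used.

```python
import math


def initial_scan_parallelism(cost: float, data_size_per_map: float) -> int:
    """IP = cost / dataSizePerMap, rounded up (rounding is an assumption)."""
    if data_size_per_map <= 0:
        raise ValueError("data_size_per_map must be positive")
    return max(1, math.ceil(cost / data_size_per_map))


# Hypothetical figures: the cost model estimates 10 GiB to scan,
# and each map task handles 256 MiB.
ip = initial_scan_parallelism(10 * 1024**3, 256 * 1024**2)
print(ip)  # 40
```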
S350, in post-order traversal order, successively obtain the operation corresponding to one node as the current operation, and judge whether the current operation is a table scan operation; if so, execute S360; if not, execute S370.
After the initial degree of parallelism of the table scan operation is determined, the execution plan tree is traversed in post-order to determine the degree of parallelism of the operation at each node. The degree of parallelism of a table scan operation is directly related to the initial degree of parallelism of the table scan operation, whereas the degree of parallelism of a non-table-scan operation is not. Whether the current operation is a table scan operation therefore decides how its degree of parallelism is calculated.
S360, calculate the degree of parallelism of the current operation according to the initial degree of parallelism of the table scan operation and the degree of parallelism impact factor of the current operation, then execute S380.
Specifically, the degree of parallelism of a table scan operation is the product of the initial degree of parallelism of the table scan operation and the degree of parallelism impact factor of the table scan operation.
S370, determine the child node of the node corresponding to the current operation, and calculate the degree of parallelism of the current operation either according to the degree of parallelism of the operation corresponding to the child node, or according to the degree of parallelism of the operation corresponding to the child node together with the degree of parallelism impact factor of the current operation; then execute S380.
As an optional implementation of this embodiment, calculating the degree of parallelism of the current operation according to the degree of parallelism of the operation corresponding to the child node, or according to the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation, is specifically:
If the current operation belongs to the fourth operation type, use the product of the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation as the degree of parallelism of the current operation;
If the current operation belongs to the fifth operation type, use the degree of parallelism of the operation corresponding to the child node as the degree of parallelism of the current operation;
If the current operation belongs to the sixth operation type, use either the sum of the degrees of parallelism of all operations corresponding to all child nodes, or the maximum of the degrees of parallelism of all operations corresponding to all child nodes, as the degree of parallelism of the current operation.
For the fourth operation type, the product of the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation is used as the degree of parallelism of the current operation. The fourth operation type includes at least one of the following: filter operations, broadcast join operations and pre-aggregation operations.
Referring to Fig. 2, suppose the current operation is the filter (f1) operation. The operation at the child node of the node of filter (f1) is the table scan (ts1:ss) operation, so the product of the degree of parallelism of table scan (ts1:ss) and the degree of parallelism impact factor of filter (f1) is used as the degree of parallelism of filter (f1). Denoting the degree of parallelism of the filter (f1) operation by P(f1), P(f1) = P(ts1:ss) × PIF(f1).
Suppose instead that the current operation is the broadcast join (mj1) operation. The broadcast-table child node of its node is the node of the projection (pj2) operation, and the fact-table child node of its node is the node of the projection (pj1) operation. The product of the degree of parallelism of the projection (pj1) operation at the fact-table child node and the degree of parallelism impact factor of the broadcast join (mj1) operation is used as the degree of parallelism of the broadcast join (mj1) operation: P(mj1) = P(pj1) × PIF(mj1).
For the fifth operation type, the degree of parallelism of the operation corresponding to the child node is used as the degree of parallelism of the current operation. The fifth operation type includes at least one of the following: aggregation operations, projection operations, sort operations and data redistribution operations.
For example, if the current operation is the aggregation (g1) operation, the operation at the child node of the node of aggregation (g1) is the data redistribution (rs3) operation, so the degree of parallelism of aggregation (g1) is the degree of parallelism of data redistribution (rs3): P(g1) = P(rs3).
For the sixth operation type, either the sum of the degrees of parallelism of all operations corresponding to all child nodes, or the maximum of the degrees of parallelism of all operations corresponding to all child nodes, is used as the degree of parallelism of the current operation. The sixth operation type includes at least one of the following: join operations, union operations, intersection operations, difference operations and default operations.
For example, if the current operation is the join (cj1) operation, the operations at all child nodes of its node are the data redistribution (rs1) operation and the data redistribution (rs2) operation, so the maximum of the degrees of parallelism of data redistribution (rs1) and data redistribution (rs2) is used as the degree of parallelism of join (cj1): P(cj1) = MAX(P(rs1), P(rs2)).
Specifically, a degree of parallelism decision model for generating the degree of parallelism, as shown in Table 3, can be constructed in advance, and the degree of parallelism of each operation in the execution plan tree is determined according to this decision model. The degrees of parallelism calculated with this decision model for the operations in the execution plan tree of Fig. 3 are shown in Fig. 5. Taking "filter (f2) PIF(0.1) P(1)" in Fig. 5 as an example, it indicates that the degree of parallelism impact factor PIF of the filter (f2) operation is 0.1 and its degree of parallelism P is 1.
Table 3: Example degree of parallelism decision model for generating the degree of parallelism
S380, judge whether all operations in the execution plan tree have been processed; if not, return to S350; if so, execute S390.
The degree of parallelism of each operation in the execution plan tree is determined in turn according to the above method until all operations have been processed.
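Steps S350 to S380 can be sketched as a recursive post-order traversal. The following is an illustration under stated assumptions rather than the patent's implementation: the `Node` class, the operation-name strings, the convention that a broadcast join lists its fact-table child last, and all numeric values are hypothetical, and the small tree does not reproduce Fig. 2.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

MULTIPLY = {"filter", "broadcast_join", "pre_aggregate"}     # fourth type
INHERIT = {"aggregate", "project", "sort", "redistribute"}   # fifth type
# Any other non-scan operation (join, union, intersect, difference, default)
# is treated as the sixth type and is assumed to have at least one child.


@dataclass
class Node:
    name: str
    op: str
    pif: float = 1.0                  # fitted degree of parallelism impact factor
    initial_parallelism: float = 0.0  # only meaningful for table scans
    children: List["Node"] = field(default_factory=list)


def compute_parallelism(
    root: Node, combine: Callable[[Iterable[float]], float] = max
) -> Dict[str, float]:
    """Post-order traversal; `combine` is max or sum for the sixth type."""
    result: Dict[str, float] = {}

    def visit(node: Node) -> None:
        for child in node.children:   # children first (post-order)
            visit(child)
        if node.op == "table_scan":   # S360
            p = node.initial_parallelism * node.pif
        elif node.op in MULTIPLY:     # S370, fourth type
            # Broadcast join: the fact-table child is assumed to be last.
            base = node.children[-1] if node.op == "broadcast_join" else node.children[0]
            p = result[base.name] * node.pif
        elif node.op in INHERIT:      # S370, fifth type
            p = result[node.children[0].name]
        else:                         # S370, sixth type
            p = combine(result[c.name] for c in node.children)
        result[node.name] = p

    visit(root)
    return result


ts1 = Node("ts1", "table_scan", initial_parallelism=40)
f1 = Node("f1", "filter", pif=0.25, children=[ts1])
rs1 = Node("rs1", "redistribute", children=[f1])
ts2 = Node("ts2", "table_scan", initial_parallelism=8)
rs2 = Node("rs2", "redistribute", children=[ts2])
cj1 = Node("cj1", "join", children=[rs1, rs2])
rs3 = Node("rs3", "redistribute", children=[cj1])
g1 = Node("g1", "aggregate", children=[rs3])

parallelism = compute_parallelism(g1)
print(parallelism["cj1"], parallelism["g1"])  # 10.0 10.0
```

Passing `combine=sum` selects the other sixth-type option named in the text, the sum of the children's degrees of parallelism, which would give 18.0 for the join in this tree.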
S390, adjust the degree of parallelism of the operation corresponding to each node in the execution plan tree according to system preset parameters to obtain the final degree of parallelism of the operation corresponding to each node in the execution plan tree.
Further, after the degrees of parallelism are determined, the final degree of parallelism of each operation in the execution plan tree can be determined in combination with the system environment and preset parameters. These preset parameters may include a fixed system degree of parallelism for specific conditions, a system minimum degree of parallelism (SPMIN), and so on. For example, if the system presets a minimum degree of parallelism SPMIN, the final degree of parallelism is MAX(P, SPMIN).
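The final adjustment in S390 can be sketched as below. Only MAX(P, SPMIN) comes from the text; letting a fixed degree of parallelism take precedence over the computed value is an assumed interpretation of "fixed degree of parallelism under specific conditions", and the function and parameter names are hypothetical.

```python
from typing import Optional


def final_parallelism(
    p: float, sp_min: Optional[float] = None, fixed: Optional[float] = None
) -> float:
    """Apply system preset parameters to a computed degree of parallelism."""
    if fixed is not None:     # assumed: a fixed value overrides everything
        return fixed
    if sp_min is not None:    # from the text: final = MAX(P, SPMIN)
        return max(p, sp_min)
    return p


print(final_parallelism(0.4, sp_min=1))  # 1
print(final_parallelism(12, sp_min=1))   # 12
```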
In summary, once the degree of parallelism decision model has been built according to the method for determining a degree of parallelism provided by the embodiment of the present invention, the pre-collected data statistics of the distributed computing task, the execution plan of the distributed computing task and the preset cost model are taken as input, and an execution plan carrying optimized degrees of parallelism is output.
In the above technical solution, the degree of parallelism impact factor PIF of each type of task is calculated from data statistics such as selectivity and aggregation rate together with the preset cost model, and serves as the scale factor that measures the influence on the degree of parallelism of subsequent operations. The final degree of parallelism P of each task is then calculated from the PIF information. Controlling parallelism with the degree of parallelism determined according to the embodiment of the present invention avoids the drawbacks of control schemes that use a fixed degree of parallelism or that merge degrees of parallelism based on absolute values or specifications, and improves the performance, stability and availability of the distributed computing engine.
Embodiment three
Fig. 6 is a structural schematic diagram of an apparatus for determining a degree of parallelism provided by Embodiment three of the present invention. It is applicable to degree of parallelism control in distributed computing tasks. The apparatus can be implemented in software and/or hardware and can generally be integrated in a processor.
As shown in Fig. 6, the apparatus for determining a degree of parallelism specifically includes: an execution plan tree acquisition module 610, a degree of parallelism impact factor determining module 620, a table scan initial degree of parallelism determining module 630 and a degree of parallelism determining module 640, wherein:
The execution plan tree acquisition module 610 is configured to obtain the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
The degree of parallelism impact factor determining module 620 is configured to determine, according to the preset cost model and the data statistics of the distributed computing task, the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree;
The table scan initial degree of parallelism determining module 630 is configured to determine the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task;
The degree of parallelism determining module 640 is configured to calculate, starting from the initial degree of parallelism of the table scan operation and proceeding in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
The apparatus for determining a degree of parallelism provided by this embodiment first obtains the execution plan tree of the distributed computing task. It then determines, according to the preset cost model and the data statistics of the distributed computing task, the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree, which indicates the influence on the degree of parallelism of subsequent operations. Next, it determines the initial degree of parallelism of the table scan operation in the execution plan tree according to the preset cost model and the data statistics of the distributed computing task. Finally, it determines the degree of parallelism of the operation corresponding to each node in the execution plan tree from the initial degree of parallelism of the table scan operation and the degree of parallelism impact factors of the operations corresponding to the nodes, thereby obtaining the degree of parallelism control scheme of the distributed computing task. Determining the degree of parallelism in this way avoids the drawbacks of control schemes that use a fixed degree of parallelism or that merge degrees of parallelism based on absolute values or specifications, improves the performance, stability and availability of the distributed computing engine, and makes degree of parallelism control adaptive.
Further, the degree of parallelism impact factor determining module 620 specifically includes an initial degree of parallelism impact factor determining unit and an initial degree of parallelism impact factor fitting unit, wherein:
The initial degree of parallelism impact factor determining unit is configured to determine, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree;
The initial degree of parallelism impact factor fitting unit is configured to fit each initial degree of parallelism impact factor according to the preset mapping rule to obtain the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Further, the initial degree of parallelism impact factor determining unit performs at least one of the following:
If the operation belongs to the first operation type, calculating the initial degree of parallelism impact factor of the operation according to the cost model and the data statistics of the distributed computing task;
If the operation belongs to the second operation type, using the baseline impact factor determined according to the cost model and the data statistics of the distributed computing task as the initial degree of parallelism impact factor of the operation;
If the operation belongs to the third operation type, determining the child node of the node corresponding to the operation, and using the initial degree of parallelism impact factor of the operation corresponding to the child node as the initial degree of parallelism impact factor of the operation.
Specifically, the first operation type includes at least one of the following: filter operations and pre-aggregation operations;
The second operation type includes at least one of the following: table scan operations, join operations, aggregation operations, union operations, intersection operations and difference operations;
The third operation type includes at least one of the following: broadcast join operations, projection operations, sort operations, data redistribution operations and default operations.
Further, the initial degree of parallelism impact factor fitting unit is specifically configured to adjust each initial degree of parallelism impact factor into a preset interval according to the preset mapping rule, and to use each adjusted result as the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Further, the degree of parallelism determining module 640 specifically includes:
An acquiring unit, configured to successively obtain, in post-order traversal order, the operation corresponding to one node as the current operation;
A first computing unit, configured to calculate, if the current operation is a table scan operation, the degree of parallelism of the current operation according to the initial degree of parallelism of the table scan operation and the degree of parallelism impact factor of the current operation;
A second computing unit, configured to determine, if the current operation is not a table scan operation, the child node of the node corresponding to the current operation, and to calculate the degree of parallelism of the current operation according to the degree of parallelism of the operation corresponding to the child node, or according to the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation;
A loop unit, configured to return to the step of successively obtaining, in post-order traversal order, the operation corresponding to one node as the current operation, until all operations in the execution plan tree have been processed.
Further, the second computing unit specifically includes a first subunit, a second subunit and a third subunit, wherein:
The first subunit of the second computing unit is configured to use, if the current operation belongs to the fourth operation type, the product of the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation as the degree of parallelism of the current operation;
The second subunit of the second computing unit is configured to use, if the current operation belongs to the fifth operation type, the degree of parallelism of the operation corresponding to the child node as the degree of parallelism of the current operation;
The third subunit of the second computing unit is configured to use, if the current operation belongs to the sixth operation type, either the sum of the degrees of parallelism of all operations corresponding to all the child nodes, or the maximum of the degrees of parallelism of all operations corresponding to all the child nodes, as the degree of parallelism of the current operation.
Further, the fourth operation type includes at least one of the following: filter operations, broadcast join operations and pre-aggregation operations;
The fifth operation type includes at least one of the following: aggregation operations, projection operations, sort operations and data redistribution operations;
The sixth operation type includes at least one of the following: join operations, union operations, intersection operations, difference operations and default operations.
Further, the table scan initial degree of parallelism determining module 630 is specifically configured to calculate the initial degree of parallelism of the table scan operation according to the preset cost model and the amount of data processed on the map side of the distributed computing task.
Further, the apparatus for determining a degree of parallelism also includes a degree of parallelism adjusting module, configured to adjust, after the degree of parallelism of the operation corresponding to each node in the execution plan tree has been calculated, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to system preset parameters, to obtain the final degree of parallelism of the operation corresponding to each node in the execution plan tree.
The above apparatus for determining a degree of parallelism can execute the method for determining a degree of parallelism provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to executing that method.
Embodiment four
Fig. 7 is a hardware structural diagram of a device provided by Embodiment four of the present invention. As shown in Fig. 7, the device includes:
One or more processors 710, with one processor 710 taken as an example in Fig. 7;
A memory 720;
The device may further include an input device 730 and an output device 740.
The processor 710, memory 720, input device 730 and output device 740 in the device may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 7.
As a non-transitory computer-readable storage medium, the memory 720 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for determining a degree of parallelism in the embodiments of the present invention (for example, the execution plan tree acquisition module 610, the degree of parallelism impact factor determining module 620, the table scan initial degree of parallelism determining module 630 and the degree of parallelism determining module 640 shown in Fig. 6). By running the software programs, instructions and modules stored in the memory 720, the processor 710 executes the various functional applications and data processing of the computer device, that is, implements the method for determining a degree of parallelism of the above method embodiments.
The memory 720 may include a program storage area and a data storage area, wherein the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created through use of the computer device, and the like. In addition, the memory 720 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device. In some embodiments, the memory 720 optionally includes memory located remotely from the processor 710, and such remote memory may be connected to the terminal device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input device 730 can be used to receive input numeric or character information and to generate key signal inputs related to the user settings and function control of the computer device. The output device 740 may include a display device such as a display screen.
Embodiment five
Embodiment five of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a method for determining a degree of parallelism, the method comprising:
Obtaining the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
Determining, according to a preset cost model and the data statistics of the distributed computing task, the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree;
Determining the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task;
Calculating, starting from the initial degree of parallelism of the table scan operation and proceeding in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Optionally, when executed by a computer processor, the computer-executable instructions can also be used to perform the technical solution of the method for determining a degree of parallelism provided by any embodiment of the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software together with the necessary general-purpose hardware, and of course also by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.
It is worth noting that, in the above embodiment of the apparatus for determining a degree of parallelism, the units and modules included are divided only according to functional logic; the division is not limited to the one described above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only for ease of distinguishing them from one another and are not intended to limit the protection scope of the present invention.
Note that the above describes only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made by those skilled in the art without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them; without departing from the inventive concept, it may include further equivalent embodiments, and its scope is determined by the scope of the appended claims.
Claims (13)
1. A method for determining a degree of parallelism, characterized by comprising:
obtaining an execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
determining, according to a preset cost model and data statistics of the distributed computing task, a degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree;
determining, according to the preset cost model and the data statistics of the distributed computing task, an initial degree of parallelism of the table scan operation; and
calculating, in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree, based on the initial degree of parallelism of the table scan operation and the degree-of-parallelism impact factor of the operation corresponding to each node.
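The tree shape and traversal order recited in claim 1 can be illustrated with a minimal Python sketch. The `PlanNode` class, the operation names, and the example plan are hypothetical and chosen only for illustration; the claims do not prescribe any particular data structure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class PlanNode:
    """One operation in an execution plan tree (hypothetical structure)."""
    op: str  # illustrative operation name, e.g. "output", "join", "scan"
    children: List["PlanNode"] = field(default_factory=list)


def postorder(node: PlanNode) -> Iterator[PlanNode]:
    """Yield every child subtree before the node itself."""
    for child in node.children:
        yield from postorder(child)
    yield node


# Illustrative plan: the root is the output operation, the leaves are scans.
plan = PlanNode("output", [
    PlanNode("join", [
        PlanNode("filter", [PlanNode("scan")]),
        PlanNode("scan"),
    ]),
])

order = [n.op for n in postorder(plan)]
print(order)
```

Because a post-order traversal visits children first, each operation's degree of parallelism can be derived from values already computed for its inputs, and the root (output) operation is processed last.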
2. The method according to claim 1, characterized in that determining, according to the preset cost model and the data statistics of the distributed computing task, the degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree comprises:
determining, according to the preset cost model and the data statistics of the distributed computing task, an initial degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree; and
fitting each initial degree-of-parallelism impact factor according to a preset mapping rule to obtain the degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree.
3. The method according to claim 2, characterized in that determining, according to the preset cost model and the data statistics of the distributed computing task, the initial degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree comprises at least one of the following:
if the operation belongs to a first operation type, calculating the initial degree-of-parallelism impact factor of the operation according to the cost model and the data statistics of the distributed computing task;
if the operation belongs to a second operation type, using a baseline impact factor, determined according to the cost model and the data statistics of the distributed computing task, as the initial degree-of-parallelism impact factor of the operation;
if the operation belongs to a third operation type, determining the child node of the node corresponding to the operation, and using the initial degree-of-parallelism impact factor of the operation corresponding to the child node as the initial degree-of-parallelism impact factor of the operation.
4. The method according to claim 3, characterized in that the first operation type comprises at least one of the following: a filter operation and a pre-aggregation operation;
the second operation type comprises at least one of the following: a table scan operation, a join operation, an aggregation operation, a union operation, an intersection operation, and a difference operation;
the third operation type comprises at least one of the following: a broadcast join operation, a projection operation, a sort operation, a data redistribution operation, and a default operation.
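The three-way classification in claims 3 and 4 can be sketched as follows. This is only an illustration under stated assumptions: the claims do not give the cost-model formula, so the sketch assumes that a first-type operation's factor is its selectivity (output rows divided by input rows), that the baseline factor is a constant 1.0, and that a third-type operation inherits from its first child. The names `PlanNode`, `rows_in`, `rows_out`, and `BASELINE_FACTOR` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Operation classes from claim 4; the string names are illustrative.
FIRST_TYPE = {"filter", "pre_aggregate"}
SECOND_TYPE = {"scan", "join", "aggregate", "union", "intersect", "difference"}
# Anything else (broadcast join, projection, sort, redistribution, default)
# is treated as the third type.

BASELINE_FACTOR = 1.0  # assumption: the real baseline comes from the cost model


@dataclass
class PlanNode:
    op: str
    rows_in: float = 0.0   # assumed statistic: estimated input rows
    rows_out: float = 0.0  # assumed statistic: estimated output rows
    children: List["PlanNode"] = field(default_factory=list)


def initial_factor(node: PlanNode) -> float:
    """Initial degree-of-parallelism impact factor for one operation."""
    if node.op in FIRST_TYPE:
        # Assumed formula: selectivity estimated from the data statistics.
        return node.rows_out / node.rows_in if node.rows_in else BASELINE_FACTOR
    if node.op in SECOND_TYPE:
        return BASELINE_FACTOR
    # Third type: inherit the child's factor (first child, by assumption).
    if node.children:
        return initial_factor(node.children[0])
    return BASELINE_FACTOR


scan = PlanNode("scan", rows_in=1000, rows_out=1000)
flt = PlanNode("filter", rows_in=1000, rows_out=250, children=[scan])
srt = PlanNode("sort", children=[flt])
print(initial_factor(scan), initial_factor(flt), initial_factor(srt))
```

In this example the sort operation has no factor of its own and simply reuses the filter's value, which is the behavior the third operation type describes.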
5. The method according to claim 2, characterized in that fitting each initial degree-of-parallelism impact factor according to the preset mapping rule to obtain the degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree comprises:
adjusting each initial degree-of-parallelism impact factor into a preset interval range according to the preset mapping rule, and using each adjusted result as the degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree.
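Claim 5 does not specify the mapping rule or the interval. As a minimal sketch, the simplest rule that brings a value into a preset interval is clamping; the bounds `low=0.5` and `high=2.0` below are assumptions made purely for illustration.

```python
def fit_factor(initial: float, low: float = 0.5, high: float = 2.0) -> float:
    """Map an initial impact factor into [low, high] by clamping.

    Clamping is an assumed mapping rule; the bounds are illustrative.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    return min(max(initial, low), high)


print(fit_factor(0.01), fit_factor(1.2), fit_factor(10.0))
```

Bounding the factor in this way would keep a single extreme estimate (for example, a very selective filter) from driving the degree of parallelism of downstream operations toward zero or toward an unreasonably large value.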
6. The method according to claim 1, characterized in that calculating, in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree, based on the initial degree of parallelism of the table scan operation and the degree-of-parallelism impact factor of the operation corresponding to each node, comprises:
obtaining, in post-order traversal order, the operation corresponding to the next node as the current operation;
if the current operation is a table scan operation, calculating the degree of parallelism of the current operation according to the initial degree of parallelism of the table scan operation and the degree-of-parallelism impact factor of the current operation;
if the current operation is not a table scan operation, determining the child node of the node corresponding to the current operation, and calculating the degree of parallelism of the current operation either according to the degree of parallelism of the operation corresponding to the child node, or according to the degree of parallelism of the operation corresponding to the child node and the degree-of-parallelism impact factor of the current operation;
returning to the step of obtaining, in post-order traversal order, the operation corresponding to the next node as the current operation, until all operations in the execution plan tree have been processed.
7. The method according to claim 6, characterized in that calculating the degree of parallelism of the current operation either according to the degree of parallelism of the operation corresponding to the child node, or according to the degree of parallelism of the operation corresponding to the child node and the degree-of-parallelism impact factor of the current operation, comprises:
if the current operation belongs to a fourth operation type, using the product of the degree of parallelism of the operation corresponding to the child node and the degree-of-parallelism impact factor of the current operation as the degree of parallelism of the current operation;
if the current operation belongs to a fifth operation type, using the degree of parallelism of the operation corresponding to the child node as the degree of parallelism of the current operation;
if the current operation belongs to a sixth operation type, using either the sum of the degrees of parallelism of the operations corresponding to all child nodes, or the maximum of the degrees of parallelism of the operations corresponding to all child nodes, as the degree of parallelism of the current operation.
8. The method according to claim 7, characterized in that the fourth operation type comprises at least one of the following: a filter operation, a broadcast join operation, and a pre-aggregation operation;
the fifth operation type comprises at least one of the following: an aggregation operation, a projection operation, a sort operation, and a data redistribution operation;
the sixth operation type comprises at least one of the following: a join operation, a union operation, an intersection operation, a difference operation, and a default operation.
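The bottom-up calculation in claims 6 to 8 can be sketched as a recursive post-order pass. Several details are assumptions not stated in the claims: a scan's degree of parallelism is taken as the product of the initial scan value and its factor, results are rounded up to an integer of at least 1, the first child of a broadcast join is treated as the input whose parallelism is kept, and the choice between sum and maximum for sixth-type operations is exposed as a `combine` parameter. All names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Operation classes from claim 8; the string names are illustrative.
FOURTH_TYPE = {"filter", "broadcast_join", "pre_aggregate"}   # child DOP x factor
FIFTH_TYPE = {"aggregate", "projection", "sort", "redistribute"}  # child DOP
# Everything else (join, union, intersect, difference, default) is sixth type.


@dataclass
class PlanNode:
    op: str
    factor: float = 1.0  # degree-of-parallelism impact factor
    children: List["PlanNode"] = field(default_factory=list)
    dop: int = 0         # filled in by assign_dop


def assign_dop(node: PlanNode, scan_dop: int,
               combine: Callable[[Iterable[int]], int] = max) -> int:
    """Post-order pass: children are processed before the node itself."""
    for child in node.children:
        assign_dop(child, scan_dop, combine)

    if node.op == "scan":
        raw = scan_dop * node.factor              # assumed: product
    elif not node.children:
        raise ValueError(f"non-scan operation {node.op!r} has no child")
    elif node.op in FOURTH_TYPE:
        raw = node.children[0].dop * node.factor  # child DOP times factor
    elif node.op in FIFTH_TYPE:
        raw = node.children[0].dop                # inherit child DOP
    else:
        raw = combine([c.dop for c in node.children])  # sum or max
    node.dop = max(1, math.ceil(raw))             # assumed rounding rule
    return node.dop


def example_plan() -> PlanNode:
    filtered = PlanNode("filter", 0.25, [PlanNode("scan")])
    join = PlanNode("join", children=[filtered, PlanNode("scan")])
    return PlanNode("output", children=[join])


print(assign_dop(example_plan(), scan_dop=8))
print(assign_dop(example_plan(), scan_dop=8, combine=sum))
```

With an initial scan value of 8, the filtered branch drops to 2 while the unfiltered scan stays at 8; the join then takes either the maximum or the sum of its two inputs, and the output operation inherits the join's value.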
9. The method according to claim 1, characterized in that determining, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism of the table scan operation comprises:
calculating the initial degree of parallelism of the table scan operation according to the preset cost model and the amount of data processed on the map side of the distributed computing task.
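Claim 9 ties the initial scan parallelism to the map-side data volume without giving a formula. One plausible sketch, stated here only as an assumption, divides the input size by a target amount of data per task and rounds up; the 128 MiB default and the function name are illustrative.

```python
import math


def initial_scan_dop(map_input_bytes: int,
                     bytes_per_task: int = 128 * 1024 * 1024) -> int:
    """Initial scan parallelism from map-side input size.

    Assumed rule: one task per `bytes_per_task` of input, at least one task.
    """
    if bytes_per_task <= 0:
        raise ValueError("bytes_per_task must be positive")
    return max(1, math.ceil(map_input_bytes / bytes_per_task))


print(initial_scan_dop(1024 ** 3))  # 1 GiB of map-side input
```

Rounding up ensures that a partially filled last chunk still gets its own task, and the lower bound of 1 covers empty or very small inputs.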
10. The method according to any one of claims 1-9, characterized in that, after calculating the degree of parallelism of the operation corresponding to each node in the execution plan tree, the method further comprises:
adjusting, according to system preset parameters, the degree of parallelism of the operation corresponding to each node in the execution plan tree to obtain the final degree of parallelism of the operation corresponding to each node in the execution plan tree.
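The claim does not name the system preset parameters used in this final adjustment. As a hedged sketch, suppose they are a lower and an upper bound on the degree of parallelism of any operation (for example, reflecting available cluster resources); `min_dop` and `max_dop` are hypothetical names.

```python
from typing import Dict


def finalize_dops(dops: Dict[str, int],
                  min_dop: int = 1, max_dop: int = 64) -> Dict[str, int]:
    """Clamp each computed degree of parallelism into [min_dop, max_dop].

    The bounds stand in for unspecified system preset parameters.
    """
    if min_dop > max_dop:
        raise ValueError("min_dop must not exceed max_dop")
    return {name: min(max(d, min_dop), max_dop) for name, d in dops.items()}


computed = {"scan": 200, "filter": 50, "join": 0}
print(finalize_dops(computed))
```

A final pass of this kind keeps the per-operation calculation independent of deployment limits, which are applied once at the end.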
11. An apparatus for determining a degree of parallelism, characterized by comprising:
an execution plan tree obtaining module, configured to obtain an execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
a degree-of-parallelism impact factor determining module, configured to determine, according to a preset cost model and data statistics of the distributed computing task, a degree-of-parallelism impact factor of the operation corresponding to each node in the execution plan tree;
a table scan initial degree-of-parallelism determining module, configured to determine, according to the preset cost model and the data statistics of the distributed computing task, an initial degree of parallelism of the table scan operation; and
a degree-of-parallelism determining module, configured to calculate, in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree, based on the initial degree of parallelism of the table scan operation and the degree-of-parallelism impact factor of the operation corresponding to each node.
12. A device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1-10 when executing the program.
13. A computer-readable storage medium storing a computer program, characterized in that the program implements the method according to any one of claims 1-10 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811436295.1A (published as CN109558232B) | 2018-11-28 | 2018-11-28 | Determination method, apparatus, equipment and the medium of degree of parallelism |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109558232A | 2019-04-02 |
CN109558232B | 2019-08-23 |
Family
ID=65867926
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201811436295.1A | Active | CN109558232B | 2018-11-28 | 2018-11-28 | Determination method, apparatus, equipment and the medium of degree of parallelism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109558232B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514229A (en) * | 2012-06-29 | 2014-01-15 | 国际商业机器公司 | Method and device used for processing database data in distributed database system |
US20140282585A1 (en) * | 2013-03-13 | 2014-09-18 | Barracuda Networks, Inc. | Organizing File Events by Their Hierarchical Paths for Multi-Threaded Synch and Parallel Access System, Apparatus, and Method of Operation |
US20150193270A1 (en) * | 2014-01-06 | 2015-07-09 | International Business Machines Corporation | Constructing a logical tree topology in a parallel computer |
CN107025273A (en) * | 2017-03-17 | 2017-08-08 | 南方电网科学研究院有限责任公司 | The optimization method and device of a kind of data query |
CN108319722A (en) * | 2018-02-27 | 2018-07-24 | 北京小度信息科技有限公司 | Data access method, device, electronic equipment and computer readable storage medium |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112187285A (en) * | 2020-09-18 | 2021-01-05 | 中科院计算技术研究所南京移动通信与计算创新研究院 | Processing method of barrel shifter based on DVB-S2 decoder and barrel shifter |
CN112187285B (en) * | 2020-09-18 | 2024-02-27 | 南京中科晶上通信技术有限公司 | Barrel shifter processing method based on DVB-S2 decoder and barrel shifter |
CN113535354A (en) * | 2021-06-30 | 2021-10-22 | 深圳市云网万店电子商务有限公司 | Method and device for adjusting parallelism of Flink SQL operator |
WO2024078080A1 (en) * | 2022-10-14 | 2024-04-18 | 华为技术有限公司 | Database query method and apparatus, and device and medium |
Also Published As
Publication number | Publication date |
---|---|
CN109558232B (en) | 2019-08-23 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
CP01 | Change in the name or title of a patent holder | Address after: 200233 11-12/F, Building B, 88 Hongcao Road, Xuhui District, Shanghai. Patentee after: Star link information technology (Shanghai) Co.,Ltd. Address before: 200233 11-12/F, Building B, 88 Hongcao Road, Xuhui District, Shanghai. Patentee before: TRANSWARP TECHNOLOGY (SHANGHAI) Co.,Ltd. |