CN109558232A - Method, apparatus, device and medium for determining degree of parallelism - Google Patents

Publication number
CN109558232A
Authority
CN
China
Prior art keywords
parallelism
degree
node
executive plan
impact factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811436295.1A
Other languages
Chinese (zh)
Other versions
CN109558232B (en)
Inventor
陈振强
熊仲健
刘汪根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Transwarp Technology Shanghai Co Ltd
Original Assignee
Star Link Information Technology (shanghai) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Star Link Information Technology (shanghai) Co Ltd
Priority to CN201811436295.1A
Publication of CN109558232A
Application granted
Publication of CN109558232B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806: Task transfer initiation or dispatching
    • G06F9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027: Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038: Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061: Partitioning or combining of resources
    • G06F9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Complex Calculations (AREA)

Abstract

Embodiments of the present invention disclose a method, apparatus, device and medium for determining a degree of parallelism. The method comprises: obtaining an execution plan tree of a distributed computing task; determining, according to a preset cost model and data statistics of the distributed computing task, a parallelism impact factor for the operation corresponding to each node in the execution plan tree; determining, according to the preset cost model and the data statistics of the distributed computing task, an initial degree of parallelism of the table scan operations in the execution plan tree; and, starting from the initial degree of parallelism of the table scan operations, calculating in post-order traversal order the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factors of those operations. The method avoids the drawbacks of prior-art parallelism control schemes, improves the performance, stability and usability of the distributed computing engine, and makes parallelism control adaptive.

Description

Method, apparatus, device and medium for determining degree of parallelism
Technical field
Embodiments of the present invention relate to the field of computer technology, and in particular to a method, apparatus, device and medium for determining a degree of parallelism.
Background
A common distributed computing engine framework comprises three stages: map, shuffle and reduce. Data is first processed on the map side, then regrouped on the shuffle side according to certain rules, and finally computed on the reduce side. The degree of parallelism determines how many map tasks and how many reduce tasks run simultaneously in the system, and is a key factor affecting the parallel execution efficiency and stability of a distributed system.
A degree of parallelism that is too large or too small harms both execution efficiency and stability. If it is too large, each task processes very little data, so the capacity of each execution unit is not fully used, and there are further drawbacks such as occupied resources, high scheduling overhead and too many small files. If it is too small, individual tasks become too heavy and the system's CPU/MEM/IO pressure is high, which slows performance and can even affect system stability; because the tasks are concentrated on a limited number of child nodes, other concurrent execution units also cannot be used effectively. How to control the degree of parallelism is therefore an important problem in applying distributed computing engines.
Current parallelism schemes for distributed computing engines focus mainly on control at the reduce side and lack control at the map side. Common approaches are as follows:
(1) Fixed parallelism control: a fixed, uniform number of reducers is set to determine the system's default degree of parallelism.
This approach is simple and direct, but it cannot adapt to the different data distributions and computation types of different computing tasks, which affects system performance and stability. When a fixed degree of parallelism is instead set manually for a specific computing task, usability and scalability are poor.
(2) Parallelism control based on absolute-value calculation: the number of reducers N is determined from the estimated total shuffle data volume of the map side, totalShuffleSize, and the preset data volume processed by a single reducer, dataSizePerReducer, where N = totalShuffleSize / dataSizePerReducer. In this approach the number of reducers is calculated from absolute values; a typical system is Hive.
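The formula in approach (2) can be illustrated with a minimal Python sketch. The function name reducer_count, the rounding up, and the lower bound of one reducer are assumptions made for illustration; the sketch is not the actual implementation of Hive or any other engine.

```python
import math


def reducer_count(total_shuffle_size: int, data_size_per_reducer: int) -> int:
    """N = totalShuffleSize / dataSizePerReducer.

    Rounding up and the floor of one reducer are assumptions of this sketch.
    """
    if data_size_per_reducer <= 0:
        raise ValueError("data_size_per_reducer must be positive")
    return max(1, math.ceil(total_shuffle_size / data_size_per_reducer))


MIB = 1024 ** 2
GIB = 1024 ** 3

# With a fixed 256 MiB per reducer, N follows the shuffle-size estimate directly,
# so any error in the estimate carries over proportionally to N.
n_estimated = reducer_count(10 * GIB, 256 * MIB)
n_if_estimate_is_4x_too_low = reducer_count(40 * GIB, 256 * MIB)
```

Because dataSizePerReducer is a preset constant, the result depends entirely on how accurately totalShuffleSize is estimated.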
This approach requires a preset per-reducer data volume parameter and cannot adapt to the different data distributions and computation types of different computing tasks. In addition, the total shuffle data volume, being an absolute figure, is prone to inaccurate estimation.
(3) Parallelism control based on reduce-side merging: after the map tasks finish, the data shard information recorded by the runtime environment is used to merge partitions, combining multiple small tasks into reduce tasks of a preset size and thereby determining the reducers. Typical systems include certain extensions of Spark.
In this approach the reducers are formed by merging map partitions, which does not solve problems such as resource occupation and too many small files in the map stage. It also requires dynamically modifying the execution plan, which is highly invasive to the distributed computing engine framework.
Summary of the invention
Embodiments of the present invention provide a method, apparatus, device and medium for determining a degree of parallelism, which optimize prior-art parallelism control methods, make parallelism control adaptive, and improve the performance, stability and usability of the distributed computing engine.
In a first aspect, an embodiment of the present invention provides a method for determining a degree of parallelism, comprising:
obtaining an execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
determining, according to a preset cost model and data statistics of the distributed computing task, a parallelism impact factor for the operation corresponding to each node in the execution plan tree;
determining, according to the preset cost model and the data statistics of the distributed computing task, an initial degree of parallelism of the table scan operation; and
starting from the initial degree of parallelism of the table scan operation, calculating in post-order traversal order the degree of parallelism of the operation corresponding to each node in the execution plan tree, according to the parallelism impact factor of the operation corresponding to each node.
In a second aspect, an embodiment of the present invention further provides an apparatus for determining a degree of parallelism, comprising:
an execution plan tree obtaining module, configured to obtain an execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
a parallelism impact factor determining module, configured to determine, according to a preset cost model and data statistics of the distributed computing task, a parallelism impact factor for the operation corresponding to each node in the execution plan tree;
a table scan initial parallelism determining module, configured to determine, according to the preset cost model and the data statistics of the distributed computing task, an initial degree of parallelism of the table scan operation; and
a parallelism determining module, configured to calculate, starting from the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree, according to the parallelism impact factor of the operation corresponding to each node.
In a third aspect, an embodiment of the present invention further provides a device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method for determining a degree of parallelism provided by any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for determining a degree of parallelism provided by any embodiment of the present invention.
In the method, apparatus, device and medium for determining a degree of parallelism provided by the embodiments of the present invention, the execution plan tree of a distributed computing task is first obtained. Then, according to a preset cost model and the data statistics of the distributed computing task, a parallelism impact factor is determined for the operation corresponding to each node in the execution plan tree; this factor indicates how the operation influences the degree of parallelism of subsequent operations. Next, the initial degree of parallelism of the table scan operations in the execution plan tree is determined according to the preset cost model and the data statistics of the distributed computing task. Finally, the degree of parallelism of the operation corresponding to each node is determined from the initial degree of parallelism of the table scan operations and the parallelism impact factors, which yields the parallelism control scheme for the distributed computing task. This method avoids the drawbacks of fixed parallelism and of parallelism control schemes based on absolute values or reduce-side merging, improves the performance, stability and usability of the distributed computing engine, and makes parallelism control adaptive.
Brief description of the drawings
Fig. 1 is a flowchart of a method for determining a degree of parallelism in Embodiment 1 of the present invention;
Fig. 2 is an example diagram of an execution plan tree in Embodiment 1 of the present invention;
Fig. 3 is a flowchart of a method for determining a degree of parallelism in Embodiment 2 of the present invention;
Fig. 4 is an example diagram of an execution plan tree with generated parallelism impact factors in Embodiment 2 of the present invention;
Fig. 5 is an example diagram of an execution plan tree with generated degrees of parallelism in Embodiment 2 of the present invention;
Fig. 6 is a schematic structural diagram of an apparatus for determining a degree of parallelism in Embodiment 3 of the present invention;
Fig. 7 is a schematic hardware structure diagram of a device in Embodiment 4 of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Before the exemplary embodiments are discussed in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations can be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on.
Embodiment 1
Fig. 1 is a flowchart of a method for determining a degree of parallelism provided by Embodiment 1 of the present invention. The method is applicable to parallelism control in distributed computing tasks and can be executed by the apparatus for determining a degree of parallelism provided by the embodiments of the present invention; the apparatus can be implemented in software and/or hardware and can generally be integrated in a processor. As shown in Fig. 1, the method of this embodiment specifically includes:
S110: Obtain the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan.
The execution plan of a distributed computing task is usually generated by a SQL compiler and is an abstract description of each stage of the task. Take the following SQL statement as an example:
select i_manager_id, i_size, count(sr_ticket_number)
from store_sales ss join item i on ss.ss_item_sk = i.i_item_sk
join store_returns sr on ss.ss_customer_sk = sr.sr_customer_sk
where ss_store_sk > 10 and i_item_id = 'A'
group by i_manager_id, i_size;
The execution plan tree corresponding to the above SQL statement is shown in Fig. 2. It represents the following process: the "store_sales" table and the "item" table are each filtered and projected and then joined by a broadcast join; the result is joined with the "store_returns" table; finally, the result is aggregated. The broadcast join (mj1) operation involves no shuffle, whereas the join (cj1) operation and the aggregation (g1) operation require a shuffle.
That is, the leaf nodes of the execution plan tree correspond to the table scan operations on the "store_sales", "item" and "store_returns" tables.
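The shape of such an execution plan tree can be sketched in Python as follows. Fig. 2 is not reproduced here, so the tree below is an assumed reconstruction from the prose: the class PlanNode, the helper leaves, and the node labels "f2" and "out" are hypothetical, and the projection and pre-aggregation nodes are omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    name: str                       # label of the operation, e.g. "cj1"
    op_type: str                    # kind of operation, e.g. "join"
    children: List["PlanNode"] = field(default_factory=list)


def leaves(node: PlanNode) -> List[PlanNode]:
    """Return the leaf nodes of the subtree rooted at node, left to right."""
    if not node.children:
        return [node]
    found: List[PlanNode] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


# Leaves: one table scan per table in the SQL statement.
ts1 = PlanNode("ts1:ss", "table_scan")
ts2 = PlanNode("ts2:i", "table_scan")
ts3 = PlanNode("ts3:sr", "table_scan")

f1 = PlanNode("f1", "filter", [ts1])             # ss_store_sk > 10
f2 = PlanNode("f2", "filter", [ts2])             # i_item_id = 'A' (hypothetical label)
mj1 = PlanNode("mj1", "broadcast_join", [f1, f2])  # no shuffle
cj1 = PlanNode("cj1", "join", [mj1, ts3])        # needs a shuffle
g1 = PlanNode("g1", "aggregate", [cj1])          # needs a shuffle
root = PlanNode("out", "output", [g1])           # root is the output operation

leaf_names = [n.name for n in leaves(root)]
```

In this sketch every leaf is a table scan and the root is the output operation, which matches the structure required in S110.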
S120: Determine, according to a preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree.
The parallelism impact factor is the scale factor by which the current operation influences the degree of parallelism of subsequent operations. For example, in the execution plan tree shown in Fig. 2, the parallelism impact factor of the filter (f1) operation is the scale factor by which it influences the degree of parallelism of its subsequent operations.
In this step, the parallelism impact factor of each operation in the execution plan tree is determined separately; how the factor is determined depends on the specific operation type.
Specifically, a parallelism decision model for generating parallelism impact factors may be established in advance. This model performs the step, in the method provided by the embodiment of the present invention, of determining the parallelism impact factor of each operation according to the preset cost model and the data statistics of the distributed computing task.
S130: Determine, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism of the table scan operation.
Determining the initial degree of parallelism of the table scan operation is the basis of parallelism control for the entire distributed computing task, because the parallelism control of subsequent computation is determined from it.
Most existing parallelism schemes determine the degree of parallelism of table scan operations, i.e. of the initial tasks, according to the number of file shards. This can lead to problems such as too many initial tasks, occupied concurrent computing resources and too many small files.
In this embodiment, the initial degree of parallelism of the table scan operation is instead determined according to the preset cost model and the data statistics of the distributed computing task.
As an optional implementation of this embodiment, determining the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task may specifically be: calculating the initial degree of parallelism of the table scan operation according to the preset cost model and the data processing volume of the map side in the distributed computing task.
That is, the initial degree of parallelism of the table scan operation is estimated from the preset cost model and the data processing volume of the map side in the distributed computing task.
Assume that the output of the cost model is cost and that the data volume processed by a single map task is dataSizePerMap. The initial degree of parallelism IP of the table scan operation can then be calculated as IP = cost / dataSizePerMap. The initial degree of parallelism determined in this way is not a fixed value; it varies with the cost model output and with the data volume processed by a single map task, so the initial degree of parallelism of the table scan operation is determined adaptively.
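The calculation IP = cost / dataSizePerMap can be sketched as below. The function name, the assumption that cost is expressed in the same unit as dataSizePerMap (bytes here), the rounding up, and the lower bound of 1 are all illustrative assumptions; the description states only the ratio.

```python
import math


def initial_scan_parallelism(cost: float, data_size_per_map: float) -> int:
    """IP = cost / dataSizePerMap for one table scan operation.

    Assumptions of this sketch: cost and data_size_per_map share a unit,
    the ratio is rounded up, and at least one map task is always used.
    """
    if data_size_per_map <= 0:
        raise ValueError("data_size_per_map must be positive")
    return max(1, math.ceil(cost / data_size_per_map))


MIB = 1024 ** 2
GIB = 1024 ** 3

# The same per-map volume gives different initial parallelism for different
# cost estimates, which is what makes the value adaptive rather than fixed.
ip_large_table = initial_scan_parallelism(64 * GIB, 128 * MIB)
ip_small_table = initial_scan_parallelism(300 * MIB, 128 * MIB)
```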
Compared with the parallelism control scheme based on reduce-side merging, the above technical solution performs initial parallelism control at the map side based on cost estimation. This remedies the lack of map-side control in existing parallelism schemes, avoids drawbacks such as system resource occupation and too many small files, and requires no invasive changes to the distributed computing engine.
S140: Starting from the initial degree of parallelism of the table scan operation, calculate in post-order traversal order the degree of parallelism of the operation corresponding to each node in the execution plan tree, according to the parallelism impact factor of the operation corresponding to each node.
A post-order traversal of the execution plan tree is performed to determine the degree of parallelism of the operation on each node. Taking the execution plan tree shown in Fig. 2 as an example, the degree of parallelism of each operation is determined in turn, following the direction of the arrows.
Specifically, a parallelism decision model for determining degrees of parallelism may be established in advance. This model performs the step, in the method provided by the embodiment of the present invention, of calculating in post-order traversal order the degree of parallelism of the operation corresponding to each node in the execution plan tree, starting from the initial degree of parallelism of the table scan operation and using the parallelism impact factors.
The degree of parallelism of a table scan operation depends on its initial degree of parallelism and its parallelism impact factor. For any other operation, the degree of parallelism depends either on its own parallelism impact factor together with the degree of parallelism of the operations on the child nodes of its node, or only on the degree of parallelism of the operations on those child nodes.
It is worth noting that the parallelism decision model for generating parallelism impact factors and the parallelism decision model for determining degrees of parallelism mentioned above can be the same model. That is, the model first determines the parallelism impact factor of each operation according to the preset cost model and the data statistics of the distributed computing task, using it as the scale factor that measures the influence on the degree of parallelism of subsequent operations. It then calculates the degree of parallelism of each operation in post-order traversal order, starting from the initial degree of parallelism of the table scan operation and using the parallelism impact factor of the operation corresponding to each node in the execution plan tree.
In the method for determining a degree of parallelism provided by this embodiment of the present invention, the execution plan tree of a distributed computing task is first obtained. Then, according to a preset cost model and the data statistics of the distributed computing task, a parallelism impact factor is determined for the operation corresponding to each node in the execution plan tree; this factor indicates how the operation influences the degree of parallelism of subsequent operations. Next, the initial degree of parallelism of the table scan operations in the execution plan tree is determined according to the preset cost model and the data statistics of the distributed computing task. Finally, the degree of parallelism of the operation corresponding to each node is determined from the initial degree of parallelism of the table scan operations and the parallelism impact factors, which yields the parallelism control scheme for the distributed computing task. This method avoids the drawbacks of fixed parallelism and of parallelism control schemes based on absolute values or reduce-side merging, improves the performance, stability and usability of the distributed computing engine, and makes parallelism control adaptive.
Embodiment 2
Fig. 3 is a flowchart of a method for determining a degree of parallelism provided by Embodiment 2 of the present invention. This embodiment is a more specific implementation based on the above embodiment, wherein:
Determining, according to the preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree is specifically: determining, according to the preset cost model and the data statistics of the distributed computing task, an initial parallelism impact factor for the operation corresponding to each node in the execution plan tree; and fitting each initial parallelism impact factor according to a preset mapping rule to obtain the parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Calculating, starting from the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factors is specifically: in post-order traversal order, taking the operation corresponding to one node at a time as the current operation; if the current operation is a table scan operation, calculating the degree of parallelism of the current operation from the initial degree of parallelism of the table scan operation and the parallelism impact factor of the current operation; if the current operation is not a table scan operation, determining the child nodes of the node corresponding to the current operation, and calculating the degree of parallelism of the current operation either from the degree of parallelism of the operations corresponding to the child nodes alone, or from that degree of parallelism together with the parallelism impact factor of the current operation; and returning to the step of taking the next operation in post-order traversal order as the current operation, until all operations in the execution plan tree have been processed.
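The post-order procedure above can be sketched in Python as follows. The description does not give the exact arithmetic, so this sketch makes several labeled assumptions: a degree of parallelism is obtained by multiplying a base value by the operation's impact factor and rounding up; an operation with several children uses the largest child value as its base; a factor of None means "use the child parallelism only"; and the result is never below 1. The class Op and the function assign_parallelism are hypothetical names.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Op:
    name: str                        # unique label, e.g. "ts1" or "cj1"
    is_scan: bool = False
    factor: Optional[float] = None   # None: use the child parallelism only
    initial_parallelism: int = 0     # only meaningful for table scans
    children: List["Op"] = field(default_factory=list)


def assign_parallelism(root: Op) -> Dict[str, int]:
    """Compute a degree of parallelism for every operation, children first."""
    dops: Dict[str, int] = {}

    def visit(op: Op) -> None:
        for child in op.children:    # post-order: visit children before op
            visit(child)
        if op.is_scan:
            base = op.initial_parallelism
        else:
            # Assumption: with several children, take the largest child value.
            base = max((dops[c.name] for c in op.children), default=1)
        scale = 1.0 if op.factor is None else op.factor
        # Assumption: multiply, round up, and never go below 1.
        dops[op.name] = max(1, math.ceil(base * scale))

    visit(root)
    return dops


# Hypothetical values, chosen only to make the arithmetic easy to follow.
ts1 = Op("ts1", is_scan=True, factor=1.0, initial_parallelism=100)
f1 = Op("f1", factor=0.25, children=[ts1])
ts3 = Op("ts3", is_scan=True, factor=1.0, initial_parallelism=40)
cj1 = Op("cj1", factor=1.0, children=[f1, ts3])
g1 = Op("g1", children=[cj1])        # no factor: inherits from its child
dops = assign_parallelism(g1)
```

Because children are always processed before their parent, each operation's base value is available by the time the operation itself is visited, which is why the traversal must be post-order.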
Further, after the degree of parallelism of the operation corresponding to each node in the execution plan tree has been calculated, the method for determining a degree of parallelism provided by this embodiment further includes:
adjusting, according to system preset parameters, the degree of parallelism of the operation corresponding to each node in the execution plan tree, to obtain the final degree of parallelism of the operation corresponding to each node in the execution plan tree.
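One possible form of this adjustment is sketched below. The description does not say which system preset parameters are used; the lower and upper bounds min_dop and max_dop, and the function name clamp_parallelism, are assumptions introduced purely for illustration.

```python
from typing import Dict


def clamp_parallelism(dops: Dict[str, int],
                      min_dop: int = 1,
                      max_dop: int = 200) -> Dict[str, int]:
    """Clamp each computed degree of parallelism into [min_dop, max_dop].

    min_dop and max_dop stand in for system preset parameters; which
    parameters a real system would use is not stated in the description.
    """
    if min_dop > max_dop:
        raise ValueError("min_dop must not exceed max_dop")
    return {name: min(max_dop, max(min_dop, dop)) for name, dop in dops.items()}


final_dops = clamp_parallelism({"ts1": 512, "cj1": 40, "g1": 0})
```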
As shown in Fig. 3, the method of this embodiment specifically includes:
S310: Obtain the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan.
S320: Determine, according to a preset cost model and the data statistics of the distributed computing task, the initial parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Taking the execution plan tree shown in Fig. 2 as an example, determining the initial parallelism impact factor of the operation corresponding to each node means determining the initial parallelism impact factor of the operation on each node of the tree, for example of the table scan (ts1:ss), table scan (ts2:i) and table scan (ts3:sr) operations.
As an optional implementation of this embodiment, determining the initial parallelism impact factor of the operation corresponding to each node in the execution plan tree according to the preset cost model and the data statistics of the distributed computing task includes at least one of the following:
if the operation belongs to a first operation type, calculating the initial parallelism impact factor of the operation according to the cost model and the data statistics of the distributed computing task;
if the operation belongs to a second operation type, using the baseline impact factor, determined according to the cost model and the data statistics of the distributed computing task, as the initial parallelism impact factor of the operation;
if the operation belongs to a third operation type, determining the child node of the node corresponding to the operation, and using the initial parallelism impact factor of the operation corresponding to the child node as the initial parallelism impact factor of the operation.
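The three cases above can be sketched as a simple dispatch on the operation type. The type labels, the function name and its parameters are hypothetical. The members of the first and second types follow the operation types named in this description; treating every other operation as the third type is an assumption of this sketch.

```python
from typing import Optional

# Type membership as named in the description; the string labels are hypothetical.
FIRST_TYPE = {"filter", "pre_aggregate"}
SECOND_TYPE = {"table_scan", "join", "aggregate", "union", "intersect", "difference"}


def initial_impact_factor(op_type: str,
                          estimated_rate: Optional[float] = None,
                          baseline: float = 1.0,
                          child_factor: Optional[float] = None) -> float:
    """Pick the initial parallelism impact factor for one operation.

    estimated_rate: value estimated from the cost model and statistics
                    (e.g. a filter's selectivity); used by the first type.
    baseline:       baseline impact factor; used by the second type.
    child_factor:   initial impact factor of the child operation; used by
                    the third type (assumed here to be every other type).
    """
    if op_type in FIRST_TYPE:
        if estimated_rate is None:
            raise ValueError("first-type operations need an estimated rate")
        return estimated_rate
    if op_type in SECOND_TYPE:
        return baseline
    if child_factor is None:
        raise ValueError("third-type operations need the child's factor")
    return child_factor


factor_filter = initial_impact_factor("filter", estimated_rate=0.1)
factor_join = initial_impact_factor("join")
factor_other = initial_impact_factor("project", child_factor=0.1)
```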
For the first operation type, the initial parallelism impact factor of the operation is calculated according to the cost model and the data statistics of the distributed computing task. The first operation type includes at least one of the following: filter operations and pre-aggregation operations.
First, basic statistics on the data related to the distributed computing task need to be collected in advance. Typical basic statistics include table type, partition and bucket information, table size, number of table rows, column maximum and minimum values, and the number of distinct values (NDV) of a column.
Let IPIF(f1) denote the initial parallelism impact factor of the filter (f1) operation. IPIF(f1) can be calculated from the cost model and the data statistics of the distributed computing task: its value is the selectivity of the filter (f1) operation, which can be estimated from the pre-collected statistics and the preset cost model. For an equality filter on a single column c1, Selectivity(c1) = 1/NDV(c1), where NDV(c1) is the number of distinct values in column c1.
Prepolymerization (pg1) operation initial degree of parallelism impact factor can be determined according to the aggregate rate of Aggregation field, In, the aggregate rate of Aggregation field is specifically obtained by statistical information and Cost Model estimation.
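The equality-filter estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name `equality_filter_ipif` is hypothetical and does not appear in the source, and a real cost model would handle other predicate types as well.

```python
def equality_filter_ipif(ndv: int) -> float:
    """Return the IPIF of an equality filter on a single column.

    Follows the formula in the text: Selectivity(c1) = 1 / NDV(c1),
    where NDV is the number of distinct values in the column.
    The function name is an illustrative assumption.
    """
    if ndv <= 0:
        raise ValueError("NDV must be a positive integer")
    return 1.0 / ndv
```

For a column with four distinct values the sketch yields an IPIF of 0.25, meaning the filter is expected to keep about a quarter of its input.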
For the second operation type, the baseline impact factor determined according to the cost model and the data statistics of the distributed computing task is used as the initial degree of parallelism impact factor of the operation. The second operation type includes at least one of the following: table scan, join, aggregation, union, intersection, and difference operations.
That is, in Fig. 2 the initial degree of parallelism impact factors of the table scan (ts1:ss), table scan (ts2:i), table scan (ts3:sr), join (cj1), and aggregation (g1) operations are all set to the baseline impact factor. For example, if the baseline impact factor determined according to the cost model and the data statistics of the distributed computing task is 1.0, the initial degree of parallelism impact factor of each of these operations is 1.0.
For the third operation type, first determine the child node of the node corresponding to the operation, and then use the initial degree of parallelism impact factor of the operation corresponding to that child node as the initial degree of parallelism impact factor of the operation. The third operation type includes at least one of the following: broadcast join, projection, sort, data redistribution, and default operations.
Taking the projection (pj4) operation as an example, the child node of its node is the node of the join (cj1) operation, so IPIF(pj4) = IPIF(cj1). Taking the projection (pj3:sr) operation as an example, the child node of its node is the node of the table scan (ts3:sr) operation, so IPIF(pj3:sr) = IPIF(ts3:sr) = baseline impact factor.
Taking the broadcast join (mj1) operation as an example, the broadcast-table child node of its node is the node of the projection (pj2) operation, and the fact-table child node is the node of the projection (pj1) operation. The initial degree of parallelism impact factor of the projection (pj2) operation on the broadcast-table child node is used as that of the broadcast join (mj1) operation, i.e. IPIF(mj1) = IPIF(broadcast table) = IPIF(pj2).
Table 1: Example degree of parallelism decision model for generating IPIF
Specifically, a degree of parallelism decision model for generating initial degree of parallelism impact factors, as shown in Table 1, can be established in advance, and the initial degree of parallelism impact factor of each operation in the execution plan tree is determined according to this model.
In the above technical solution, the initial degree of parallelism impact factor of each operation is calculated with the degree of parallelism decision model from the pre-collected statistics and the preset cost model. It represents the scale factor by which the operation on the current node influences the degree of parallelism of operations on subsequent nodes. Because the impact factor is a ratio rather than an absolute value, it also avoids the need for preset parameters and the inaccurate estimates that arise when the degree of parallelism is controlled on the basis of absolute values.
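The three-way rule for the initial impact factor can be sketched as a small recursive function. This is a minimal sketch reconstructed from the prose only, since the contents of Table 1 are not reproduced here. The `Node` class, the operation-kind strings, the `rate` and `broadcast_child` fields, and the constant `BASELINE` are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

BASELINE = 1.0  # assumed baseline impact factor (the example value in the text)

# Operation-kind names are illustrative labels, not identifiers from the source.
FIRST_TYPE = {"filter", "pre_aggregation"}
SECOND_TYPE = {"table_scan", "join", "aggregation",
               "union", "intersection", "difference"}


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)
    rate: Optional[float] = None   # selectivity or aggregation rate (first type)
    broadcast_child: int = 0       # index of the broadcast-table child (broadcast join)


def ipif(node: Node) -> float:
    """Initial degree of parallelism impact factor of one operation."""
    if node.kind in FIRST_TYPE:
        # First type: estimated from statistics and the cost model.
        if node.rate is None:
            raise ValueError("first-type operation needs an estimated rate")
        return node.rate
    if node.kind in SECOND_TYPE:
        # Second type: the baseline impact factor.
        return BASELINE
    # Third type (broadcast join, projection, sort, redistribution, default):
    # inherit the IPIF of the child; a broadcast join uses its broadcast-table child.
    if not node.children:
        raise ValueError("third-type operation needs a child node")
    index = node.broadcast_child if node.kind == "broadcast_join" else 0
    return ipif(node.children[index])
```

With this sketch, a projection over a table scan returns the baseline, and a broadcast join whose broadcast-table side is a projection over a filter with rate 0.1 returns 0.1, matching IPIF(mj1) = IPIF(pj2) in the text.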
S330: Fit each initial degree of parallelism impact factor according to a preset mapping rule to obtain the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
After the initial degree of parallelism impact factors are obtained, they must be fitted to produce the degree of parallelism impact factors (PIF). This smooths the drop in the ratio between the numbers of mappers and reducers and prevents the degree of parallelism from being amplified or reduced excessively.
For example, the initial degree of parallelism impact factor IPIF(f2) is the selectivity of the filter (f2) operation. Suppose filter (f2) is an equality filter on column c1; then IPIF(f2) = Selectivity = 1/NDV(c1). If NDV(c1) is very large and no fitting is applied, the number of reducers shrinks to 1/NDV(c1) of the number of mappers, leaving too few reducers.
As an optional implementation of this embodiment, fitting each initial degree of parallelism impact factor according to the preset mapping rule to obtain the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree may specifically be: adjusting each initial degree of parallelism impact factor into a preset interval according to the preset mapping rule, and using each adjusted result as the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
That is, the fitting can be implemented with a mapping rule. For example, minIPIF and maxIPIF are set to define the minimum and maximum values of IPIF, so the preset interval is [minIPIF, maxIPIF]; the corresponding mapping rule can then be as shown in Table 2. The degree of parallelism impact factors of the operations in the execution plan tree of Fig. 2 are shown in Fig. 4. For example, "filter (f2) PIF (0.1)" in Fig. 4 indicates that the degree of parallelism impact factor PIF of the filter (f2) operation is 0.1.
Table 2: Mapping rules for fitting the initial degree of parallelism impact factor
Mapping rule | IPIF | PIF
R1 | IPIF ≤ minIPIF | minIPIF
R2 | minIPIF < IPIF ≤ maxIPIF | IPIF
R3 | IPIF > maxIPIF | maxIPIF
The above method of fitting the initial degree of parallelism impact factor is only one optional implementation of this embodiment; other methods may also be used to fit the initial degree of parallelism impact factor, and this embodiment places no specific limitation on this.
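The mapping rules R1 to R3 amount to clamping IPIF into the interval [minIPIF, maxIPIF]. The sketch below illustrates this; the function name `fit_ipif` is hypothetical, and the bounds 0.1 and 2.0 used in the usage lines are illustrative values that the source does not specify.

```python
def fit_ipif(ipif: float, min_ipif: float, max_ipif: float) -> float:
    """Map an initial impact factor (IPIF) to a fitted impact factor (PIF)."""
    if min_ipif > max_ipif:
        raise ValueError("min_ipif must not exceed max_ipif")
    if ipif <= min_ipif:   # rule R1
        return min_ipif
    if ipif > max_ipif:    # rule R3
        return max_ipif
    return ipif            # rule R2


# Hypothetical bounds, chosen only for illustration.
MIN_IPIF, MAX_IPIF = 0.1, 2.0
pif_small = fit_ipif(1e-6, MIN_IPIF, MAX_IPIF)  # very selective filter, raised to 0.1
pif_mid = fit_ipif(0.5, MIN_IPIF, MAX_IPIF)     # inside the interval, unchanged
```

Clamping keeps a highly selective filter, such as an equality filter on a column with a very large NDV, from reducing the reducer count to a tiny fraction of the mapper count.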
S340: According to the preset cost model and the data statistics of the distributed computing task, determine the initial degree of parallelism of the table scan operation.
Specifically, the initial degree of parallelism of the table scan operation can be estimated according to the preset cost model and the amount of data processed on the map side of the distributed computing task.
Assuming that the output of the cost model is cost and the amount of data processed by a single map task is dataSizePerMap, the initial degree of parallelism IP of the table scan operation can be calculated as IP = cost / dataSizePerMap.
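The formula IP = cost / dataSizePerMap can be sketched as follows. The source does not say how a fractional result is handled, so rounding up and enforcing a minimum of 1 are assumptions made here for illustration, as is the function name.

```python
import math


def initial_scan_parallelism(cost: float, data_size_per_map: float) -> int:
    """Initial degree of parallelism (IP) of a table scan.

    Implements IP = cost / dataSizePerMap from the text. Rounding up and
    the floor of 1 are assumptions; the source does not specify rounding.
    """
    if data_size_per_map <= 0:
        raise ValueError("data_size_per_map must be positive")
    return max(1, math.ceil(cost / data_size_per_map))
```

For example, if the cost model reports a cost of 1000 units and each map task handles 256 units, the sketch yields four map tasks.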
S350: In post-order traversal order, take the operation corresponding to the next node as the current operation and judge whether the current operation is a table scan operation. If so, execute S360; if not, execute S370.
After the initial degree of parallelism of the table scan operation is determined, the execution plan tree is traversed in post-order to determine the degree of parallelism of the operation on each node. The degree of parallelism of a table scan operation depends directly on the initial degree of parallelism of the table scan operation, whereas that of a non-table-scan operation does not. The method therefore distinguishes whether the current operation is a table scan operation in order to decide how to calculate its degree of parallelism.
S360: Calculate the degree of parallelism of the current operation according to the initial degree of parallelism of the table scan operation and the degree of parallelism impact factor of the current operation, then execute S380.
Specifically, the degree of parallelism of a table scan operation is the product of its initial degree of parallelism and its degree of parallelism impact factor.
S370: Determine the child node of the node corresponding to the current operation, and calculate the degree of parallelism of the current operation according to the degree of parallelism of the operation corresponding to the child node, or according to that degree of parallelism together with the degree of parallelism impact factor of the current operation; then execute S380.
As an optional implementation of this embodiment, calculating the degree of parallelism of the current operation according to the degree of parallelism of the operation corresponding to the child node, or according to that degree of parallelism together with the degree of parallelism impact factor of the current operation, is specifically:
If the current operation belongs to the fourth operation type, use the product of the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation as the degree of parallelism of the current operation;
If the current operation belongs to the fifth operation type, use the degree of parallelism of the operation corresponding to the child node as the degree of parallelism of the current operation;
If the current operation belongs to the sixth operation type, use either the sum of the degrees of parallelism of the operations corresponding to all child nodes, or the maximum of those degrees of parallelism, as the degree of parallelism of the current operation.
For the fourth operation type, the product of the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation is used as the degree of parallelism of the current operation. The fourth operation type includes at least one of the following: filter, broadcast join, and pre-aggregation operations.
Referring to Fig. 2, suppose the current operation is the filter (f1) operation. The operation on the child node of its node is the table scan (ts1:ss) operation, so the product of the degree of parallelism of the table scan (ts1:ss) operation and the degree of parallelism impact factor of the filter (f1) operation is used as the degree of parallelism of the filter (f1) operation. With P(f1) denoting the degree of parallelism of the filter (f1) operation, P(f1) = P(ts1:ss) × PIF(f1).
Suppose instead the current operation is the broadcast join (mj1) operation. The broadcast-table child node of its node is the node of the projection (pj2) operation, and the fact-table child node is the node of the projection (pj1) operation. The product of the degree of parallelism of the projection (pj1) operation on the fact-table child node and the degree of parallelism impact factor of the broadcast join (mj1) operation is used as the degree of parallelism of the broadcast join (mj1) operation: P(mj1) = P(pj1) × PIF(mj1).
For the fifth operation type, the degree of parallelism of the operation corresponding to the child node is used as the degree of parallelism of the current operation. The fifth operation type includes at least one of the following: aggregation, projection, sort, and data redistribution operations.
For example, if the current operation is the aggregation (g1) operation, the operation on the child node of its node is the data redistribution (rs3) operation, so the degree of parallelism of the aggregation (g1) operation equals that of the data redistribution (rs3) operation: P(g1) = P(rs3).
For the sixth operation type, either the sum of the degrees of parallelism of the operations corresponding to all child nodes, or the maximum of those degrees of parallelism, is used as the degree of parallelism of the current operation. The sixth operation type includes at least one of the following: join, union, intersection, difference, and default operations.
For example, if the current operation is the join (cj1) operation, the operations on the child nodes of its node are the data redistribution (rs1) and data redistribution (rs2) operations, so the larger of their degrees of parallelism is used as the degree of parallelism of the join (cj1) operation: P(cj1) = MAX(P(rs1), P(rs2)).
Specifically, a degree of parallelism decision model for generating degrees of parallelism, as shown in Table 3, can be constructed in advance, and the degree of parallelism of each operation in the execution plan tree is determined according to this model. According to this model, the calculated degrees of parallelism of the operations in the execution plan tree shown in Fig. 3 are shown in Fig. 5. For example, "filter (f2) PIF (0.1) P (1)" in Fig. 5 indicates that the degree of parallelism impact factor PIF of the filter (f2) operation is 0.1 and its degree of parallelism P is 1.
Table 3: Example degree of parallelism decision model for generating the degree of parallelism
S380: Judge whether all operations in the execution plan tree have been processed. If not, return to S350; if so, execute S390.
The degree of parallelism of each operation in the execution plan tree is determined in turn according to the above method until all operations have been processed.
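Steps S350 to S380 can be sketched as a recursive post-order traversal. This is a minimal sketch reconstructed from the prose, since Table 3 is not reproduced here. The `PlanNode` class, the operation-kind strings, the `fact_child` field, and the per-scan dictionary `scan_ip` are illustrative assumptions, and the sketch leaves degrees of parallelism as unrounded floats because the source does not state a rounding rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Operation-kind names are illustrative labels, not identifiers from the source.
FOURTH_TYPE = {"filter", "broadcast_join", "pre_aggregation"}
FIFTH_TYPE = {"aggregation", "projection", "sort", "redistribution"}


@dataclass
class PlanNode:
    name: str
    kind: str
    pif: float = 1.0                      # fitted impact factor (PIF)
    children: List["PlanNode"] = field(default_factory=list)
    fact_child: int = 0                   # fact-table child index (broadcast join)
    parallelism: float = 0.0              # filled in by assign_parallelism


def assign_parallelism(node: PlanNode,
                       scan_ip: Dict[str, float],
                       combine: Callable[[List[float]], float] = max) -> float:
    """Compute P for every node; children are processed first (post-order)."""
    child_p = [assign_parallelism(c, scan_ip, combine) for c in node.children]
    if node.kind == "table_scan":
        # S360: P = IP x PIF
        node.parallelism = scan_ip[node.name] * node.pif
    elif node.kind in FOURTH_TYPE:
        # P = P(child) x PIF; a broadcast join uses its fact-table child.
        node.parallelism = child_p[node.fact_child] * node.pif
    elif node.kind in FIFTH_TYPE:
        # P = P(child)
        node.parallelism = child_p[0]
    else:
        # Sixth type (join, union, intersection, difference, default):
        # maximum or sum over all children, selected via `combine`.
        node.parallelism = combine(child_p)
    return node.parallelism
```

Passing `combine=sum` selects the cumulative-sum variant of the sixth-type rule, while the default `max` reproduces P(cj1) = MAX(P(rs1), P(rs2)).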
S390: Adjust the degree of parallelism of the operation corresponding to each node in the execution plan tree according to system preset parameters to obtain the final degree of parallelism of the operation corresponding to each node.
Further, after the degrees of parallelism are determined, the final degree of parallelism of each operation on the execution plan tree can be determined in combination with the system environment and preset parameters. These preset parameters may include a fixed system degree of parallelism under specific conditions, a system minimum degree of parallelism (SPMIN), and so on. For example, if a system minimum degree of parallelism SPMIN is preset, the final degree of parallelism should be MAX(P, SPMIN).
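The SPMIN adjustment can be sketched as below. Only the MAX(P, SPMIN) rule is taken from the text; the function name and the convention of passing `None` when no minimum is preset are assumptions, and other preset parameters such as a fixed system degree of parallelism are left out because the source does not say how they are applied.

```python
from typing import Optional


def final_parallelism(p: float, sp_min: Optional[float] = None) -> float:
    """Apply the system minimum degree of parallelism, if one is preset.

    Implements final = MAX(P, SPMIN) from the text; with no SPMIN preset
    (sp_min is None, an assumed convention) P is returned unchanged.
    """
    if sp_min is None:
        return p
    return max(p, sp_min)
```

With SPMIN set to 4, an operation whose computed degree of parallelism is 1 is raised to 4, while one computed as 8 keeps its value.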
In summary, once the degree of parallelism decision model has been built according to the method for determining the degree of parallelism provided by the embodiments of the present invention, taking the pre-collected data statistics of the distributed computing task, the execution plan of the distributed computing task, and the preset cost model as input yields an execution plan that carries optimized degrees of parallelism.
In the above technical solution, the degree of parallelism impact factor PIF of each type of task is calculated from data statistics such as selectivity and aggregation rate together with the preset cost model, and serves as a scale factor measuring the influence on the degree of parallelism of subsequent operations; the final degree of parallelism P of each task is then calculated from the PIF information. Controlling parallelism with the degrees of parallelism determined according to the embodiments of the present invention avoids the drawbacks of control schemes that fix the degree of parallelism or merge parallelism based on absolute values or specifications, and improves the performance, stability, and availability of the distributed computing engine.
Embodiment three
Fig. 6 is a schematic structural diagram of an apparatus for determining a degree of parallelism provided by Embodiment 3 of the present invention. It is applicable to degree of parallelism control in distributed computing tasks. The apparatus can be implemented in software and/or hardware and can generally be integrated in a processor.
As shown in Fig. 6, the apparatus specifically includes: an execution plan tree acquisition module 610, a degree of parallelism impact factor determining module 620, a table scan initial degree of parallelism determining module 630, and a degree of parallelism determining module 640, wherein:
The execution plan tree acquisition module 610 is configured to obtain the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
The degree of parallelism impact factor determining module 620 is configured to determine, according to a preset cost model and the data statistics of the distributed computing task, the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree;
The table scan initial degree of parallelism determining module 630 is configured to determine the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task;
The degree of parallelism determining module 640 is configured to calculate, in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the initial degree of parallelism of the table scan operation and the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
The apparatus for determining a degree of parallelism provided by this embodiment first obtains the execution plan tree of the distributed computing task. It then determines, according to the preset cost model and the data statistics of the distributed computing task, the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree, which indicates the influence on the degree of parallelism of subsequent operations. Next, it determines the initial degree of parallelism of the table scan operation in the execution plan tree according to the preset cost model and the data statistics of the distributed computing task. Finally, it determines the degree of parallelism of the operation corresponding to each node according to the initial degree of parallelism of the table scan operation and the degree of parallelism impact factors, thereby obtaining a degree of parallelism control scheme for the distributed computing task. This approach avoids the drawbacks of control schemes that fix the degree of parallelism or merge parallelism based on absolute values or specifications, improves the performance, stability, and availability of the distributed computing engine, and makes degree of parallelism control adaptive.
Further, the degree of parallelism impact factor determining module 620 specifically includes an initial degree of parallelism impact factor determining unit and an initial degree of parallelism impact factor fitting unit, wherein:
The initial degree of parallelism impact factor determining unit is configured to determine, according to the preset cost model and the data statistics of the distributed computing task, the initial degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree;
The initial degree of parallelism impact factor fitting unit is configured to fit each initial degree of parallelism impact factor according to a preset mapping rule to obtain the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Further, the initial degree of parallelism impact factor determining unit is configured to perform at least one of the following:
If the operation belongs to the first operation type, calculate the initial degree of parallelism impact factor of the operation according to the cost model and the data statistics of the distributed computing task;
If the operation belongs to the second operation type, use the baseline impact factor determined according to the cost model and the data statistics of the distributed computing task as the initial degree of parallelism impact factor of the operation;
If the operation belongs to the third operation type, determine the child node of the node corresponding to the operation, and use the initial degree of parallelism impact factor of the operation corresponding to the child node as the initial degree of parallelism impact factor of the operation.
Specifically, the first operation type includes at least one of the following: filter operations and pre-aggregation operations;
The second operation type includes at least one of the following: table scan, join, aggregation, union, intersection, and difference operations;
The third operation type includes at least one of the following: broadcast join, projection, sort, data redistribution, and default operations.
Further, the initial degree of parallelism impact factor fitting unit is specifically configured to adjust each initial degree of parallelism impact factor into a preset interval according to the preset mapping rule, and to use each adjusted result as the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Further, the degree of parallelism determining module 640 specifically includes:
An acquiring unit, configured to take, in post-order traversal order, the operation corresponding to the next node as the current operation;
A first computing unit, configured to calculate, if the current operation is a table scan operation, the degree of parallelism of the current operation according to the initial degree of parallelism of the table scan operation and the degree of parallelism impact factor of the current operation;
A second computing unit, configured to determine, if the current operation is not a table scan operation, the child node of the node corresponding to the current operation, and to calculate the degree of parallelism of the current operation according to the degree of parallelism of the operation corresponding to the child node, or according to that degree of parallelism together with the degree of parallelism impact factor of the current operation;
A loop unit, configured to return to taking, in post-order traversal order, the operation corresponding to the next node as the current operation, until all operations in the execution plan tree have been processed.
Further, the second computing unit specifically includes a first subunit, a second subunit, and a third subunit, wherein:
The first subunit is configured to use, if the current operation belongs to the fourth operation type, the product of the degree of parallelism of the operation corresponding to the child node and the degree of parallelism impact factor of the current operation as the degree of parallelism of the current operation;
The second subunit is configured to use, if the current operation belongs to the fifth operation type, the degree of parallelism of the operation corresponding to the child node as the degree of parallelism of the current operation;
The third subunit is configured to use, if the current operation belongs to the sixth operation type, either the sum of the degrees of parallelism of the operations corresponding to all child nodes, or the maximum of those degrees of parallelism, as the degree of parallelism of the current operation.
Further, the fourth operation type includes at least one of the following: filter, broadcast join, and pre-aggregation operations;
The fifth operation type includes at least one of the following: aggregation, projection, sort, and data redistribution operations;
The sixth operation type includes at least one of the following: join, union, intersection, difference, and default operations.
Further, the table scan initial degree of parallelism determining module 630 is specifically configured to calculate the initial degree of parallelism of the table scan operation according to the preset cost model and the amount of data processed on the map side of the distributed computing task.
Further, the apparatus also includes a degree of parallelism adjusting module, configured to adjust, after the degree of parallelism of the operation corresponding to each node in the execution plan tree has been calculated, that degree of parallelism according to system preset parameters to obtain the final degree of parallelism of the operation corresponding to each node in the execution plan tree.
The above apparatus can execute the method for determining a degree of parallelism provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to executing that method.
Embodiment four
Fig. 7 is a schematic diagram of the hardware structure of a device provided by Embodiment 4 of the present invention. As shown in Fig. 7, the device includes:
One or more processors 710 (one processor 710 is taken as an example in Fig. 7);
A memory 720;
The device may further include an input device 730 and an output device 740.
The processor 710, memory 720, input device 730, and output device 740 in the device may be connected by a bus or by other means; connection by a bus is taken as an example in Fig. 7.
As a non-transitory computer-readable storage medium, the memory 720 can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for determining a degree of parallelism in the embodiments of the present invention (for example, the execution plan tree acquisition module 610, the degree of parallelism impact factor determining module 620, the table scan initial degree of parallelism determining module 630, and the degree of parallelism determining module 640 shown in Fig. 6). By running the software programs, instructions, and modules stored in the memory 720, the processor 710 executes the various functional applications and data processing of the computer device, that is, implements the method for determining a degree of parallelism of the above method embodiments.
The memory 720 may include a program storage area and a data storage area. The program storage area can store an operating system and the application programs required by at least one function; the data storage area can store data created through use of the computer device, and so on. In addition, the memory 720 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 720 optionally includes memory located remotely from the processor 710, and such remote memory can be connected to the terminal device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the computer device. The output device 740 may include a display device such as a display screen.
Embodiment five
Embodiment 5 of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a method for determining a degree of parallelism, the method comprising:
Obtaining the execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
Determining, according to a preset cost model and the data statistics of the distributed computing task, the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree;
Determining the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task;
Calculating, in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the initial degree of parallelism of the table scan operation and the degree of parallelism impact factor of the operation corresponding to each node in the execution plan tree.
Optionally, when executed by a computer processor, the computer-executable instructions can also be used to carry out the technical solution of the method for determining a degree of parallelism provided by any embodiment of the present invention.
By the description above with respect to embodiment, it is apparent to those skilled in the art that, the present invention It can be realized by software and required common hardware, naturally it is also possible to which by hardware realization, but in many cases, the former is more Good embodiment.Based on this understanding, technical solution of the present invention substantially in other words contributes to the prior art Part can be embodied in the form of software products, which can store in computer readable storage medium In, floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random such as computer Access Memory, RAM), flash memory (FLASH), hard disk or CD etc., including some instructions are with so that a computer is set Standby (can be personal computer, server or the network equipment etc.) executes method described in each embodiment of the present invention.
It is worth noting that, included each unit and module are only in the embodiment of the determining device of above-mentioned degree of parallelism It is to be divided according to the functional logic, but be not limited to the above division, as long as corresponding functions can be realized;Separately Outside, the specific name of each functional unit is also only for convenience of distinguishing each other, the protection scope being not intended to restrict the invention.
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, it is not limited to them; it may include other equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.

Claims (13)

1. A method for determining a degree of parallelism, characterized by comprising:
obtaining an execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
determining, according to a preset cost model and data statistics of the distributed computing task, a parallelism impact factor of the operation corresponding to each node in the execution plan tree;
determining an initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task;
calculating, based on the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factor of the operation corresponding to that node.
2. The method according to claim 1, characterized in that determining, according to the preset cost model and the data statistics of the distributed computing task, the parallelism impact factor of the operation corresponding to each node in the execution plan tree comprises:
determining, according to the preset cost model and the data statistics of the distributed computing task, an initial parallelism impact factor of the operation corresponding to each node in the execution plan tree;
fitting each initial parallelism impact factor according to a preset mapping rule to obtain the parallelism impact factor of the operation corresponding to each node in the execution plan tree.
3. The method according to claim 2, characterized in that determining, according to the preset cost model and the data statistics of the distributed computing task, the initial parallelism impact factor of the operation corresponding to each node in the execution plan tree comprises at least one of the following:
if the operation belongs to a first operation type, calculating the initial parallelism impact factor of the operation according to the cost model and the data statistics of the distributed computing task;
if the operation belongs to a second operation type, taking a baseline impact factor, determined according to the cost model and the data statistics of the distributed computing task, as the initial parallelism impact factor of the operation;
if the operation belongs to a third operation type, determining the child node of the node corresponding to the operation, and taking the initial parallelism impact factor of the operation corresponding to the child node as the initial parallelism impact factor of the operation.
4. The method according to claim 3, characterized in that the first operation type comprises at least one of the following: a filter operation and a pre-aggregation operation;
the second operation type comprises at least one of the following: a table scan operation, a join operation, an aggregation operation, a union operation, an intersection operation and a difference operation;
the third operation type comprises at least one of the following: a broadcast join operation, a projection operation, a sort operation, a data redistribution operation and a default operation.
5. The method according to claim 2, characterized in that fitting each initial parallelism impact factor according to the preset mapping rule to obtain the parallelism impact factor of the operation corresponding to each node in the execution plan tree comprises:
adjusting each initial parallelism impact factor into a preset interval range according to the preset mapping rule, and taking each adjusted result as the parallelism impact factor of the operation corresponding to the respective node in the execution plan tree.
6. The method according to claim 1, characterized in that calculating, based on the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factor of the operation corresponding to that node comprises:
obtaining, in post-order traversal order, the operation corresponding to one node as the current operation;
if the current operation is a table scan operation, calculating the degree of parallelism of the current operation according to the initial degree of parallelism of the table scan operation and the parallelism impact factor of the current operation;
if the current operation is not a table scan operation, determining the child node of the node corresponding to the current operation, and calculating the degree of parallelism of the current operation either according to the degree of parallelism of the operation corresponding to the child node, or according to the degree of parallelism of the operation corresponding to the child node together with the parallelism impact factor of the current operation;
returning to the step of obtaining, in post-order traversal order, the operation corresponding to one node as the current operation, until all operations in the execution plan tree have been processed.
7. The method according to claim 6, characterized in that calculating the degree of parallelism of the current operation either according to the degree of parallelism of the operation corresponding to the child node, or according to the degree of parallelism of the operation corresponding to the child node together with the parallelism impact factor of the current operation, comprises:
if the current operation belongs to a fourth operation type, taking the product of the degree of parallelism of the operation corresponding to the child node and the parallelism impact factor of the current operation as the degree of parallelism of the current operation;
if the current operation belongs to a fifth operation type, taking the degree of parallelism of the operation corresponding to the child node as the degree of parallelism of the current operation;
if the current operation belongs to a sixth operation type, taking either the sum of the degrees of parallelism of the operations corresponding to all child nodes, or the maximum of those degrees of parallelism, as the degree of parallelism of the current operation.
8. The method according to claim 7, characterized in that the fourth operation type comprises at least one of the following: a filter operation, a broadcast join operation and a pre-aggregation operation;
the fifth operation type comprises at least one of the following: an aggregation operation, a projection operation, a sort operation and a data redistribution operation;
the sixth operation type comprises at least one of the following: a join operation, a union operation, an intersection operation, a difference operation and a default operation.
9. The method according to claim 1, characterized in that determining the initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task comprises:
calculating the initial degree of parallelism of the table scan operation according to the preset cost model and the amount of data processed on the map side of the distributed computing task.
10. The method according to any one of claims 1-9, characterized in that, after calculating the degree of parallelism of the operation corresponding to each node in the execution plan tree, the method further comprises:
adjusting the degree of parallelism of the operation corresponding to each node in the execution plan tree according to system preset parameters, to obtain the final degree of parallelism of the operation corresponding to each node in the execution plan tree.
11. An apparatus for determining a degree of parallelism, characterized by comprising:
an execution plan tree obtaining module, configured to obtain an execution plan tree of a distributed computing task, wherein the root node of the execution plan tree corresponds to the output operation in the execution plan, and at least one leaf node of the execution plan tree corresponds to a table scan operation in the execution plan;
a parallelism impact factor determining module, configured to determine, according to a preset cost model and data statistics of the distributed computing task, a parallelism impact factor of the operation corresponding to each node in the execution plan tree;
a table scan initial parallelism determining module, configured to determine an initial degree of parallelism of the table scan operation according to the preset cost model and the data statistics of the distributed computing task;
a parallelism determining module, configured to calculate, based on the initial degree of parallelism of the table scan operation and in post-order traversal order, the degree of parallelism of the operation corresponding to each node in the execution plan tree according to the parallelism impact factor of the operation corresponding to that node.
12. A device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1-10 when executing the program.
13. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-10.
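As an illustration of the factor determination recited in claims 2 to 5, the sketch below assigns an initial parallelism impact factor per operation type and then maps it into a preset interval. It is only a sketch under stated assumptions: the `PlanNode` class, the `rows_in` and `rows_out` statistics fields, the output/input row ratio used for the first operation type, the baseline value of 1.0, the interval [0.1, 1.0] and the use of clamping as the mapping rule are all hypothetical, since the claims do not fix these details.

```python
from dataclasses import dataclass, field
from typing import List

# Groupings follow claim 4; the string names are hypothetical labels.
COMPUTED = {"filter", "pre_aggregate"}  # first type
BASELINE = {"scan", "join", "aggregate", "union", "intersect", "except"}  # second type
# Everything else (broadcast_join, project, sort, redistribute, default)
# is the third type and inherits the factor of its child.


@dataclass
class PlanNode:
    op: str
    rows_in: int = 0   # hypothetical statistics: rows entering the operation
    rows_out: int = 0  # hypothetical statistics: rows leaving the operation
    children: List["PlanNode"] = field(default_factory=list)
    initial_factor: float = 1.0
    factor: float = 1.0


def fit_to_range(value: float, low: float = 0.1, high: float = 1.0) -> float:
    """Assumed mapping rule: clamp the value into [low, high]."""
    return min(high, max(low, value))


def assign_factors(node: PlanNode, baseline: float = 1.0) -> None:
    """Fill initial_factor and factor for every node, children first."""
    for child in node.children:
        assign_factors(child, baseline)

    if node.op in COMPUTED and node.rows_in > 0:
        # Assumption: selectivity (output rows / input rows) is the factor.
        node.initial_factor = node.rows_out / node.rows_in
    elif node.op in BASELINE or not node.children:
        node.initial_factor = baseline
    else:
        # Third type, or a first-type node without usable statistics.
        node.initial_factor = node.children[0].initial_factor
    node.factor = fit_to_range(node.initial_factor)


scan = PlanNode("scan")
flt = PlanNode("filter", rows_in=1000, rows_out=20, children=[scan])
proj = PlanNode("project", children=[flt])

assign_factors(proj)
print(scan.factor, flt.initial_factor, flt.factor, proj.factor)
```

Here the filter keeps 20 of 1000 rows, so its initial factor is 0.02; the assumed interval raises it to 0.1, and the projection above it inherits the same value. Clamping keeps a very selective filter from driving the downstream degree of parallelism toward zero, which is one plausible reason for adjusting the factor into a preset interval.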

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811436295.1A CN109558232B (en) 2018-11-28 2018-11-28 Determination method, apparatus, equipment and the medium of degree of parallelism

Publications (2)

Publication Number Publication Date
CN109558232A true CN109558232A (en) 2019-04-02
CN109558232B CN109558232B (en) 2019-08-23

Family

ID=65867926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811436295.1A Active CN109558232B (en) 2018-11-28 2018-11-28 Determination method, apparatus, equipment and the medium of degree of parallelism

Country Status (1)

Country Link
CN (1) CN109558232B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514229A (en) * 2012-06-29 2014-01-15 国际商业机器公司 Method and device used for processing database data in distributed database system
US20140282585A1 (en) * 2013-03-13 2014-09-18 Barracuda Networks, Inc. Organizing File Events by Their Hierarchical Paths for Multi-Threaded Synch and Parallel Access System, Apparatus, and Method of Operation
US20150193270A1 (en) * 2014-01-06 2015-07-09 International Business Machines Corporation Constructing a logical tree topology in a parallel computer
CN107025273A (en) * 2017-03-17 2017-08-08 南方电网科学研究院有限责任公司 The optimization method and device of a kind of data query
CN108319722A (en) * 2018-02-27 2018-07-24 北京小度信息科技有限公司 Data access method, device, electronic equipment and computer readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112187285A (en) * 2020-09-18 2021-01-05 中科院计算技术研究所南京移动通信与计算创新研究院 Processing method of barrel shifter based on DVB-S2 decoder and barrel shifter
CN112187285B (en) * 2020-09-18 2024-02-27 南京中科晶上通信技术有限公司 Barrel shifter processing method based on DVB-S2 decoder and barrel shifter
CN113535354A (en) * 2021-06-30 2021-10-22 深圳市云网万店电子商务有限公司 Method and device for adjusting parallelism of Flink SQL operator
WO2024078080A1 (en) * 2022-10-14 2024-04-18 华为技术有限公司 Database query method and apparatus, and device and medium

Also Published As

Publication number Publication date
CN109558232B (en) 2019-08-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee after: Star link information technology (Shanghai) Co.,Ltd.

Address before: 200233 11-12 / F, building B, 88 Hongcao Road, Xuhui District, Shanghai

Patentee before: TRANSWARP TECHNOLOGY (SHANGHAI) Co.,Ltd.