CN110413389A - A task scheduling optimization method for a resource-unbalanced Spark environment

Info

Publication number: CN110413389A
Application number: CN201910669809.6A
Authority: CN
Inventors: 胡亚红; 盛夏; 毛家发; 吴寅超; 邱圆圆
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-07-24
Filing date: 2019-07-24
Publication date: 2019-11-05
Anticipated expiration: 2039-07-24
Also published as: CN110413389B

Abstract

The present invention relates to a task scheduling optimization method for a resource-unbalanced Spark environment. The invention optimizes the underlying scheduling algorithm of Spark and proposes the Spark Dynamic Adaptive Scheduling Algorithm (SDASA), which is based on node priority. SDASA uses the priority of a node to represent its computing capability and updates that priority in real time while tasks are running, fully taking into account node heterogeneity, resource utilization, and load. Experiments show that SDASA improves the operating efficiency of the Spark system and shortens job execution time. When executing the same kind of task with different data volumes, cluster performance improves by 6.99% on average with the SDASA algorithm; when executing different kinds of tasks, cluster performance improves by 6.32% on average.

Description

A task scheduling optimization method for a resource-unbalanced Spark environment

Technical field

The present invention relates to the field of big data processing, and in particular to a task scheduling optimization method for a resource-unbalanced Spark environment.

Background art

As large data centers, supercomputing centers, and Internet companies upgrade their hardware and introduce high-performance units (such as GPUs), the nodes in a cluster gradually become heterogeneous. Differences among compute nodes in CPU, memory, and other attributes lead to differences in their processing capability. The overall computing capability of the nodes therefore varies considerably, and the cluster as a whole is in a resource-unbalanced state. Because the capability of each node differs, assigning the same task to different nodes affects node load differently. Spark's default task scheduling is based on the idealized assumption that cluster nodes are homogeneous. It does not consider cluster heterogeneity or changes in node resource utilization and load, and therefore cannot meet requirements such as system efficiency and load balancing when resources are heterogeneous.

Current research on task scheduling under parallel frameworks focuses mainly on the Hadoop platform; research on task scheduling algorithms for resource-unbalanced Spark environments is comparatively scarce. One adaptive task scheduling method improves cluster performance by monitoring node load and resource utilization. However, that algorithm does not consider the influencing factors comprehensively enough, and its weights depend too heavily on preset thresholds, which makes it subjective. Some task scheduling optimization algorithms based on artificial intelligence and biologically inspired methods, such as ant colony algorithms and genetic algorithms, can perform multi-objective optimization, but their principles are relatively complex and they are computationally expensive to run, so their scheduling efficiency is low. Therefore, to improve the performance of Spark in a resource-unbalanced environment, an efficient task scheduling algorithm is needed.

Summary of the invention

To overcome the above shortcomings, the object of the present invention is to provide a task scheduling optimization method for a resource-unbalanced Spark environment. By analyzing the computing capability of each node in the cluster, the invention optimizes the underlying scheduling algorithm of Spark and proposes the Spark Dynamic Adaptive Scheduling Algorithm (SDASA), which is based on node priority. SDASA fully considers node heterogeneity, resource utilization, and load, and can improve the operating efficiency of the Spark system and shorten job execution time.

The present invention achieves the above object through the following technical solution: a task scheduling optimization method for a resource-unbalanced Spark environment, comprising the following steps:

(1) Screen the static factors and dynamic factors that influence node priority, establish a node priority evaluation index system, and calculate the weight of each index;

(2) Deploy the distributed cluster resource monitoring system Ganglia in the cluster; when the cluster starts, monitoring is triggered and the heartbeat starts;

(3) When the cluster is established or a new node joins the cluster, the Master node calculates the static performance index value of each Slave node, or of the newly added node;

(4) The Master node calculates the dynamic performance index value of each Slave node;

(5) The Master node calculates the priority of each Slave node;

(6) The Master node reads the priority of each Slave node and sorts the nodes by priority value;

(7) The Master node selects Slave nodes according to the sorting result, traverses the selected nodes, and assigns each task to be run to the Slave node with the highest degree of locality;

(8) If task execution has finished, return the task execution result; otherwise return to step (3).

Preferably, step (1) is specifically as follows:

(1.1) Use principal component analysis to determine that the static factors of a node are its CPU speed, number of CPU cores, memory size, and disk capacity;

(1.2) Use principal component analysis to determine that the dynamic factors of a node are its CPU surplus ratio, memory surplus ratio, disk capacity surplus ratio, and CPU load;

(1.3) Based on the analysis results of steps (1.1) and (1.2), establish the node priority evaluation index system and assess the importance of each index;

(1.4) Obtain the weight of each static factor and dynamic factor using the analytic hierarchy process (AHP).

Preferably, step (3) is specifically as follows:

(3.1) Each Slave node uses the Ganglia cluster resource monitoring system to obtain its own static factor values, including CPU speed s_{cpu_speed}, number of CPU cores s_{cpu_num}, memory size s_mem, and disk capacity s_disk;

(3.2) The Slave node sends the collected data to the Master node by unicast;

(3.3) The Master node calculates the static performance index S_i of the i-th Slave node using formula (1), where i = 1 to h and h is the number of Slave nodes in the cluster;

where n₁, n₂, n₃, n₄ are the weights of the static factors CPU speed, number of CPU cores, memory size, and disk capacity respectively, with n₁+n₂+n₃+n₄=1; the values of n₁, n₂, n₃, n₄ are calculated using the analytic hierarchy process.
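Formula (1) itself is not reproduced in this text. One form consistent with the weight definitions above, assuming a plain weighted sum of the four static factor values (and assuming the values have been normalized to a common scale, which the text does not state), would be:

```latex
S_i = n_1\, s_{cpu\_speed} + n_2\, s_{cpu\_num} + n_3\, s_{mem} + n_4\, s_{disk},
\qquad n_1 + n_2 + n_3 + n_4 = 1
```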

Preferably, step (4) is specifically as follows:

(4.1) Each Slave node periodically obtains its own dynamic factor values, at the interval given in the Ganglia cluster resource monitoring system configuration file, including the node's CPU surplus ratio d_cpu, memory surplus ratio d_mem, disk capacity surplus ratio d_disk, and CPU load d_length;

(4.2) The Slave node sends the collected data to the Master node by unicast;

(4.3) The Master node calculates the dynamic performance index D_i of the i-th Slave node using formula (2), where i = 1 to h and h is the number of Slave nodes in the cluster;

where m₁, m₂, m₃, m₄ are the weights of the dynamic factors CPU surplus ratio, memory surplus ratio, disk capacity surplus ratio, and CPU load respectively, with m₁+m₂+m₃+m₄=1; the values of m₁, m₂, m₃, m₄ are calculated using the analytic hierarchy process.
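Formula (2) is likewise not reproduced in this text. By analogy with the weight definitions above, a plausible form is a weighted sum of the four dynamic factor values. How the CPU load term enters is an assumption: a higher load normally means lower available capacity, so d_length may need to be inverted or otherwise transformed before it is weighted.

```latex
D_i = m_1\, d_{cpu} + m_2\, d_{mem} + m_3\, d_{disk} + m_4\, d_{length},
\qquad m_1 + m_2 + m_3 + m_4 = 1
```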

Preferably, step (5) is specifically as follows: the Master node uses the static index value S_i and the dynamic index value D_i of each Slave node obtained in steps (3) and (4) to calculate the priority of each node using formula (3):

P_i = αD_i + βS_i    (3)

where α and β are the weights of D_i and S_i respectively, calculated using the analytic hierarchy process.
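Formula (3) and the sorting in step (6) can be sketched as follows. The function names, the index values, and the weights α = 0.6 and β = 0.4 are hypothetical and chosen only for illustration; the patent derives α and β with AHP and does not state their values here.

```python
def node_priority(d_i, s_i, alpha, beta):
    """Formula (3): P_i = alpha * D_i + beta * S_i."""
    return alpha * d_i + beta * s_i

def rank_nodes(indices, alpha, beta):
    """Return node names sorted by descending priority.

    indices maps a node name to its (D_i, S_i) pair.
    """
    priorities = {
        name: node_priority(d, s, alpha, beta)
        for name, (d, s) in indices.items()
    }
    return sorted(priorities, key=priorities.get, reverse=True)

# Hypothetical dynamic/static index values for three Slave nodes.
ALPHA, BETA = 0.6, 0.4
cluster = {
    "slave1": (0.80, 0.50),
    "slave2": (0.30, 0.90),
    "slave3": (0.55, 0.70),
}
ranking = rank_nodes(cluster, ALPHA, BETA)
```

In this example slave2 has the strongest hardware (highest S_i) but is currently the busiest (lowest D_i), so it ranks last. That is the behavior the dynamic term is meant to produce.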

Preferably, step (7) is specifically as follows:

(7.1) The Master node traverses the node set WorkerOffer in turn, in order of node priority;

(7.2) For each node, traverse each task in the task set in turn, executing step (7.3) in a loop;

(7.3) Obtain the locality parameter of the task on the current node; if the parameter is the maximum, execute step (7.4); otherwise execute step (7.2);

(7.4) Assign the task to the node.
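Steps (7.1) to (7.4) can be read as the loop sketched below. This is an interpretation, not Spark's scheduler code. The two-valued `locality_score` is a hypothetical stand-in for Spark's locality levels (PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY). "The parameter is the maximum" is taken to mean that the current node offers the best locality the task can get on any offered node. Executor slot limits are ignored for brevity.

```python
def locality_score(task, node):
    """Hypothetical locality parameter: 1 if the node holds the task's data."""
    return 1 if node in task["preferred_nodes"] else 0

def assign_tasks(nodes_by_priority, tasks):
    """Map each task id to a node, visiting nodes in priority order."""
    assignment = {}
    for node in nodes_by_priority:                      # step (7.1)
        for task in tasks:                              # step (7.2)
            if task["id"] in assignment:
                continue
            best = max(locality_score(task, n) for n in nodes_by_priority)
            if locality_score(task, node) == best:      # step (7.3)
                assignment[task["id"]] = node           # step (7.4)
    return assignment

# Hypothetical offers (already sorted by priority) and pending tasks.
offers = ["slave1", "slave3", "slave2"]
pending = [
    {"id": "t0", "preferred_nodes": {"slave2"}},
    {"id": "t1", "preferred_nodes": {"slave1", "slave3"}},
    {"id": "t2", "preferred_nodes": set()},
]
assignment = assign_tasks(offers, pending)
```

Task t1 could run locally on slave1 or slave3 and goes to slave1 because it has the higher priority. Task t0 waits for slave2, the only node holding its data. Task t2 has no preference and lands on the highest-priority node.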

The beneficial effects of the present invention are as follows: the invention uses priority to describe the computing capability of nodes in a resource-unbalanced heterogeneous cluster and schedules tasks according to node priority. While the cluster is running, the dynamic factor values of each Slave node are obtained in real time and the node's priority value is updated. The proposed algorithm can therefore schedule tasks according to the current performance of the nodes, effectively improving cluster performance and shortening task execution time.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of the invention;

Fig. 2 is a schematic diagram of the node priority evaluation index system of the invention;

Fig. 3 is an implementation architecture diagram of the SDASA algorithm of the invention;

Fig. 4 is a schematic comparison of task completion times when the SDASA algorithm of the invention and the Spark default algorithm execute the same kind of task with different data volumes;

Fig. 5 is a schematic comparison of task completion times when the SDASA algorithm of the invention and the Spark default algorithm execute different kinds of tasks.

Specific embodiment

The present invention is further described below with reference to a specific embodiment, but the protection scope of the invention is not limited thereto:

Embodiment: Spark's default task scheduling is based on the idealized assumption that cluster nodes are homogeneous. To address this problem, the invention analyzes the computing capability of each node in the cluster, optimizes the underlying scheduling algorithm of Spark, and proposes the Spark Dynamic Adaptive Scheduling Algorithm (SDASA), which is based on node priority. SDASA fully considers node heterogeneity, resource utilization, and load, and can improve the operating efficiency of the Spark system and shorten job execution time.

The computing capability of a node is represented by its node priority: the higher the priority, the stronger the node's computing capability and the greater the probability that it is selected to execute a task. Node priority is calculated from a set of indices that describe node performance (the node performance indices). These comprise a static performance index and a dynamic performance index. The static performance index is independent of task execution status, and its value is determined by several static factors. The dynamic performance index changes with task execution status, and its value is determined by several dynamic factors.

As shown in Fig. 1, a task scheduling optimization method for a resource-unbalanced Spark environment comprises the following steps:

(1) Screen the static factors and dynamic factors that influence node priority, establish a node priority evaluation index system, and calculate the weight of each index.

(1.1) Analyze the factors that influence node performance and establish the node priority evaluation index system, as shown in Fig. 2. The analysis uses principal component analysis to determine that the static factors of a node are its CPU speed, number of CPU cores, memory size, and disk capacity, and that the dynamic factors of a node are its CPU surplus ratio, memory surplus ratio, disk capacity surplus ratio, and CPU load (i.e., the length of the CPU's queue);

(1.2) Domain experts assess the importance of each index;

(1.3) Calculate the weight of each static performance index and dynamic performance index using the analytic hierarchy process.

(2) Deploy the distributed cluster resource monitoring system Ganglia in the cluster to monitor information such as the memory, CPU, hard disk, and network traffic of each Slave node. When the cluster starts, monitoring is triggered and the heartbeat starts.

(3) When the cluster is established or a new node joins the cluster, the Master node calculates the static performance index value of each Slave node, or of the newly added node.

(3.1) When the cluster is established or a new node joins the cluster, each Slave node (or the newly added Slave node) uses Ganglia to obtain its own static factor values, including CPU speed s_{cpu_speed}, number of CPU cores s_{cpu_num}, memory size s_mem, and disk capacity s_disk;

(3.2) Each Slave node sends the collected data to the Master node by unicast;

(3.3) The Master node calculates the static performance index S_i of the i-th Slave node using formula (1), where i = 1 to h and h is the number of Slave nodes in the cluster.
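As a rough illustration of how the Master node might read the static factor values in steps (3.1) and (3.2), the sketch below parses an XML cluster report of the kind gmond produces. The metric names `cpu_speed`, `cpu_num`, `mem_total`, and `disk_total` are assumed to be the standard Ganglia metrics corresponding to the four static factors; that mapping, the sample document, and the function name `static_factors` are assumptions and are not taken from the patent.

```python
import xml.etree.ElementTree as ET

# Assumed Ganglia metric names for the four static factors.
STATIC_METRICS = ("cpu_speed", "cpu_num", "mem_total", "disk_total")

def static_factors(ganglia_xml):
    """Return {host name: {metric name: value}} for the static metrics."""
    root = ET.fromstring(ganglia_xml)
    result = {}
    for host in root.iter("HOST"):
        values = {}
        for metric in host.iter("METRIC"):
            name = metric.get("NAME")
            if name in STATIC_METRICS:
                values[name] = float(metric.get("VAL"))
        result[host.get("NAME")] = values
    return result

# Illustrative report for a single hypothetical Slave node.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.7.2" SOURCE="gmond">
  <CLUSTER NAME="spark">
    <HOST NAME="slave1" IP="10.0.0.11">
      <METRIC NAME="cpu_speed" VAL="2400" TYPE="uint32" UNITS="MHz"/>
      <METRIC NAME="cpu_num" VAL="4" TYPE="uint16" UNITS="CPUs"/>
      <METRIC NAME="mem_total" VAL="16384000" TYPE="float" UNITS="KB"/>
      <METRIC NAME="disk_total" VAL="500.0" TYPE="double" UNITS="GB"/>
      <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=" "/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>"""

factors = static_factors(SAMPLE_XML)
```

Because the raw values come in different units (MHz, cores, KB, GB), they would need to be normalized before being combined into S_i. The text does not describe how this is done.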

(4) The Master node calculates the dynamic performance index value of each Slave node.

(4.1) Each Slave node periodically obtains its own dynamic factor values, at the interval given in the Ganglia system configuration file, including the node's CPU surplus ratio d_cpu, memory surplus ratio d_mem, disk capacity surplus ratio d_disk, and CPU load d_length;

(4.2) The Slave node sends the collected data to the Master node by unicast;

(4.3) The Master node calculates the dynamic performance index D_i of the i-th Slave node using formula (2), where i = 1 to h and h is the number of Slave nodes in the cluster.
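The following sketch shows one way the dynamic factors of step (4.1) could be derived from raw monitoring values and combined into D_i. It makes several assumptions: the raw metric names follow Ganglia conventions (`cpu_idle`, `mem_free`, `load_one`, and so on), D_i is a plain weighted sum, and the CPU load is mapped into (0, 1] so that a lighter load scores higher. None of these details is given in this text, and the weights are hypothetical.

```python
def dynamic_factors(m):
    """Derive the four dynamic factor values from one node's raw metrics."""
    return {
        "d_cpu": m["cpu_idle"] / 100.0,            # cpu_idle is a percentage
        "d_mem": m["mem_free"] / m["mem_total"],
        "d_disk": m["disk_free"] / m["disk_total"],
        # Assumption: lower load is better, so map load per core into (0, 1].
        "d_length": 1.0 / (1.0 + m["load_one"] / m["cpu_num"]),
    }

def dynamic_index(factors, weights):
    """Weighted sum of the dynamic factors; the weights must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * factors[name] for name in weights)

# Hypothetical raw metrics for one Slave node and hypothetical AHP weights.
raw = {
    "cpu_idle": 75.0, "mem_free": 4.0, "mem_total": 16.0,
    "disk_free": 250.0, "disk_total": 500.0,
    "load_one": 2.0, "cpu_num": 4,
}
m_weights = {"d_cpu": 0.4, "d_mem": 0.3, "d_disk": 0.1, "d_length": 0.2}

d_i = dynamic_index(dynamic_factors(raw), m_weights)
```

Recomputing `d_i` each time new metrics arrive gives the real-time priority update described above, since only the dynamic term of formula (3) changes while a job runs.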

(5) The Master node calculates the priority of each node.

When a node sorting request arrives, the Master node reads the static index value S_i and the dynamic index value D_i of each node from the database and calculates the priority of each node using formula (3).

(6) The Master node reads the priority of each Slave node and sorts the nodes by priority value.

(7) The Master node selects Slave nodes according to priority, then traverses the selected nodes and assigns each task to be run to the Slave node with the highest degree of locality.

The above method is implemented on the architecture shown in Fig. 3. The experimental results comparing the method of the invention with the default Spark task scheduling algorithm are shown in Fig. 4 and Fig. 5.

In summary, on the basis of the node priority evaluation index system, the invention uses the analytic hierarchy process to determine the weight of each static factor and dynamic factor. The SDASA algorithm obtains the dynamic index value of each Slave node in real time, calculates node priority, and assigns tasks according to the priority of each node. Experiments show that, compared with the Spark default scheduling algorithm, the proposed algorithm effectively improves the performance of the cluster system. When executing the same kind of task with different data volumes, cluster performance improves by 6.99% on average with the SDASA algorithm; when executing different kinds of tasks, cluster performance improves by 6.32% on average.

The above describes specific embodiments of the present invention and the technical principles employed. Any change made according to the concept of the present invention, where the resulting function does not go beyond the spirit covered by the specification and drawings, shall fall within the protection scope of the present invention.

Claims

1. A task scheduling optimization method for a resource-unbalanced Spark environment, characterized by comprising the following steps:

(1) Screen the static factors and dynamic factors that influence node priority, establish a node priority evaluation index system, and calculate the weight of each index;

(3) When the cluster is established or a new node joins the cluster, the Master node calculates the static performance index value of each Slave node, or of the newly added node;

(5) The Master node calculates the priority of each Slave node;

(6) The Master node reads the priority of each Slave node and sorts the nodes by priority value;

(7) The Master node selects Slave nodes according to the sorting result, traverses the selected nodes, and assigns each task to be run to the Slave node with the highest degree of locality;

2. The task scheduling optimization method for a resource-unbalanced Spark environment according to claim 1, characterized in that step (1) is specifically as follows:

(1.1) Use principal component analysis to determine that the static factors of a node are its CPU speed, number of CPU cores, memory size, and disk capacity;

(1.2) Use principal component analysis to determine that the dynamic factors of a node are its CPU surplus ratio, memory surplus ratio, disk capacity surplus ratio, and CPU load;

(1.3) Based on the analysis results of steps (1.1) and (1.2), establish the node priority evaluation index system and assess the importance of each index;

3. The task scheduling optimization method for a resource-unbalanced Spark environment according to claim 1, characterized in that step (3) is specifically as follows:

(3.1) Each Slave node uses the Ganglia cluster resource monitoring system to obtain its own static factor values, including CPU speed s_{cpu_speed}, number of CPU cores s_{cpu_num}, memory size s_mem, and disk capacity s_disk;

(3.2) The Slave node sends the collected data to the Master node by unicast;

(3.3) The Master node calculates the static performance index S_i of the i-th Slave node using formula (1), where i = 1 to h and h is the number of Slave nodes in the cluster;

where n₁, n₂, n₃, n₄ are the weights of the static factors CPU speed, number of CPU cores, memory size, and disk capacity respectively, with n₁+n₂+n₃+n₄=1; the values of n₁, n₂, n₃, n₄ are calculated using the analytic hierarchy process.

4. The task scheduling optimization method for a resource-unbalanced Spark environment according to claim 1, characterized in that step (4) is specifically as follows:

(4.1) Each Slave node periodically obtains its own dynamic factor values, at the interval given in the Ganglia cluster resource monitoring system configuration file, including the node's CPU surplus ratio d_cpu, memory surplus ratio d_mem, disk capacity surplus ratio d_disk, and CPU load d_length;

(4.2) The Slave node sends the collected data to the Master node by unicast;

(4.3) The Master node calculates the dynamic performance index D_i of the i-th Slave node using formula (2), where i = 1 to h and h is the number of Slave nodes in the cluster;

where m₁, m₂, m₃, m₄ are the weights of the dynamic factors CPU surplus ratio, memory surplus ratio, disk capacity surplus ratio, and CPU load respectively, with m₁+m₂+m₃+m₄=1; the values of m₁, m₂, m₃, m₄ are calculated using the analytic hierarchy process.

5. The task scheduling optimization method for a resource-unbalanced Spark environment according to claim 1, characterized in that step (5) is specifically as follows: the Master node uses the static index value S_i and the dynamic index value D_i of each Slave node obtained in steps (3) and (4) to calculate the priority of each node using formula (3):

P_i = αD_i + βS_i    (3)

6. The task scheduling optimization method for a resource-unbalanced Spark environment according to claim 1, characterized in that step (7) is specifically as follows:

(7.3) Obtain the locality parameter of the task on the current node; if the parameter is the maximum, execute step (7.4); otherwise execute step (7.2);

(7.4) Assign the task to the node.