CN111813512B - High-energy-efficiency Spark task scheduling method based on dynamic partition - Google Patents

High-energy-efficiency Spark task scheduling method based on dynamic partition Download PDF

Info

Publication number
CN111813512B
CN111813512B CN202010578245.8A CN202010578245A CN111813512B CN 111813512 B CN111813512 B CN 111813512B CN 202010578245 A CN202010578245 A CN 202010578245A CN 111813512 B CN111813512 B CN 111813512B
Authority
CN
China
Prior art keywords
task
scheduling
time
partitions
energy consumption
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010578245.8A
Other languages
Chinese (zh)
Other versions
CN111813512A (en
Inventor
李鸿健
魏尧俊
熊渝
谭港凡
王菁精
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Dayu Chuangfu Technology Co ltd
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010578245.8A priority Critical patent/CN111813512B/en
Publication of CN111813512A publication Critical patent/CN111813512A/en
Application granted granted Critical
Publication of CN111813512B publication Critical patent/CN111813512B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the field of big data processing, in particular to a high-energy-efficiency Spark task scheduling method based on dynamic partitioning, which comprises the following steps: initializing a server to obtain a task information reference table; a user initiates a task scheduling request; the server receives the task scheduling request and then acquires task data information; judging whether the task limits the task completion time and whether the task is executed for the first time; determining the number of the partitions according to the judgment result; scheduling and running tasks of the server according to the number of the partitions; after the task operation is finished, calculating the data use energy consumption of the task operation condition by adopting an energy consumption evaluation model, and recording the data use energy consumption into a task information reference table to finish task scheduling; the invention improves the condition that the node performance can not be effectively utilized due to the unified treatment and allocation of the tasks in the operation process of the Spark native scheduling strategy, realizes better scheduling of the nodes with different CPU and I/O performances according to the task condition, and achieves the effect of energy saving.

Description

High-energy-efficiency Spark task scheduling method based on dynamic partition
Technical Field
The invention relates to the field of big data processing and energy efficiency, in particular to a high-energy-efficiency Spark task scheduling method based on dynamic partitioning.
Background
Under the rapid development of big data and cloud computing, the amount of data generated per second is exponentially increasing, so that a large amount of computing resources are needed to ensure that the data can be processed in time, and thus a centralized data processing mode using a cloud computing center is increasingly common. The total amount of data in china is increasing at a rate of 50% per year, and is expected to account for 21% of the world in 2020, and china is becoming a true world of data resources. Cloud computing is providing support for this enormous data processing with its powerful computing power and increasingly sophisticated architecture.
Currently, spark has become a mainstream framework of data processing, so it is meaningful to perform energy saving optimization on energy consumption in a data processing process in Spark environment, and similarly, it is also important to study energy efficiency of performance and energy consumption balance. The Spark computing framework is widely applied to distributed cloud computing, and application programs processed by the Spark computing framework can be divided into two types: CPU intensive and I/O intensive. When the Spark Task runs, each partition is converted into a Task to run on the node. When the CPU intensive tasks run on more partitions, the number of nodes and the number of corresponding machines for running the tasks are increased, and the calculation speed can be increased by increasing the number of the partitions, so that the program can run more quickly.
However, when the number of partitions is too large, the utilization rate of the CPU is reduced, the node performance cannot be sufficiently exerted, and resources and energy consumption are wasted. For an I/O-intensive application, many I/O operations need to be performed, and when a large number of partitions are used, the I/O operations need to consider a network problem among multiple nodes, a problem of aggregation of different nodes of a file, and the like, so that the I/O utilization rate of the nodes is also reduced when the number of partitions is too large, and the performance cannot be fully exerted. Each partition is finally converted into a task to run on the node, that is, when the number of partitions is larger, more energy consumption is caused, and therefore, the number of partitions needs to be in a proper range.
Therefore, the adjustment of the number of the partitions can be used for improving the parallelism problem of the Spark at present, so that the energy consumption problem is optimized. How to reasonably control the task parallelism of the Spark on the premise of ensuring the running efficiency (time) of the application program so as to achieve the purpose of energy saving is a problem to be solved urgently.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a high-energy-efficiency Spark task scheduling method based on dynamic partitioning, which comprises the following steps:
s1: initializing a server to obtain a task information reference table;
s2: a user initiates a task scheduling request;
s3: the server receives the task scheduling request and then obtains task data information;
s4: judging whether the task limits the task completion time or not;
if the task completion time is limited, performing preliminary determination on the partition number; determining the final task partition number according to the preliminarily determined partition number;
if the time for completing the task is not limited, judging whether the task exists in a task information reference table or not;
if not, performing Spark default task scheduling and task execution;
if yes, finally determining the number of the task partitions;
s5: scheduling and executing the tasks of the server according to the final task partition number to complete task scheduling;
s6: calculating the server use energy consumption of the task operation condition by adopting an energy consumption evaluation model, recording data information for executing the task and the server use energy consumption into a task information reference table, and updating the task information reference table;
s7: and when the user initiates a new task scheduling request, returning to the step S3.
Preferably, the task data information includes: the running condition and the task data parameters of the Application program Application; the task data parameters include: CPU usage, I/O usage, task size, and application runtime.
Preferably, the preliminary determination of the number of partitions includes:
step 1: dividing the task types by adopting a K-means algorithm;
and 2, step: and determining the number of the preliminary partitions according to the task type division result and the data in the task operation information reference table.
Preferably, the process of dividing the task types includes:
step 11: taking all prime numbers and sequencing Sort between 2 and 20000000 as initial clustering center points of task clustering;
step 12: taking the CPU utilization rate and the I/O utilization rate as characteristic values, and representing Application program Application as two-dimensional coordinates by adopting the characteristic values;
step 13: performing initialization test on tasks to be executed, calculating the difference degree between the tasks according to the initialization test result, and taking the value of the difference degree as the Euclidean distance D between different application programs;
step 14: performing task clustering on task data information according to the initial clustering center point obtained in the step 11 and the Euclidean distance in the step 13; and obtains the task type.
Further, the application is represented in two-dimensional coordinates as:
Figure BDA0002552081900000031
further, the calculation formula of the euclidean distance D between different applications is:
Figure BDA0002552081900000032
preferably, the process of finally determining the task partition number includes:
step 1: selecting information of different historical tasks from the task information reference table as a factor for adjusting the number of partitions under different requirements of whether the task deadline is limited or not;
in the case of a defined task deadline: the task running time is a main factor, and the energy efficiency ratio is a secondary factor;
in case of non-defined task deadlines: the energy efficiency ratio is a main factor, and the number of partitions in the task information reference table is a secondary factor;
step 2: acquiring the resource use condition and the running state of each server in the cluster;
and step 3: adjusting the number of the partitions by using a partition adjusting formula to obtain the final number of the partitions; and if the available resources of the server are insufficient, delaying the task and then re-determining the final partition number.
Further, the partition adjustment formula is:
Figure BDA0002552081900000033
preferably, the task scheduling and running process includes:
step 1: taking a server as a node, and dividing each node into two types of nodes with stronger CPU performance and I/O performance; dividing the two types of nodes into 2 sets, and recording the sets into a local library;
step 2: scheduling tasks of different types to two types of nodes with different performances;
and step 3: and after the task scheduling is finished, entering task operation.
Preferably, the energy consumption evaluation model includes an energy consumption model E and an energy efficiency model η;
the expression of the energy consumption model is as follows:
Figure BDA0002552081900000041
the expression of the energy efficiency model is as follows:
Figure BDA0002552081900000042
the method optimizes the energy consumption problem in Spark operation from the parallel angle and the partition number angle, and provides reference for determining the partition number of the subsequent task by taking the operation result as reference data; the method improves the condition that the node performance cannot be effectively utilized due to the fact that tasks are uniformly treated in the operation process of the Spark native scheduling strategy, achieves better use of the nodes with different CPU and I/O (input/output) performances according to the task condition, ensures that the Spark tasks can utilize the difference of the node performances, and achieves the effect of improving the energy efficiency.
Drawings
FIG. 1 is a block diagram of the present invention;
FIG. 2 is a flow chart of Spark task scheduling based on the number of task partitions and an energy efficiency model according to the present invention;
FIG. 3 is a table of task information references used by the present invention;
FIG. 4 is a schematic diagram of the RDD process of Spark according to the present invention;
fig. 5 is a node architecture diagram of Spark.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to better adjust the number of the task operation partitions, the task is better completed within the limited task time, the energy consumption is reduced, and the energy efficiency is improved; the invention provides a high-energy-efficiency Spark task scheduling method based on dynamic partitioning, as shown in fig. 2, comprising the following steps:
s1: initializing a server to obtain a task information reference table;
s2: a user initiates a task scheduling request;
s3: the server receives the task scheduling request and then acquires task data information;
s4: judging whether the task limits the task completion time or not;
if the task completion time is limited, performing preliminary determination on the partition number; determining the final task partition number according to the preliminarily determined partition number;
if the task completion time is not limited, judging whether the task exists in the task information reference table or not;
if not, performing Spark default task scheduling and task execution;
if yes, finally determining the number of the task partitions;
s5: scheduling and executing the tasks of the server according to the final task partition number to complete task scheduling;
s6: calculating the server use energy consumption of the task operation condition by adopting an energy consumption evaluation model, recording data information for executing the task and the server use energy consumption into a task information reference table, and updating the task information reference table;
s7: and when the user initiates a new task scheduling request, returning to the step S3.
Wherein Spark is a secure, formally defined programming language.
The updated task information reference table enriches the data in the task information reference table, and when a new task is executed, the updated task information reference table is adopted to determine the partition number.
As shown in FIG. 1, the module of the present invention includes a CPU and I/O information collection module, a Spark running log extraction module, and a partition number adjustment module. The CPU, the I/O information acquisition module and the Spark running log extraction module are compiled by Shell scripts under Linux and are used for preliminarily acquiring data and simply filling a task information reference table to provide data support for subsequent operations such as task clustering, partition number adjustment and the like. The task data information is mainly acquired in the CPU/IO operation monitoring module. The energy efficiency calculation and task information filling reference table mainly uses a Spark running log extraction module to extract data from different running task logs and uses the data as subsequent task reference data, so that the Spark running log extraction module mainly works.
In the invention, different task data have different task scheduling processes, including: if the task deadline is not limited and the task information does not exist in the task information reference table, performing Spark default task scheduling and execution; limiting the task deadline, and performing preliminary determination and final determination of the number of task partitions; the task deadline is not limited, but the task information already exists in the task information reference table, and the final determination of the task partition number is directly entered.
The rated running frequency of the CPU is F, and the CPU is started at the time t i Has a CPU usage rate of
Figure BDA0002552081900000061
Time t 0 Has a CPU usage rate of
Figure BDA0002552081900000062
At time t i Power P of i Expressed as:
Figure BDA0002552081900000063
time t 0 CPU power P of 0 Expressed as:
Figure BDA0002552081900000064
the specified running time is t, and the energy consumption E of a certain time period temp Comprises the following steps:
Figure BDA0002552081900000065
then the application energy consumption is the power integral during the task run, i.e. the energy consumption model can be expressed as:
Figure BDA0002552081900000066
in the embodiment, a CPU/IO real-time operation monitoring module is constructed through Shell scripts written under Linux to monitor the operation process of an application program in Spark, and Spark applications can be divided into CPU intensive computing and I/O intensive applications; the example here selects the more common Spark benchmark application for testing, such as finding all prime numbers between 2 and 20000000, sort. The partition number is adjusted to each of the following available values for different types of applications: 1,2,4,8, 16. And performing multiple experiments, collecting the task scale, the CPU utilization rate of different nodes, the I/O utilization rate, the running time, the running log and the like as characteristic data, calculating the energy consumption and the energy efficiency of each application program under different partition numbers through the energy consumption and energy efficiency model, and storing the energy consumption and the energy efficiency into a task running information table so as to provide a reference value for selecting the optimal partition number for the subsequent application program.
In the cluster, different applications have different emphasis on the use of node resources, for example, the use of prime number calculation is more CPU resources, the use of Sort sorting is more I/O resources, any application can be divided into two different types by using different applications on the node resources, the prime number calculation belongs to CPU-bound, and the Sort sorting belongs to I/O-bound. The CPU-bound type refers to that the application needs more operations in the running process to cause higher CPU utilization rate, and the read-write operation of a disk and a memory is relatively less; the I/O-bound type is contrary to the I/O-bound type, and when the I/O-bound type is applied to operation, more read-write operations can be carried out on a disk and a memory, and the I/O utilization rate is higher and the relative CPU utilization rate is lower when the read-write operations are too much.
Through prime number calculation and Sort of Sort under the conditions of running time and energy consumption of different task partition numbers, the fact that the running time can be shortened and the energy consumption is relatively low under the condition that the partition number is large for the CPU intensive application of prime number calculation can be found, the I/O intensive application of Sort is observed reversely, the large partition number not only increases the energy consumption, but also has little effect on shortening the task running time, and the method can be regarded as waste of resources. Therefore, for different application types, the number of partitions should be reasonably set, so that resource waste caused by too many nodes occupied by Sort-type I/O intensive applications is avoided, and better energy consumption is achieved on the premise of limiting the task running time.
More nodes consume more energy to operate than less nodes. This can be interpreted here as: when more partitions run, the number of nodes where the partitions are located is larger, and the number of machines corresponding to the nodes is larger.
For the CPU-intensive application program, the calculation speed can be increased by a large number of partitions, so that the program can be run and completed within a limited time, but the energy consumption problem is also caused, and when the number of the partitions is in a proper number, the completion within the limited time can be met, and the problem of overhigh node energy consumption is solved.
For an I/O intensive application program, more I/O operations need to be carried out, when more partitions run, the I/O operations need to consider the network problem among multiple nodes, the problem of different node convergence of files and the like, at the moment, the I/O utilization rate of the nodes is reduced when the partitions are too many, the performance cannot be fully used, more energy consumption is caused when the number of the partitions is more, and therefore the number of the partitions does not need to be too many.
The process of preliminary determination of the number of partitions includes:
step 1: dividing the task types by adopting a K-means algorithm;
step 11: selecting and searching all prime numbers between 2 and 20000000 and sequencing Sort as an initial clustering center point of the task clustering;
step 12: selecting CPU utilization rate and I/O utilization rate as characteristic values, and expressing the Application program Application by a two-dimensional coordinate system as follows:
Figure BDA0002552081900000081
wherein the content of the first and second substances,
Figure BDA0002552081900000082
for application at t i The rate of use of the CPU at the time,
Figure BDA0002552081900000083
for application t i I/O usage at runtime.
Step 13: performing initialization test on a task to be executed, wherein the initialization test comprises the acquisition of the CPU utilization rate and the I/O utilization rate during the running of the task; calculating the difference degree of the utilization rates of the CPU and the I/O between tasks according to the utilization rate of the CPU and the utilization rate of the I/O; taking the value of the difference degree as the Euclidean distance D between different application programs;
Figure BDA0002552081900000084
wherein the content of the first and second substances,
Figure BDA0002552081900000085
for application at t i The rate of use of the CPU at the time,
Figure BDA0002552081900000086
indicating that the application is at t t The rate of use of the CPU at the time,
Figure BDA0002552081900000087
running t for an application i The rate of I/O usage at the time,
Figure BDA0002552081900000088
representing application runs t t I/O usage at the time.
Step 14: performing task clustering on task data information according to the initial clustering center point obtained in the step 11 and the Euclidean distance in the step 13; and obtains the task type.
And 2, step: and determining the number of the preliminary partitions according to the task type division result and the data in the task operation information reference table.
The process of finally determining the number of task partitions includes:
step 1: selecting different historical task information from the task running information table as partition number adjustment references under different requirements of whether the task deadline is limited or not; specific considerations are energy efficiency ratio, running time, available resources, partition number, and the like;
in case a task deadline is defined: the main factor is task running time, and the secondary factor is energy efficiency ratio;
in case of non-defined task deadlines: the main factor is the energy efficiency ratio, and the secondary factor is the number of partitions;
and 2, step: collecting the resource condition and the running state of each node of the cluster for task scheduling;
and 3, step 3: adjusting the partition number by using a partition adjusting formula to obtain a final partition number; if the available resources of the server are insufficient, the task is delayed and the final partition number is determined again; the specific partition number adjustment strategy is as follows:
Figure BDA0002552081900000091
wherein, N is the number of partitions in the similar condition, min (.) represents the minimum value, and T c Time taken to accomplish similar tasks, T t To limit the difference between the task deadline and the current time, r is an energy efficiency value, C is a constant, r N An effective value, r, representing the number of partitions N N-1 Representing an energy efficiency value, r, representing the number of partitions N-1 N+1 Representing an energy efficiency value r of N +1 L And representing the energy efficiency value of the remaining available nodes, wherein L is the number of the remaining available nodes, and M is the adjusted final partition number.
The upper part of the above formula is the condition of limiting the task deadline, the lower part is the condition of not limiting the task deadline, L is required to be larger than N, otherwise, retry is carried out after a certain time delay.
The task scheduling and running process comprises the following steps:
step 1: dividing the nodes into two types of nodes with stronger CPU performance and I/O performance; dividing the two types of nodes into 2 sets, and recording the sets into a local library;
step 2: scheduling tasks of different types to two types of nodes with different performances;
and step 3: and after the task scheduling is finished, entering task operation.
The formula for calculating the energy consumption E and the energy efficiency eta is as follows:
Figure BDA0002552081900000092
wherein F is the rated operating frequency of the CPU, and at the time t i The CPU usage rate of the time is
Figure BDA0002552081900000093
t is the application program running time;
Figure BDA0002552081900000101
wherein, C and C 1 、C 2 Is constant, E is energy consumption, U C Is CPU utilization rate, U IO I/O usage, T application runtime.
And calculating the total energy consumption and energy efficiency of the subareas by adopting an energy consumption evaluation model and filling the energy consumption and energy efficiency into a task operation information table.
As shown in fig. 4, the RDD process diagram of Spark is shown, where each arrow is a Task, and since each Partition runs as a Task when the RDD in Spark is calculated, the number of tasks is determined by the number of partitions in the RDD. Wherein RDD represents an elastic distributed data set and Task represents a Task.
As shown in fig. 5, it can be seen from the Spark architecture diagram that the number of tasks executed in parallel in Spark is determined by the product of the number of compute nodes (executors) in application and the core number of each compute node.
As shown in FIG. 3, the invention selects task size, CPU utilization, I/O utilization, running time, task type, partition number, and energy efficiency as an information recording table to record the running condition of the task.
The run information table reflects some run information of a certain task during execution.
According to the invention, the initial data and the task operation data acquired after operation are used as basic input data as reference, so that the partition number adjustment of the subsequent task is guided.
And recording the information acquired by the monitoring module into a task operation information reference table. After the task is completed, a Spark running log extraction module is used for extracting the task running log, then the data after the energy consumption and energy efficiency model calculation is filled into a task information table, when subsequent similar tasks arrive, the task information table can be analyzed, then appropriate partition number arrangement is calculated, and the partition number of the target task is set, so that the energy consumption is optimized.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A method for scheduling an energy-efficient Spark task based on dynamic partitioning is characterized by comprising the following steps:
s1: initializing a server to obtain a task information reference table;
s2: a user initiates a task scheduling request;
s3: the server receives the task scheduling request and then acquires task data information;
s4: judging whether the task limits the task completion time or not;
if the task completion time is limited, primarily determining the number of the partitions; determining the final task partition number according to the preliminarily determined partition number; the process of preliminary determination of the number of partitions includes:
step 1: dividing the task types by adopting a K-means algorithm;
step 11: all prime numbers and sequencing Sort between 2 and 20000000 are used as initial clustering center points of the task clusters;
step 12: taking the CPU utilization rate and the I/O utilization rate as characteristic values, and representing Application program Application as two-dimensional coordinates by adopting the characteristic values;
step 13: performing initialization test on tasks to be executed, calculating the difference degree between the tasks according to the initialization test result, and taking the value of the difference degree as the Euclidean distance D between different application programs;
step 14: performing task clustering on task data information according to the initial clustering center point obtained in the step 11 and the Euclidean distance in the step 13; and obtaining a task type;
step 2: determining the number of the preliminary partitions according to the task type division result and the data in the task operation information reference table;
the process of determining the final task partition number comprises the following steps:
step 1: selecting information of different historical tasks from the task information reference table as a factor for adjusting the number of partitions under different requirements of whether the task deadline is limited or not;
in the case of a defined task deadline: the task running time is a main factor, and the energy efficiency ratio is a secondary factor;
in the case of non-defined task deadlines: the energy efficiency ratio is a main factor, and the number of partitions in the task information reference table is a secondary factor;
step 2: acquiring the resource use condition and the running state of each server in the cluster;
and step 3: adjusting the partition number by using a partition adjusting formula to obtain a final partition number; if the available resources of the server are insufficient, the task is delayed and the final partition number is determined again;
if the task completion time is not limited, judging whether the task exists in the task information reference table or not;
if not, performing Spark default task scheduling and task execution;
if yes, finally determining the number of the task partitions;
s5: scheduling and executing the tasks of the server according to the final task partition number to complete task scheduling; the task scheduling and running process comprises the following steps:
step 1: taking a server as a node, and dividing each node into two types of nodes with stronger CPU performance and I/O performance; dividing two types of nodes into 2 sets;
step 2: scheduling tasks of different types to two types of nodes with different performances;
and step 3: after the task scheduling is finished, entering task operation;
s6: calculating the server use energy consumption of the task operation condition by adopting an energy consumption evaluation model, recording the data information for executing the task and the server use energy consumption into a task information reference table, and updating the task information reference table;
s7: and when the user initiates a new task scheduling request, returning to the step S3.
2. The method according to claim 1, wherein the task data information includes: the running condition and the task data parameters of the Application program Application; the task data parameters include: CPU usage, I/O usage, task size, and application runtime.
3. The method for scheduling the efficient Spark task based on the dynamic partition as claimed in claim 1, wherein the application program is represented by two-dimensional coordinates as:
Figure FDA0003878886750000021
wherein the content of the first and second substances,
Figure FDA0003878886750000022
for application at t i The rate of use of the CPU at the time,
Figure FDA0003878886750000023
representing application runs t i I/O usage at a time.
4. The method of claim 1, wherein the Euclidean distance D between different applications is calculated according to the formula:
Figure FDA0003878886750000031
wherein the content of the first and second substances,
Figure FDA0003878886750000032
for application at t i The rate of use of the CPU at the time,
Figure FDA0003878886750000033
indicating that the application is at t t The rate of use of the CPU at the time,
Figure FDA0003878886750000034
running t for an application i The rate of I/O usage at the time,
Figure FDA0003878886750000035
representing application runs t t I/O usage at the time.
5. The method according to claim 1, wherein the partition adjustment formula is:
Figure FDA0003878886750000036
where N is the number of partitions in which similar conditions exist, T c Time taken to accomplish similar tasks, T t And in order to limit the difference value between the task cutoff time and the current time, r is an energy effective value, L is the number of the remaining available nodes, C is a constant, M is the adjusted final partition number, and Min (.) represents the selected minimum value.
6. The dynamic partition-based energy-efficient Spark task scheduling method according to claim 1, wherein the energy consumption evaluation model includes an energy consumption model E and an energy efficiency model η;
the expression of the energy consumption model is as follows:
Figure FDA0003878886750000037
wherein F is the rated operating frequency of the CPU at time t i The CPU usage rate of the time is
Figure FDA0003878886750000038
t is the application program running time;
the expression of the energy efficiency model is as follows:
Figure FDA0003878886750000039
wherein, C and C 1 、C 2 Is constant, E is energy consumption, U C Is CPU utilization rate, U IO IO usage and T application runtime.
CN202010578245.8A 2020-06-23 2020-06-23 High-energy-efficiency Spark task scheduling method based on dynamic partition Active CN111813512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010578245.8A CN111813512B (en) 2020-06-23 2020-06-23 High-energy-efficiency Spark task scheduling method based on dynamic partition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010578245.8A CN111813512B (en) 2020-06-23 2020-06-23 High-energy-efficiency Spark task scheduling method based on dynamic partition

Publications (2)

Publication Number Publication Date
CN111813512A CN111813512A (en) 2020-10-23
CN111813512B true CN111813512B (en) 2022-11-25

Family

ID=72845460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010578245.8A Active CN111813512B (en) 2020-06-23 2020-06-23 High-energy-efficiency Spark task scheduling method based on dynamic partition

Country Status (1)

Country Link
CN (1) CN111813512B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117319236A (en) * 2022-06-22 2023-12-29 华为云计算技术有限公司 Resource allocation method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108762921A (en) * 2018-05-18 2018-11-06 电子科技大学 A kind of method for scheduling task and device of the on-line optimization subregion of Spark group systems
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN110209494A (en) * 2019-04-22 2019-09-06 西北大学 A kind of distributed task dispatching method and Hadoop cluster towards big data
CN110704542A (en) * 2019-10-15 2020-01-17 南京莱斯网信技术研究院有限公司 Data dynamic partitioning system based on node load
CN111176832A (en) * 2019-12-06 2020-05-19 重庆邮电大学 Performance optimization and parameter configuration method based on memory computing framework Spark

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8887163B2 (en) * 2010-06-25 2014-11-11 Ebay Inc. Task scheduling based on dependencies and resources
WO2016078008A1 (en) * 2014-11-19 2016-05-26 华为技术有限公司 Method and apparatus for scheduling data flow task
US10176092B2 (en) * 2016-09-21 2019-01-08 Ngd Systems, Inc. System and method for executing data processing tasks using resilient distributed datasets (RDDs) in a storage device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108762921A (en) * 2018-05-18 2018-11-06 电子科技大学 A kind of method for scheduling task and device of the on-line optimization subregion of Spark group systems
CN109582119A (en) * 2018-11-28 2019-04-05 重庆邮电大学 The double-deck Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN110209494A (en) * 2019-04-22 2019-09-06 西北大学 A kind of distributed task dispatching method and Hadoop cluster towards big data
CN110704542A (en) * 2019-10-15 2020-01-17 南京莱斯网信技术研究院有限公司 Data dynamic partitioning system based on node load
CN111176832A (en) * 2019-12-06 2020-05-19 重庆邮电大学 Performance optimization and parameter configuration method based on memory computing framework Spark

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
An energy-aware scheduling algorithm for big data applications in Spark;Hongjian Li等;《Cluster Computing》;20190604;593-609 *
基于DVFS的节能调度策略在Spark on YARN中的研究与实现;马恩杰;《中国优秀硕士学位论文全文数据库 信息科技辑》;20200215(第2期);I137-30 *
基于数据特性的Spark任务性能优化;柴宁等;《计算机应用与软件》;20180115;第35卷(第1期);52-84 *

Also Published As

Publication number Publication date
CN111813512A (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN110096349B (en) Job scheduling method based on cluster node load state prediction
WO2021159638A1 (en) Method, apparatus and device for scheduling cluster queue resources, and storage medium
WO2015051685A1 (en) Task scheduling method, device and system
CN106055311A (en) Multi-threading Map Reduce task parallelizing method based on assembly line
WO2017005115A1 (en) Adaptive optimization method and device for distributed dag system
CN102193830A (en) Many-core environment-oriented division mapping/reduction parallel programming model
WO2020248227A1 (en) Load prediction-based hadoop computing task speculative execution method
CN112882818A (en) Task dynamic adjustment method, device and equipment
Li et al. An energy-aware scheduling algorithm for big data applications in Spark
CN111813512B (en) High-energy-efficiency Spark task scheduling method based on dynamic partition
Yang et al. Improving Spark performance with MPTE in heterogeneous environments
CN117271143B (en) Data center optimization energy-saving scheduling method and system
Dai et al. Research and implementation of big data preprocessing system based on Hadoop
Wang et al. A model driven approach towards improving the performance of apache spark applications
CN109582119B (en) Double-layer Spark energy-saving scheduling method based on dynamic voltage frequency adjustment
CN104503820B (en) A kind of Hadoop optimization methods based on asynchronous starting
CN110888713A (en) Trusted virtual machine migration algorithm for heterogeneous cloud data center
CN116974994A (en) High-efficiency file collaboration system based on clusters
CN115982230A (en) Cross-data-source query method, system, equipment and storage medium of database
Ismaeel et al. A systematic cloud workload clustering technique in large scale data centers
CN114860449A (en) Data processing method, device, equipment and storage medium
Jiang et al. An optimized resource scheduling strategy for Hadoop speculative execution based on non-cooperative game schemes
CN115033389A (en) Energy-saving task resource scheduling method and device for power grid information system
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
Lin et al. An Energy-Efficient Tuning Method for Cloud Servers Combining DVFS and Parameter Optimization

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230803

Address after: Room 801, 85 Kefeng Road, Huangpu District, Guangzhou City, Guangdong Province

Patentee after: Guangzhou Dayu Chuangfu Technology Co.,Ltd.

Address before: 400065 Chongwen Road, Nanshan Street, Nanan District, Chongqing

Patentee before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS