KR101639947B1

KR101639947B1 - Hadoop preemptive deadline constraint scheduling method, computer program for executing the method, and recording medium storing the program

Info

Publication number: KR101639947B1
Application number: KR1020150052368A
Authority: KR
Inventors: 윤희용; 김경태; 김만윤; 류연중; 이산 울라; 이병준
Original assignee: 성균관대학교산학협력단
Priority date: 2015-04-14
Filing date: 2015-04-14
Publication date: 2016-07-15

Abstract

The present invention relates to a Hadoop preemptive deadline constraint scheduling method, a computer program for performing the method, and a medium on which the program is recorded. The scheduling method is applied to a Hadoop system and comprises the steps of: (a) a scheduler scanning a queue to acquire and initialize information on a plurality of jobs; (b) the scheduler scanning the queue to find and schedule available slots; (c) the scheduler calculating the remaining time until the execution completion time of the job being executed, comparing the remaining time with the deadline period of a new job according to the order of priority, and determining whether the new job preempts the slot of the job being executed; and (d) the scheduler allocating slots to jobs according to the preemption determination.
The present invention maximizes the number of jobs completed within their deadlines by using slots effectively in a Hadoop environment, and improves performance by refining the preemption criterion so that preemption is supported while its overhead is reduced.

Description

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a Hadoop preemptive deadline constraint scheduling method, a computer program for performing the method, and a medium on which the program is recorded. More particularly, it relates to a preemptive deadline constraint scheduling method based on Hadoop MapReduce that reduces the overhead of the preemption scheme by considering the remaining execution time of a running job and the deadlines of jobs.

Recently, the size of data to be processed is rapidly increasing due to the emergence of social networks, the development of sensor technology, and the proliferation of smartphones. Google proposed a distributed parallel processing model and a runtime environment called MapReduce, which operate in a cluster of multiple computers.

In addition, Hadoop is developed and distributed by the open source community as a distributed computing platform that includes the MapReduce programming model and the Hadoop Distributed File System (HDFS). Hadoop, which consists of HDFS and MapReduce, is an open source framework that runs in a cluster environment to process large amounts of data. MapReduce provides a distributed data processing model and an execution environment that run on a cluster, and it is divided into a map step and a reduce step.

One of the important issues in job execution is shortening the time needed to produce the execution result, that is, the completion time. To do this, the job tracker performs scheduling to determine the priority of the jobs in the job queue that are to run as MapReduce jobs. Hadoop HDFS distributes data across the cluster. Consequently, FIFO-based scheduling can delay high-priority work, because data must be transferred from another node when it is not present on the node where the map task runs.

The First In First Out (FIFO) scheme, which is used by default in Hadoop MapReduce, is too simple, and more efficient and more widely applicable schedulers have recently been studied.

Existing research has addressed deadline-constrained scheduling algorithms in cloud environments. Typically, a deadline is set when resources are requested, and slots are prepared for the map tasks. Slots are not released until all map tasks have completed. The job completion time is therefore the total time needed to complete both the map and the reduce tasks.

In other words, an allocated slot is not released until all other map tasks have finished, so other work cannot be executed in the meantime and the overall completion time becomes long. In addition, when a conventional preemption method is applied, excessive preemption increases the overhead.

Korean Patent Laid-Open Publication No. 10-2014-0080795 (published July 1, 2014)
Korean Patent Laid-Open Publication No. 10-2013-0063825 (published June 17, 2013)

The Hadoop preemptive deadline constraint scheduling method according to the present invention has the following objectives.

First, the present invention aims to provide a scheduling method that maximizes the number of jobs completed within their deadlines by using slots effectively.

Second, the present invention aims to improve the preemption decision criterion so that preemption is supported while its overhead is reduced.

The objectives of the present invention are not limited to those mentioned above, and other objectives not mentioned can be clearly understood by those skilled in the art from the following description.

According to an aspect of the present invention, there is provided a scheduling method applied to a Hadoop system, the scheduling method comprising the steps of: (a) a scheduler scanning a queue to acquire and initialize information on a plurality of jobs; (b) the scheduler scanning the queue to find and schedule available slots; (c) the scheduler calculating the remaining time until the execution completion time of the job being executed, comparing the remaining time with the deadline period of a new job according to the order of priority, and determining whether the new job preempts the slot of the job being executed; and (d) the scheduler allocating slots to jobs according to the preemption determination.

Here, the information on a job preferably includes at least the minimum number of slots requested by the job, the number of slots required by the job (initialized with the minimum number), the number of slots already allocated to the job, and the recalculated number of available slots in the queue. In step (b), the scheduler preferably scans the queue until no available slot remains.

Preferably, step (b) includes assigning slots, until no available slot remains, to jobs for which the number of already allocated slots is smaller than the minimum number of slots requested by the job. In step (c), a priority is preferably assigned to each job, with a higher priority given to a job having a shorter deadline period, and whether to preempt is determined according to that priority.

In step (c), the remaining time is preferably calculated by subtracting the currently elapsed calculation period from the maximum calculation period of the job being executed, and the maximum calculation period is preferably the execution calculation period set for the worst case of that job.

In addition, in step (c), if the remaining time is shorter than the deadline period of the new job, the scheduler preferably does not preempt the already allocated slot for the new job; when the already allocated slot becomes free after execution of the job completes, the new job is assigned to the free slot.

In step (d), new jobs that are not allocated slots preferably wait in the queue in deadline order, and the scheduling method preferably assumes that at least the minimum number of slots to be allocated to all jobs in the queue exists.

A second aspect of the present invention is a computer program stored in a medium for executing the Hadoop preemptive deadline constraint scheduling method in combination with hardware, and a third aspect of the present invention is a computer-readable medium on which a program for causing a computer to execute the Hadoop preemptive deadline constraint scheduling method is recorded.

The Hadoop preemption deadline constraint scheduling method according to the present invention has the following effects.

First, the present invention provides a preemptive deadline scheduling method that improves performance by maximizing the number of jobs performed in a cluster within a given request time.

Second, the present invention provides a scheduling method for maximizing a usage rate of a time slot under the restriction of a deadline in a Hadoop environment and reducing an average wait time of submitted jobs.

Third, the present invention provides an efficient preemption scheduling method by supporting the preemption under the preemption deadline constraint and improving the preemption criterion that can reduce the overhead of preemption.

Fourth, the present invention provides a scheduling method for improving average execution times and completion times.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned can be clearly understood by those skilled in the art from the following description.

FIG. 1 is a flowchart illustrating a Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram showing physical and virtual experimental environments of a Hadoop cluster.
FIG. 3 is a diagram illustrating an algorithm for performing the Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention.
FIG. 4 is a graph comparing the average execution time of the Hadoop preemptive deadline constraint scheduling method (HPDCS) according to an embodiment of the present invention with that of the conventional PDCS.
FIG. 5 is a graph comparing the job completion time of the HPDCS according to an embodiment of the present invention with that of the conventional PDCS.
FIG. 6 is a graph comparing the average waiting time of the HPDCS according to an embodiment of the present invention with that of the conventional PDCS.
FIG. 7 is a graph comparing the total waiting time of the HPDCS according to an embodiment of the present invention with that of the conventional PDCS.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings so that those skilled in the art can easily carry out the present invention. It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Wherever possible, the same or similar parts are denoted using the same reference numerals in the drawings.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The singular forms as used herein include plural forms as long as the phrases do not expressly express the opposite meaning thereto.

As used herein, the term "comprising" specifies the presence of a particular feature, region, integer, step, operation, element, and/or component, and does not exclude the presence or addition of other specific features, regions, integers, steps, operations, elements, components, and/or groups.

All terms including technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Predefined terms are further interpreted as having a meaning consistent with the relevant technical literature and the present disclosure, and are not to be construed as ideal or very formal meanings unless defined otherwise.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the drawings.

FIG. 1 is a flowchart illustrating a Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention. As shown in FIG. 1, the scheduling method applied to a Hadoop system according to an embodiment of the present invention comprises: (a) a scheduler scanning a queue to acquire and initialize information on a plurality of jobs (S100); (b) the scheduler scanning the queue to find and schedule available slots (S200); (c) the scheduler calculating the remaining time until the execution completion time of the job being executed, comparing the remaining time with the deadline period of a new job according to the order of priority, and determining whether the new job preempts the slot of the job being executed (S300); and (d) the scheduler allocating slots to jobs according to the preemption determination (S400).

As described above, to address the deadline constraint, the present invention proposes a scheduling method that introduces a policy allowing a slot to be preempted from one task and allocated to another. In this efficient and improved Hadoop preemptive deadline constraint scheduling method, when a map task completes, its slot is released immediately and the output generated by the map task is sent to the reduce task, which reduces the overall completion time.

Here, Hadoop is open source software developed around a distributed file system (HDFS). Hadoop was developed as a distributed file system that spreads data across multiple computers, because existing relational databases are not suited to processing big data efficiently and quickly.

Hadoop processes the work in multiple clusters by using the method called MapReduce, which acts as an interface for representing the processed data using data visualization tools.

FIG. 2 is a schematic diagram showing physical and virtual experimental environments of a Hadoop cluster. As shown in FIG. 2, the Hadoop system uses the MapReduce framework to process data. Basically, the MapReduce framework consists of one job tracker and a plurality of task trackers.

Developers write map and reduce functions to use MapReduce, and a MapReduce job consists of map, shuffle-sort, and reduce phases. Task trackers run on ordinary machines called slaves, and each slave processes data by executing the user-defined map and reduce functions. The job tracker runs on one machine and handles tasks such as error detection and load balancing for the task trackers.

In the map phase, the input file is divided logically, and each resulting piece is called a split. A split is typically the size of a data chunk, and the job tracker assigns the splits to the slave nodes. Each slave node applies the map function to each key-value record (key1, value1) in a split and maps it to an intermediate result (key2, value2). Intermediate results (key2, value2) are kept in memory. When the memory is full, the results are partitioned, sorted by key, and stored. In some cases, a combine function is applied to perform additional processing on the intermediate results.

In the reduce step, each slave node executes the reduce function, taking as its argument the list of values for one key. The function uses the list of values to generate the final result value value3 and derives the final result pair (key3, value3) for the intermediate key key2.
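The map and reduce steps described above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (`map_fn`, `reduce_fn`, `run_mapreduce`) and the word-count task are assumptions chosen only for illustration, and the in-memory grouping stands in for the shuffle-sort phase.

```python
from collections import defaultdict

def map_fn(key1, value1):
    """Map one (key1, value1) record to intermediate (key2, value2) pairs."""
    # key1 is a line offset (unused here); value1 is the line text.
    return [(word, 1) for word in value1.split()]

def reduce_fn(key2, values):
    """Reduce the list of values for one intermediate key to a final pair."""
    return (key2, sum(values))

def run_mapreduce(split):
    """Apply map, group intermediate results by key, then reduce each group."""
    grouped = defaultdict(list)
    for key1, value1 in split:
        for key2, value2 in map_fn(key1, value1):
            grouped[key2].append(value2)
    # Sorting the keys mimics the sort that precedes the reduce step.
    return [reduce_fn(key2, grouped[key2]) for key2 in sorted(grouped)]

# Hypothetical split: two text lines keyed by line offset.
split = [(0, "a b a"), (1, "b c")]
result = run_mapreduce(split)
```

In a real cluster the splits would be processed on different slave nodes and the grouping would involve network transfer; the sketch only shows the data flow between the two user-defined functions.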

Hereinafter, a Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention will be described step by step.

The scheduler of the Hadoop-based deadline constraint preemption method according to the embodiment of the present invention is a scheduler applicable to the Hadoop system. The scheduler maximizes the number of jobs completed within their deadlines, supports a preemption scheme, and provides an improved preemption method that reduces the overhead.

The Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention includes an initialization step, a scheduling step, a preemption step, and an allocation step.

The following variables are used to implement the scheduling method according to an embodiment of the present invention.

Q.remainingSlots: number of available slots in the queue

Q.totalSlots: total number of slots in the queue

J.minSlots: minimum number of slots requested by job J

J.reqSlots: number of slots required by job J, initialized with J.minSlots

J.allocatedSlots: number of slots already assigned to job J

S(J).startTime: start time of slot S allocated to job J

J.Deadline: deadline of job J

J_o.remainingTime: remaining time of a job already running in the queue

First, step (a) (S100) is the initialization step, in which the scheduler scans a queue to acquire and initialize information on a plurality of jobs. That is, for each new job in the queue, J.minSlots, J.reqSlots, J.allocatedSlots, and the recalculated Q.remainingSlots are scanned and initialized. Here, the scheduler refers to the entity that performs the Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention. In general, job scheduling is performed by the job tracker.
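The initialization step can be sketched roughly as follows. The `Job` class, the `initialize` function, and the rule that the number of remaining slots equals the total minus the slots already allocated are assumptions made for illustration; the source only states that these quantities are scanned and recalculated.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    min_slots: int            # corresponds to J.minSlots
    deadline: float           # corresponds to J.Deadline
    req_slots: int = 0        # corresponds to J.reqSlots
    allocated_slots: int = 0  # corresponds to J.allocatedSlots

def initialize(jobs, total_slots):
    """Set reqSlots from minSlots and recompute the number of free slots."""
    for job in jobs:
        job.req_slots = job.min_slots
    # Assumption: free slots = total slots minus slots already handed out.
    return total_slots - sum(job.allocated_slots for job in jobs)

# Hypothetical queue with one partly served job and one new job.
jobs = [Job("A", 2, 10.0, allocated_slots=1), Job("B", 3, 5.0)]
remaining_slots = initialize(jobs, total_slots=8)
```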

Step (b) (S200) is the scheduling step, in which the scheduler scans the queue, finds available slots, and schedules them. The scheduler checks whether the number of available slots (Q.remainingSlots) is 0, and if it is not, scans the queue to find an available slot. That is, it scans the queue and allocates slots to jobs until all slots are allocated and none remain available. While the number of available slots is still greater than zero, slots are granted to jobs whose number of already allocated slots is less than the number of slots required by the initialized job, until no available slots remain.
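A minimal sketch of this scheduling loop is shown below. Granting one slot per job per scan and representing jobs as dictionaries with `req` and `allocated` keys are assumptions for illustration; the source does not specify how many slots are granted per scan or the order in which jobs are visited.

```python
def schedule(jobs, remaining_slots):
    """Grant free slots to jobs that are still below their required count.

    Returns the number of slots left over after scheduling.
    """
    progress = True
    while remaining_slots > 0 and progress:
        progress = False
        for job in jobs:  # one scan of the queue
            if remaining_slots == 0:
                break
            if job["allocated"] < job["req"]:
                job["allocated"] += 1
                remaining_slots -= 1
                progress = True
    return remaining_slots

# Hypothetical queue: three free slots, two jobs that each want more.
queue = [
    {"name": "A", "req": 2, "allocated": 0},
    {"name": "B", "req": 3, "allocated": 1},
]
left_over = schedule(queue, remaining_slots=3)
```

The `progress` flag stops the loop when every job already holds its required number of slots, so the function also terminates when more slots are free than are needed.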

Step (c) (S300) is the preemption step, in which the scheduler calculates the remaining time until the execution completion time of the job being executed, compares the remaining time with the deadline of the new job according to the order of priority, and determines whether the new job preempts the slot of the currently executing job.

In the preemption step, a job satisfying J.allocatedSlots < J.minSlots is found in the queue, and the remaining execution time of the currently executing job is calculated. If the remaining time is less than the deadline of the new job, the slots allocated to the running job do not need to be preempted, and the new job waits in the queue until the current job completes. In this way, a preemption scheme is applied that allocates slots appropriately in consideration of both the remaining execution time of the currently executing job and the deadline of the new job.

That is, the scheduler scans the queue to find a job for which the number of already allocated slots (J.allocatedSlots) is smaller than the minimum number of slots requested by the job (J.minSlots), and calculates the remaining time of the job being executed. If the remaining time is less than the deadline, the slot allocated to the running job need not be preempted; the new job therefore waits in the queue until execution of the running job completes, which limits preemption and reduces its overhead.

Further, when the job is completed and the slots become free, the scheduler grants the slots to the new job J. The scheduler calculates the number of slots that need to be preempted for job J. This is obtained by subtracting the number of slots already allocated (J.allocatedSlots) from the minimum number of slots requested by the job (J.minSlots).

The scheduler then sorts the slots used for task execution in descending order of slot start time and preempts slots for the job requiring preemption until the number of slots allocated to that job reaches its minimum.

The model used in the scheduling method according to the embodiment of the present invention assumes that at least the minimum number of slots to be allocated to all jobs in the queue exists. That is, every job, whether preempting or preempted, satisfies the condition that the number of slots allocated to it after preemption is equal to or greater than its requested number of slots.
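The preemption decision and the slot selection described in this step can be sketched as below. The function names, the dictionary layout of a slot, and the handling of the boundary case (remaining time exactly equal to the deadline is treated as a case for preemption) are assumptions; the source only states that no preemption occurs when the remaining time is less than the new job's deadline.

```python
def should_preempt(worst_case_time, elapsed_time, new_job_deadline):
    """Return True if the new job should preempt the running job's slots."""
    # Remaining time = worst-case computation time minus time already used.
    remaining = worst_case_time - elapsed_time
    # If the running job finishes before the new job's deadline, wait instead.
    return remaining >= new_job_deadline

def slots_to_preempt(running_slots, min_slots, allocated_slots):
    """Pick the slots to preempt, most recently started first."""
    # Number of slots still missing: J.minSlots - J.allocatedSlots.
    need = max(0, min_slots - allocated_slots)
    newest_first = sorted(
        running_slots, key=lambda slot: slot["start_time"], reverse=True
    )
    return newest_first[:need]

# Hypothetical running slots with their recorded start times.
slots = [
    {"id": 1, "start_time": 5},
    {"id": 2, "start_time": 9},
    {"id": 3, "start_time": 7},
]
chosen = slots_to_preempt(slots, min_slots=3, allocated_slots=1)
```

Preempting the most recently started slots first means the least amount of completed work is discarded, which is one plausible reading of why the source sorts by descending start time.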

Here, preemptive scheduling refers to a scheduling scheme in which, even while one process has been allocated the CPU and is executing, another process can stop it and forcibly take over the CPU. In other words, preemptive scheduling allows processes that use the CPU for a long time with little input/output to run concurrently with other processes, and it thereby forms the basis of multiprogramming.

In addition, the preemption method is mainly adopted in time-sharing systems that require fast response. This approach has the advantage that high-priority processes needing urgent handling can be processed quickly, but context switching between processes can occur frequently, which can increase the overhead of the operating system.

Therefore, to address the deadline constraint, the present invention proposes a scheduling method that introduces a policy allowing a slot to be preempted from a recently started task and assigned to another task. It thereby provides a new Hadoop preemptive scheduling method that supports preemption, reduces the overhead of preemption, and maximizes the number of jobs completed.

Step (d) (S400) is the allocation step, in which the scheduler allocates slots to jobs according to the preemption determination. That is, slots are assigned to tasks according to the result of the preemption determination. New jobs are placed in the wait queue in deadline order. In addition, the start time of each slot allocated to a job is recorded in S(J).startTime, and the preempted jobs are suspended.

FIG. 3 is a diagram illustrating an algorithm for performing the Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention. As shown in FIG. 3 and as described above, the scheduling method proceeds through a flow comprising the initialization step, the scheduling step, the preemption step, and the allocation step. The algorithm illustrated in FIG. 3 exemplifies the above-described step-by-step process.
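A wait queue ordered by deadline, as described for the allocation step, could be kept with a binary heap. The `WaitQueue` class below is a hypothetical helper, not part of Hadoop; breaking ties by insertion order is an assumption, since the source does not say how jobs with equal deadlines are ordered.

```python
import heapq

class WaitQueue:
    """Queue of waiting jobs ordered by deadline, earliest first."""

    def __init__(self):
        self._heap = []
        self._count = 0  # tie-breaker: equal deadlines keep insertion order

    def push(self, deadline, job_name):
        heapq.heappush(self._heap, (deadline, self._count, job_name))
        self._count += 1

    def pop(self):
        """Remove and return the waiting job with the earliest deadline."""
        _deadline, _order, job_name = heapq.heappop(self._heap)
        return job_name

    def __len__(self):
        return len(self._heap)

# Hypothetical new jobs that could not be given slots immediately.
waiting = WaitQueue()
waiting.push(30, "A")
waiting.push(10, "B")
waiting.push(10, "C")
waiting.push(20, "D")
order = [waiting.pop() for _ in range(len(waiting))]
```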

Hereinafter, a mathematical model of the Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention will be described.

The variables used in the mathematical model of the Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention are defined as follows.

n: total number of tasks in the task set

τ_i: identifier of task i

C_i: worst-case computing time of τ_i

T_i: period of τ_i

D_i: deadline of τ_i

P_i: priority of τ_i (a smaller value means a higher priority)

W_i: cumulative latency of τ_i in the ready queue

A_i: arrival time of τ_i

R_i: time consumed to execute τ_i

Relative measurement of preemption method

Let τ_r be the currently executing task, and suppose a newly arrived task has a higher priority than τ_r. If the new task does not preempt τ_r, it must wait for the (C_r - R_r) time units that τ_r still needs. Equation (1) below gives the condition under which all tasks with a higher priority than τ_r still meet their deadlines. Q in Equation (1) denotes the ready queue.

Task Execution time estimation

When a job enters the queue, the size of its input data is S_m, which is expressed by Equation (2) below.

α_i is the size of the i-th map task (Equation 3).

In this model, the present invention considers two types of deadline-based jobs: high-priority jobs and low-priority jobs. The model uses the minimum number of map slots and the minimum number of reduce slots, and C_m and C_r represent the cost per data unit of the map and reduce tasks, respectively.

The time of completion of the task is shown in Equation (4) below.

Mathematical model

We consider a preemptive M/G/1 queue in which jobs of different priority classes arrive according to Poisson processes, with class-k jobs arriving at rate λ_k. In the present invention, it is assumed that the service times and the arrival times are independent of each other.

ρ_k = λ_k / μ_k is the traffic intensity for class k (a measure of the average server or resource usage over a specific time period), where λ_k and μ_k denote the arrival rate and the service rate, respectively. The purpose of this modeling is to calculate the expected residence time, the expected waiting time, and the expected number of jobs of each class in the system. The arrivals of each job class follow a Poisson distribution, and the service times of requests are independent of each other.

Here are the variables used for modeling.

W_k: time that class-k jobs spend waiting.

X(t): number of jobs in the queue at time t.

R(t): remaining service time of the job in the server at time t.

t_n: arrival time of the n-th job.

σ_k: service time of class-k jobs, with E[σ_k] = 1/μ_k.

N_k: number of class-k jobs in the system (in the queue and in the server).

T_k: expected response time of class-k jobs.

The waiting time of jobs in the queue can be obtained by applying Little's formula (Equation 5).
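Equation 5 itself is not reproduced in this text. For reference, the standard statement of Little's law, written with the symbols defined above (arrival rate λ_k, service rate μ_k, number of jobs N_k, response time T_k, waiting time W_k), is sketched below. Splitting the response time into waiting time plus mean service time is an assumption about how the patent relates W_k and T_k; it is not taken from the missing equation.

```latex
% Little's law for class k: mean number in system = arrival rate x mean response time
N_k = \lambda_k \, T_k
% Assumed decomposition of response time into waiting and service
T_k = W_k + E[\sigma_k] = W_k + \frac{1}{\mu_k}
% Hence the mean waiting time follows from N_k and the rates
W_k = \frac{N_k}{\lambda_k} - \frac{1}{\mu_k}
```

As a numeric illustration with made-up values, λ_k = 2 jobs per minute, N_k = 6 jobs, and μ_k = 1 job per minute give T_k = 3 minutes and W_k = 2 minutes.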

Hereinafter, computer simulation results using the Hadoop preemptive deadline constraint scheduling method according to an embodiment of the present invention will be described. The results of the Hadoop Preemptive Deadline Constraint Scheduler according to the embodiment of the present invention through computer simulation show that the performance is improved by reducing the completion time and maximizing the utilization of the time slot.

Performance evaluation

The performance evaluation environment of the Hadoop preemptive deadline constraint scheduling method according to the embodiment of the present invention is implemented using Hadoop, an open-source implementation of the Google File System and Google MapReduce that was developed at Yahoo for cloud-based environments. Hadoop's master node runs the job tracker, which performs job scheduling and assigns tasks to the task trackers that execute the map and reduce functions.

The experiments according to embodiments of the present invention use two different clusters: a virtual cluster and a physical cluster. The virtual cluster consists of a single physical node hosting three task trackers (TaskTracker) and a JobTracker on the host system. Each node runs with 4GB of memory, a 64-bit Intel dual-core 2.93GHz processor, Ubuntu 12.4, and Hadoop 1.2.1.

The physical cluster consists of four task trackers (TaskTracker) and one job tracker (JobTracker), and runs with two map slots and two reduce slots. The Hadoop Distributed File System has a block size of 64MB, and the measured network capacity is 12MB/s.

FIG. 4 is a graph comparing the average execution time of the Hadoop preemptive deadline constraint scheduling method (HPDCS) according to an embodiment of the present invention with that of the conventional PDCS. As shown in FIG. 4, the HPDCS according to the embodiment of the present invention reduces the average execution time by 16.8%, 15%, 13.5%, 12.3%, and 11.2% compared with the conventional PDCS for five different workloads. This shows that the HPDCS proposed in the present invention has higher preemption performance than the conventional PDCS.

FIG. 5 is a graph comparing the job completion time of the HPDCS according to an embodiment of the present invention with that of the conventional PDCS. As shown in FIG. 5, the job completion time of the HPDCS according to the embodiment of the present invention is shorter than that of the conventional PDCS. Despite the deadline constraints, its performance is consistently better than that of the conventional PDCS.

In particular, as the deadline increases, the job completion time in the conventional PDCS increases while the number of allocated slots decreases, whereas the HPDCS retains its advantage.

FIG. 6 is a graph comparing the average waiting time of the HPDCS according to an embodiment of the present invention with that of the conventional PDCS. As shown in FIG. 6, regardless of the number of jobs, the average waiting time of the HPDCS according to the embodiment of the present invention is about 19% lower than that of the conventional PDCS.

That is, the performance gain of the HPDCS according to the embodiment of the present invention clearly grows from 10% to 28% as the workload increases and the number of jobs rises from 200 to 1600. Hence, slot requests from small jobs can be met on time, which reduces the waiting time of resumed tasks.

As a result, it can be seen that the HPDCS according to the embodiment of the present invention has room for performance improvement, and no noticeable scheduling overhead appears in the job tracker even though the number of jobs increases.

FIG. 7 is a graph comparing the total waiting time of the HPDCS according to an embodiment of the present invention with that of the conventional PDCS. As shown in FIG. 7, the HPDCS according to the embodiment of the present invention shows a lower waiting time than the PDCS regardless of the number of jobs.

The embodiments and the accompanying drawings described in the present specification are merely illustrative of some of the technical ideas included in the present invention. Accordingly, the embodiments disclosed herein are for the purpose of describing rather than limiting the technical spirit of the present invention, and it is apparent that the scope of the technical idea of the present invention is not limited by these embodiments. It will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A scheduling method applied to a Hadoop system, the method comprising:
(a) a scheduler scanning a queue to acquire and initialize information on a plurality of jobs;
(b) the scheduler scanning the queue to find and schedule available slots;
(c) the scheduler calculating a remaining time until the execution completion time of a job being executed, comparing the remaining time with the deadline period of a new job according to the order of priority, and determining whether the new job preempts a slot of the job being executed; and
(d) the scheduler allocating a slot to a job according to the preemption determination,
wherein, in the step (c),
a priority is assigned to each job, a higher priority being given to a job having a shorter deadline period, and whether to preempt is determined according to the priority,
the scheduler does not preempt the already allocated slot for the new job when the remaining time is shorter than the deadline period of the new job, and
the new job is assigned to the slot in the free state when the already allocated slot becomes free upon completion of execution of the job.

2. The method according to claim 1,
wherein the information on the job includes at least the minimum number of slots requested by the job, the number of slots required by the job initialized with the minimum number, the number of slots already allocated to the job, and the recalculated number of available slots in the queue.

3. The method of claim 2,
wherein, in the step (b),
the scheduler scans the queue until there is no available slot.

4. The method according to claim 1,
wherein the step (b) comprises
assigning slots, until there is no available slot, to a job for which the number of slots already allocated to the job is smaller than the minimum number of slots requested by the job.

5. (Deleted)

6. The method according to claim 1,
wherein, in the step (c),
the remaining time is calculated by subtracting the currently elapsed calculation period from the maximum calculation period of the job being executed.

7. The method of claim 6,
wherein the maximum calculation period is
the execution calculation period set for the worst case of the job being executed.

8. (Deleted)

9. The method according to claim 1,
wherein, in the step (d),
new jobs that are not allocated slots wait in the queue in deadline order.

10. The method according to claim 1,
wherein, in the scheduling method, at least the minimum number of slots to be allocated to all jobs in the queue exists.

11. A computer program stored on a medium for executing, in combination with hardware, the Hadoop preemptive deadline constraint scheduling method of claim 1.

12. A computer-readable medium having recorded thereon a program for causing a computer to execute the Hadoop preemptive deadline constraint scheduling method of claim 1.