CN107038069B - Dynamic label matching DLMS scheduling method under Hadoop platform - Google Patents


Info

Publication number
CN107038069B
Authority
CN
China
Prior art keywords
task
node
label
nodes
classification
Prior art date
Legal status
Expired - Fee Related
Application number
CN201710181055.0A
Other languages
Chinese (zh)
Other versions
CN107038069A (en)
Inventor
毛韦
竹翠
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710181055.0A priority Critical patent/CN107038069B/en
Publication of CN107038069A publication Critical patent/CN107038069A/en
Application granted granted Critical
Publication of CN107038069B publication Critical patent/CN107038069B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network

Abstract

The invention discloses a dynamic label matching (DLMS) scheduling method under the Hadoop platform, belonging to the field of computer software. To address the large performance differences among Hadoop cluster nodes, the randomness of resource allocation, and overlong job execution times, a scheduler is provided that dynamically matches node performance labels (hereinafter node labels) with job type labels (hereinafter job labels). Nodes are initially classified and given original node labels; each node then detects its own performance indexes to generate a dynamic node label. Jobs are classified according to partial running information to generate job labels, and the resource scheduler allocates node resources to jobs carrying the corresponding label. Experimental results show that job execution time is greatly shortened compared with the schedulers built into YARN.

Description

Dynamic label matching DLMS scheduling method under Hadoop platform
Technical Field
The invention belongs to the field of computer software, and relates to design and implementation of a dynamic label matching DLMS scheduling method based on a Hadoop platform.
Background
In early Hadoop versions, resource scheduling management and the MapReduce framework were integrated in one module, so the code was poorly decoupled, could not be extended well, and did not support multiple computing frameworks. The Hadoop open-source community therefore designed a new-generation Hadoop system with a completely new architecture, Hadoop 2.0, which extracts resource scheduling into a new framework, YARN. A scheduling algorithm suited to a given environment can satisfy users' job requests while effectively improving the overall performance of the Hadoop platform and the resource utilization of the system. YARN provides three schedulers by default: the first-in-first-out scheduler (FIFO), the Fair Scheduler, and the Capacity Scheduler. Hadoop defaults to the FIFO scheduler, whose first-in-first-out strategy is simple and easy to implement, but it is unfavorable to short jobs and supports neither shared clusters nor multi-user management. The fair scheduling algorithm proposed by Facebook considers the differences between users and the resource requirements of jobs and lets users share cluster resources fairly, but its job resource configuration strategy is not flexible enough, easily wastes resources, and does not support job preemption. The capacity scheduling algorithm proposed by Yahoo supports multiple users sharing multiple queues and is flexible in computing capacity, but it does not support job preemption and easily falls into local optima.
However, in actual enterprise production, as data volumes grow, new nodes are added to the cluster every year, so the performance of cluster nodes differs significantly and heterogeneous clusters are common. If a machine learning task with a large amount of computation is assigned to a node with weak CPU capability, the overall execution time of the job is obviously affected. The invention provides a resource scheduling method (DLMS) that dynamically matches node performance with job category labels: a machine with better CPU performance carries a CPU label, a machine with better disk IO performance carries an IO label, and ordinary machines carry a common label. According to its classification, a job likewise carries a CPU, IO, or common label and enters the corresponding label queue, and the scheduler allocates the resources of nodes with a given label to jobs with the same label as far as possible. This reduces job running time, improves the resource utilization of the system, and improves overall system efficiency.
Disclosure of Invention
The scheduling method provided by the invention initially classifies the cluster nodes and gives them corresponding labels. Before sending a heartbeat, the NodeManager performs self-detection and dynamically adjusts its original label. Jobs are classified with a machine learning classification algorithm and given corresponding labels; jobs are dynamically ordered according to attributes such as job priority and waiting time set by the user, and resources of a given label are allocated to the jobs in the corresponding label queue.
The scheduling method provided by the invention mainly comprises the following modules:
(1) original classification of cluster nodes and dynamic classification label thereof
The cluster nodes first need to be initially classified according to the performance of their CPUs and disk IO. Each node in the cluster independently runs a task of each specified type and records the time taken; according to the relation between a node's time for a single task and the average running time of all nodes in the cluster, the nodes are divided into CPU-type nodes, disk-IO-type nodes, and common nodes.
During cluster operation, if running some jobs overloads a node, the node's label is downgraded directly to the common label. Suppose a node's initial label is CPU-type and a CPU-type task is running on it: although part of the node's resources remain unused, the node has at that moment lost its CPU performance advantage. To handle this situation a dynamic label method is adopted: when the NodeManager sends a heartbeat to the ResourceManager, the CPU and IO utilization of the node machine are detected, and if the utilization exceeds a threshold the node is given the common label. This detection is repeated on every heartbeat, realizing dynamic node labels. The threshold can be set in a configuration file; if the user does not configure it, a system default is used.
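The per-heartbeat self-detection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the 0.8 thresholds stand in for the configurable values mentioned above.

```python
def dynamic_label(original_label, cpu_util, io_util,
                  cpu_threshold=0.8, io_threshold=0.8):
    """Re-evaluate a node's label before each heartbeat.

    If the node's CPU or disk-IO utilization exceeds its threshold,
    the node has temporarily lost its performance advantage and is
    downgraded to the 'common' label; otherwise the original label
    from the initial classification is kept.
    """
    if original_label == "cpu" and cpu_util > cpu_threshold:
        return "common"
    if original_label == "io" and io_util > io_threshold:
        return "common"
    return original_label
```

A heavily loaded CPU-label node thus reports itself as common for that heartbeat, and recovers its CPU label automatically once the load drops.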
(2) Obtaining and returning Map execution information
A Hadoop job is generally divided into a Map phase and a Reduce phase. A large job usually has hundreds of Maps or more, and most of a job's time is spent on Map-phase computation, but every Map has identical execution logic. Therefore, the running information of the first Map process of the job is collected and passed to the scheduler when the NodeManager sends a heartbeat to the ResourceManager, and the scheduler classifies the job according to this information.
In an enterprise production environment, jobs with the same logic run every day, so the user often already knows which label a job should carry. A job type label can be set on the command line or in code; the scheduler checks it during scheduling, and if the user has labelled the job, the classification step is skipped and the job is scheduled directly.
(3) Multi-priority queue
To meet the requirements of different users and prevent starvation of small jobs, a job priority scheme is adopted. Five queues are newly built in the scheduler: an original queue, a waiting priority queue, a CPU priority queue, an IO priority queue, and a common priority queue. When a user submits a job, part of its Maps are first run and their running information is collected; the job then enters the waiting priority queue until the Map information is returned and the job is classified; finally the job enters the queue corresponding to its label.
(4) Job classification
Data needs to be preprocessed before classification. Data preprocessing techniques were developed to improve the quality of data mining; common methods include data cleaning, data integration, data transformation, and data reduction. Applying them before mining greatly improves the quality of the mining model and reduces the time required for the actual mining. Here preprocessing mainly means data normalization: each variable is linearly transformed to a new scale on which its minimum value is 0 and its maximum value is 1, ensuring all variable data are less than or equal to 1.
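The normalization described above is standard min-max scaling; a minimal sketch (the constant-variable fallback is an assumption not stated in the text):

```python
def min_max_normalize(values):
    """Linearly rescale a variable's samples to [0, 1]:
    x' = (x - min) / (max - min), so the smallest value maps to 0
    and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant variable: map everything to 0
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]
```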
For job classification, a naive Bayes classifier, which is simple, widely used, and effective, is selected. If the user has already specified the job type on the command line or in the task code, this step is skipped and the job directly enters the corresponding queue to wait for resource allocation.
(5) Data locality
One principle followed in Hadoop is that "moving computation is cheaper than moving data": moving the computation to the node that holds the data costs less and performs better than moving the data to the compute node. For data locality the invention adopts a delayed degradation scheduling strategy.
The beneficial effects are as follows:
1. For heterogeneous cluster environments, the invention provides a dynamic label matching scheduling method: nodes and jobs are classified, job priorities are computed from job characteristics and the attributes of the submitting user, resources are matched to jobs of the same type during allocation, and, considering the relation between node performance and the current task load, node labels are adjusted dynamically through self-detection. Finally, the performance of the algorithm is comparatively analyzed through experiments.
2. For the data locality problem, the invention provides a delayed degradation algorithm. Degradation has three levels: the current local node, the local rack node, and a random node; within a certain delay time, the scheduler waits before lowering the locality level, which improves data locality.
3. The invention adopts a dynamic label method: different types of jobs are first run in advance and nodes are classified by comparing each node's running time with the average time over all nodes of the cluster; each node then self-detects its performance according to the load of its running tasks and generates a corresponding new label.
4. The invention proposes classifying jobs: since the Map parts of a MapReduce job share the same processing logic, a job can be classified according to the information from the part of it executed in advance.
Drawings
FIG. 1 is a flowchart of an overall framework for job scheduling;
FIG. 2 is a flow chart of a scheduling algorithm;
FIG. 3 is a comparison graph of the total running time of three jobs under different scheduling algorithms;
FIG. 4 is a graph of Container distribution quantity under 500M data quantity under DLMS;
FIG. 5 is a graph of Container distribution quantity under 1G data quantity under DLMS;
FIG. 6 is a graph of Container distribution quantity under 1.5G data quantity under DLMS;
FIG. 7 is a run time comparison graph of a job group under different scheduling algorithms;
Detailed Description
In order to make the objects, technical solutions and features of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. The YARN scheduling framework is shown in fig. 1.
The individual steps are explained below:
(1) The user submits an application to YARN, including the user program and the command for starting the ApplicationMaster.
(2) The ResourceManager assigns the first Container to the application and communicates with the corresponding NodeManager, asking it to start the application's ApplicationMaster.
(3) After the ApplicationMaster registers with the ResourceManager, it applies for resources for each task and monitors the tasks' running state until the run finishes.
(4) Before sending a heartbeat, the NodeManager performs self-detection to generate a dynamic node label and reports its resources to the ResourceManager.
(5) Tasks are classified into the different label queues and ordered by priority to wait for resource allocation.
(6) The ApplicationMaster applies for and obtains resources from the ResourceManager via an RPC protocol.
(7) According to the node label and resources reported by the NodeManager, the scheduler allocates the node's resources to jobs in the corresponding label queue.
(8) After obtaining resources, the ApplicationMaster communicates with the corresponding NodeManager and asks it to start the task.
(9) After the NodeManager sets up the task's running environment (environment variables, JAR packages, binary programs, and so on), it writes the task start command into a script and starts the task by running the script.
(10) Each task reports its state and progress to the ApplicationMaster through an RPC protocol, so the ApplicationMaster can restart a task when it fails.
(11) After the application finishes running, the ApplicationMaster unregisters from the ResourceManager and shuts itself down.
Firstly, initially classifying cluster physical nodes, wherein the classification method comprises the following steps:
(1) Let the set of cluster machine nodes be N = {Ni | i ∈ [1, n]}, where n is the total number of nodes, i is a positive integer starting from 1, and Ni represents the i-th physical machine in the cluster.
(2) Execute one CPU-type, one IO-type, and one common-type job with the same task amount on each node and record the execution times: Tcpu(i) is the time taken to execute the CPU job on node Ni; Tio(i) is the time taken to execute the IO job on node Ni; Tcom(i) is the time taken to execute the common job on node Ni.
(3) Calculate the cluster average time of each job type, using the formula:
Avgj = (1/n) * Σ_{i=1..n} Tj(i),  j ∈ {cpu, io, com}
where j represents the job type. The difference between each node's time and the cluster average for that job type is then computed: if Tcpu(i) < Avgcpu, the node is given the CPU-type original label; if Tcpu(i) > Avgcpu, the node is given the common original label. IO-type labels are obtained in the same way from Tio(i) and Avgio. After comparison a node may hold several labels; the label with the largest time saving is selected as the node's final label.
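The comparison rule above can be sketched as follows. The function name, the dictionary representation of the benchmark times, and the use of relative saving as the tie-breaker are illustrative assumptions; the patent only specifies comparing against the averages and keeping the most time-saving label.

```python
def classify_node(times, avg_times):
    """Assign an original label to one node.

    times     : {'cpu': T, 'io': T, 'com': T} benchmark times on this node
    avg_times : cluster-wide average time for each job type
    A label qualifies when the node beats the cluster average for that
    job type; among qualifying labels, the one with the largest relative
    time saving wins. If none qualify, the node is labelled 'common'.
    """
    savings = {j: (avg_times[j] - times[j]) / avg_times[j]
               for j in times if times[j] < avg_times[j]}
    if not savings:
        return "common"
    best = max(savings, key=savings.get)
    return best if best != "com" else "common"
```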
Let M be the set of Map running information that needs to be collected: M = {MIn, MOut, Rate, Act, Mcpu, Zcpu, Mrate}, where MIn is the Map input data amount, MOut the Map output data amount, Rate the ratio of input to output data amount, Act the average CPU usage, Mcpu the median CPU usage, Zcpu the number of times CPU usage exceeds 90%, and Mrate the memory usage. These later become the feature attributes for job classification. During the experiments it was found that simply computing the average CPU time cannot reflect job characteristics well: CPU-type jobs exceed 90% CPU utilization many times, while other job types do so relatively rarely, so this count is also added to the information returned by the Map.
A user-defined two-layer weight design is adopted for queue priority. The weight of the job size attribute is worthNum, with three levels: long, middle, and short. The weight of the job owner attribute is worthUser, with two levels: user and root. The weight of the job urgency is worthEmergence, with three levels: high, normal, and low priority. The weight of the waiting time is worthWait, where the waiting time is computed as waitTime = nowTime − submitTime. The priority number of each task is calculated and the tasks are ordered in the corresponding queue. The four attribute weights sum to 100%:
worthNum + worthUser + worthEmergence + worthWait = 100%
The final weight calculation formula is:
finalWorth = worthNum*num + worthUser*user + worthEmergence*priority + worthWait*waitTime
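The weighted sum above can be sketched as follows; the concrete weight values and the numeric encoding of the level attributes are illustrative assumptions, since the patent leaves them to user configuration.

```python
# Illustrative weights for the four attributes; they must sum to 1
# (i.e. 100%). The real values are user-configured.
WEIGHTS = {"num": 0.30, "user": 0.20, "emergence": 0.30, "wait": 0.20}

def final_worth(num, user, priority, wait_time, weights=WEIGHTS):
    """finalWorth = worthNum*num + worthUser*user
                  + worthEmergence*priority + worthWait*waitTime
    where num, user, priority are numeric scores for the job's size,
    owner, and urgency levels, and wait_time is nowTime - submitTime."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return (weights["num"] * num + weights["user"] * user
            + weights["emergence"] * priority
            + weights["wait"] * wait_time)
```

Jobs in a queue would then be sorted by descending `final_worth`, so a long-waiting job gradually overtakes newer ones and starvation is avoided.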
For job classification a naive Bayes classifier is adopted; the specific classification steps are as follows:
(1) Calculate the conditional probability that a job is a CPU, IO, or common type job given its features:
P(job=cpu|V1,V2...Vn)
P(job=io|V1,V2...Vn)
P(job=com|V1,V2...Vn)
where job ∈ {cpu, io, com} is the job category label and Vi is a feature attribute of the job.
(2) According to the Bayes formula P(B|A) = P(AB)/P(A):
P(job=cpu|V1,V2...Vn) = P(V1,V2...Vn|job=cpu)P(job=cpu) / P(V1,V2...Vn)
Assuming the features Vi are mutually independent, the independence assumption gives
P(V1,V2...Vn|job=cpu) = Π_{i=1..n} P(Vi|job=cpu)
(3) In the actual calculation, P(V1,V2,...,Vn) is the same for every job type and can be ignored, giving
P(job=cpu|V1,V2...Vn) ∝ P(job=cpu) Π_{i=1..n} P(Vi|job=cpu)
and similarly
P(job=io|V1,V2...Vn) ∝ P(job=io) Π_{i=1..n} P(Vi|job=io)
P(job=com|V1,V2...Vn) ∝ P(job=com) Π_{i=1..n} P(Vi|job=com)
Whether the job is a CPU-type, IO-type, or common-type job is determined by which of these probability values is largest.
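The argmax rule above can be sketched with a toy categorical naive Bayes classifier. This is a simplified illustration under assumed discretized features (the patent's features are continuous measurements, which would first be normalized and binned); the Laplace smoothing is also an added assumption.

```python
from collections import Counter, defaultdict

def train_nb(samples):
    """samples: list of (feature_tuple, label).
    Returns class priors, per-position conditional counts, and n."""
    priors = Counter(lbl for _, lbl in samples)
    cond = defaultdict(Counter)        # (position, label) -> value counts
    for feats, lbl in samples:
        for i, v in enumerate(feats):
            cond[(i, lbl)][v] += 1
    return priors, cond, len(samples)

def predict_nb(model, feats):
    """Pick the label maximizing P(label) * prod_i P(V_i | label);
    the shared denominator P(V_1..V_n) is ignored, as in the text."""
    priors, cond, n = model
    best, best_score = None, -1.0
    for lbl, cnt in priors.items():
        score = cnt / n                             # prior P(job=lbl)
        for i, v in enumerate(feats):
            c = cond[(i, lbl)]
            score *= (c[v] + 1) / (sum(c.values()) + 2)  # Laplace smoothing
        if score > best_score:
            best, best_score = lbl, score
    return best
```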
For data locality, a delayed degradation scheduling strategy is adopted. The specific idea of the strategy is as follows:
A delay time attribute is added to each job. Let Ti be the current delay count of the i-th job, i ∈ [1, n]; Tlocal denotes the local-node delay threshold and Track the rack-node delay threshold. When the scheduler allocates resources to a job, if the node offering resources is not the node holding the job's input data, Ti is incremented by 1, meaning the job has been delayed once, and the resource is allocated to another suitable job. When Ti > Tlocal, the job's locality is lowered to rack locality, and nodes within the rack may allocate resources to it; when Ti > Track, the job's locality is lowered to a random node. Tlocal and Track are set in a configuration file, configured by the user according to the cluster conditions. The delayed scheduling strategy ensures that better locality is obtained within a certain delay time.
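The delayed degradation rule above can be sketched as follows; the numeric locality levels, function names, and default thresholds are illustrative assumptions (the patent reads Tlocal and Track from a configuration file).

```python
LOCAL, RACK, ANY = 0, 1, 2   # locality levels, best to worst

def allowed_locality(delay_count, t_local, t_rack):
    """Map a job's accumulated delay count to the worst locality level
    the scheduler may currently accept: start node-local; after more
    than t_local skipped offers fall back to rack-local; after more
    than t_rack accept any node."""
    if delay_count <= t_local:
        return LOCAL
    if delay_count <= t_rack:
        return RACK
    return ANY

def try_assign(job_delay, offer_level, t_local=3, t_rack=6):
    """Return (assigned, new_delay): assign if the offered node's
    locality level is within what the job may accept; otherwise skip
    the offer and increment the job's delay counter."""
    if offer_level <= allowed_locality(job_delay, t_local, t_rack):
        return True, job_delay
    return False, job_delay + 1
```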
The basic idea of the DLMS scheduling method is to run part of each job in advance, classify the job according to the information it returns, and then allocate the resources of labelled nodes to tasks in the corresponding queue. The basic process is as follows:
Step 1: when a node reports resources to the ResourceManager through a heartbeat, if the original queue is not empty, the jobs in it are traversed; any job whose type label was specified on the command line or in the program is moved to the corresponding label priority queue and removed from the original queue.
Step 2: jobs in the original queue without a specified type label are moved to the waiting queue.
Step 3: if the waiting priority queue is not empty, the jobs in it are classified into the corresponding label priority queues.
Step 4: if the job queue corresponding to the node's performance label is not empty, the node's resources are allocated to that queue and the allocation ends.
Step 5: a counter records how many times a node's resources have been offered without being assigned; if it exceeds the number of nodes in the cluster, the node's resources are allocated to the queues in the order CPU, IO, common, waiting, and scheduling ends. This step prevents the situation in which CPU-type node resources are exhausted by too many jobs in the CPU queue while nodes with other labels still have resources that jobs cannot obtain. The flow chart of the algorithm is shown in fig. 2.
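The five steps above can be sketched as a single heartbeat-handling routine. This is a simplified sketch under assumptions about the job representation, the classifier interface, and the starvation counter, not the patent's implementation.

```python
from collections import deque

def on_heartbeat(node_label, queues, classify, misses, n_nodes):
    """One DLMS allocation round when a node heartbeats.

    queues : {'original','waiting','cpu','io','common'} -> deque of jobs,
             where a job is (name, user_label_or_None, features).
    classify(features) -> 'cpu' | 'io' | 'common' (e.g. naive Bayes).
    misses : times this node's offer went unused (starvation guard).
    Returns (job_or_None, new_misses).
    """
    # Steps 1-2: route jobs out of the original queue.
    while queues["original"]:
        name, user_label, feats = queues["original"].popleft()
        dest = user_label if user_label else "waiting"
        queues[dest].append((name, user_label, feats))
    # Step 3: classify waiting jobs into the label queues.
    while queues["waiting"]:
        name, user_label, feats = queues["waiting"].popleft()
        queues[classify(feats)].append((name, user_label, feats))
    # Step 4: prefer the queue matching this node's label.
    if queues[node_label]:
        return queues[node_label].popleft(), 0
    # Step 5: after too many unused offers, serve any queue in the
    # fixed order cpu, io, common so that no label queue starves.
    if misses + 1 > n_nodes:
        for q in ("cpu", "io", "common"):
            if queues[q]:
                return queues[q].popleft(), 0
    return None, misses + 1
```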
Experimental Environment
This section verifies the actual effect of the proposed DLMS scheduler experimentally. The environment is a fully distributed Hadoop cluster built from 5 PCs, each uniformly configured with Ubuntu 12.04.1, JDK 1.6, Hadoop 2.5.1, 2 GB of memory, and a 50 GB hard disk. The NameNode has 2 CPU cores; DataNode1 has 2, DataNode2 has 4, DataNode3 has 2, and DataNode4 has 4.
Results and description of the experiments
First, a WordCount (IO-type) job and a Kmeans (CPU-type) job with a data size of 128M are prepared; each job is run 6 times on each of the 4 nodes and the running times are recorded. In Table 1, s is the unit of time, avg is the node's average time for the corresponding label task, and avavg is the overall average time of all nodes for that task. The rate is calculated as:
rate = (avg − avavg) / avavg × 100%
A negative sign indicates the node's average time is below the overall average, and a positive sign indicates it is above.
From Table 1 it can be seen that DataNode1 saves time on both tasks; the label with the larger saving, CPU, is taken as the machine's original label. DataNode2 receives the IO label, and DataNode3 and DataNode4 are common machines.
Table 1 original classification experimental table
Results of the experiments and analysis thereof
Several jobs whose types are clearly distinguishable are used. WordCount reads a large amount of data and writes much intermediate data in the Map phase, with essentially no arithmetic in either the Map or Reduce phase, so it is characterized as an IO-type job. Kmeans computes distances between many points in both the Map and Reduce phases and writes little intermediate data, so it is characterized as a CPU-type job. TopK writes no large amount of data to disk in the Reduce phase and performs no heavy computation, only simple comparisons, so it is treated as a common task.
Verification is carried out through two groups of experiments. In the first group, the scheduler is set to FIFO, and the WordCount, Kmeans, and TopK jobs are each run 3 times with data volumes of 500M, 1G, and 1.5G; the average of the 3 runs is recorded as the final time. The scheduler is then switched and the same experiment is repeated for the Capacity and DLMS schedulers. For the DLMS scheduler, the distribution of each job's Containers across the cluster is also recorded: a Container is the unit into which cluster resources are divided, and each Map and Reduce process in YARN is represented by one Container, so a Container's share at a node indicates the share of the job's tasks executed there. The abscissa of fig. 3 is the job's data size and the ordinate is the total time for running the three jobs WordCount, Kmeans, and TopK together. As the data volume increases, the DLMS scheduler saves about 10-20% of the time compared with the other schedulers.
Since DLMS allocates the resources of nodes with a given label to jobs with that label, and the Maps and Reduces of a job run on nodes in the form of Containers, figs. 4 to 6 show the Container counts of jobs with different data volumes under the DLMS scheduler. According to the original classification in the previous section, Node1 is a CPU-label node, Node2 and Node3 are common-label nodes, and Node4 is an IO-label node; WordCount is an IO-type job, TopK a common-type job, and Kmeans a CPU-type job. The figures show that WordCount places more Containers on Node4, TopK places more on the common nodes Node2 and Node3, and Kmeans places more on Node1. This distribution of the Containers of different jobs across the cluster nodes shows that the DLMS scheduler raises the probability that the resources of a labelled node are allocated to jobs with the corresponding label.
In the second group of experiments, 5 jobs form one job group: WordCount jobs of 128M and 500M, Kmeans jobs of 128M and 500M, and a TopK job of 500M. The 5 jobs are submitted and run simultaneously to simulate continuous job execution under the different schedulers, and the total time for the group to finish is recorded; the group is run 3 times under each scheduler. The results are shown in fig. 7: the time saved by the proposed DLMS scheduler over Hadoop's built-in schedulers executing the same job group is apparent, with DLMS saving about 20% of the runtime compared with the FIFO scheduler and about 10% compared with the Capacity scheduler.

Claims (1)

  1. The DLMS scheduling method for dynamic label matching under the Hadoop platform is characterized by comprising the following steps:
    original classification of cluster nodes and dynamic classification labels thereof;
    firstly, the cluster nodes need to be originally classified according to the performance of their CPUs (central processing units) and disk IO (input/output); each node in the cluster independently runs a task of a specified type and records the time taken, and according to the relation between this time and the average running time of all the nodes in the cluster, the nodes are divided into CPU-type nodes, disk-IO-type nodes and normal-type nodes;
    in the running process of the cluster nodes, if running some tasks overloads a node, the node's label is degraded directly to a common node; when a node's initial label is a CPU-type label and a CPU-type task runs on it, a dynamic label method is adopted: when the NodeManager sends a heartbeat to the ResourceManager, the CPU and IO utilization of the node machine are dynamically detected, and if the utilization exceeds a threshold, the node is given a common label; detection is performed once each time a heartbeat is sent, realizing the dynamic node label; the threshold is configured in a configuration file, and if not configured by the user, the system default value is used;
    (1) obtaining and transmitting Map process running information
    the running information of the first Map process of a task is collected; this running information is transmitted to the scheduler when the NodeManager sends a heartbeat to the ResourceManager, and the scheduler classifies the task according to it;
    if the user already knows the type of the task, a task type label is set for the task on the command line or in code; the scheduler checks for this label during scheduling, and if the user has labeled the task, the classification step is skipped and the task is scheduled directly;
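The check described above can be sketched as follows. The configuration key `task.type.label` and the function names are hypothetical, introduced only to illustrate the "user label first, otherwise wait for Map run info" order of precedence.

```python
def resolve_task_label(job_conf, first_map_stats, classify):
    """Return the task's label: the user-set label if present, otherwise the
    classifier's verdict once the first Map process's stats have arrived."""
    user_label = job_conf.get("task.type.label")  # assumed config key
    if user_label is not None:
        return user_label        # user-labeled task: skip classification
    if first_map_stats is None:
        return None              # still waiting for Map running information
    return classify(first_map_stats)
```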
    (2) multi-priority queue
    5 queues are newly built in the scheduler: an original queue, a waiting priority queue, a CPU priority queue, an IO priority queue and a common priority queue; when a user submits a task, it first enters the original queue, where some of the task's Map processes are run and their running information is collected; the task then enters the waiting priority queue until the Map running information is returned and the task is classified; finally, the task enters the queue corresponding to its classification label;
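The task's movement through the five queues can be sketched as below; the `Queues` class and method names are illustrative assumptions, not from the patent text.

```python
from collections import deque

class Queues:
    """Five-queue task flow: original -> waiting -> label-specific queue."""
    def __init__(self):
        self.original = deque()
        self.waiting = deque()
        self.by_label = {"cpu": deque(), "io": deque(), "common": deque()}

    def submit(self, task):
        self.original.append(task)          # step 1: newly submitted task

    def start_sampling(self):
        task = self.original.popleft()      # step 2: some Map processes run;
        self.waiting.append(task)           # wait for their running information

    def classify_front(self, label):
        task = self.waiting.popleft()       # step 3: running info returned,
        self.by_label[label].append(task)   # enqueue by classification label
        return task
```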
    (3) task classification
    the data are preprocessed before classification: for data normalization, the variable data are linearly transformed onto a new scale on which the minimum transformed value is 0 and the maximum is 1, ensuring all transformed values are less than or equal to 1;
    for task classification, a naive Bayes classifier is selected; if the user has added the task type on the command line or in the task code, classification can be skipped and the task directly enters the corresponding queue to wait for resource allocation;
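A minimal sketch of this preprocessing and classification step, assuming min-max normalization and a Gaussian naive Bayes over numeric Map-process features; the feature layout and two-class setup are assumptions, since the claim only names the classifier type.

```python
import math

def min_max_normalize(column):
    """Linearly map a list of values onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

class GaussianNB:
    """Tiny Gaussian naive Bayes: per-class priors plus per-feature mean/variance."""
    def fit(self, X, y):
        self.stats = {}
        for label in set(y):
            rows = [x for x, lab in zip(X, y) if lab == label]
            self.stats[label] = (
                len(rows) / len(X),                          # class prior
                [self._mean_var(col) for col in zip(*rows)], # per-feature stats
            )
        return self

    @staticmethod
    def _mean_var(col):
        m = sum(col) / len(col)
        v = sum((c - m) ** 2 for c in col) / len(col) or 1e-9  # avoid zero variance
        return m, v

    def predict(self, x):
        def log_post(label):
            prior, feats = self.stats[label]
            s = math.log(prior)
            for xi, (m, v) in zip(x, feats):
                s += -((xi - m) ** 2) / (2 * v) - 0.5 * math.log(2 * math.pi * v)
            return s
        return max(self.stats, key=log_post)
```

In use, each task's first-Map features (e.g. CPU time share vs. IO time share) would be normalized, then fed to `fit`/`predict` to yield the queue label.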
    (4) data locality
    data locality is handled by a delay degradation scheduling strategy.
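A delay degradation strategy of the kind named in the claim can be sketched as follows: a task first waits for a node-local slot, and only after being skipped a number of times is its locality requirement degraded to rack-local and then to any node. The data types, function name, and skip limits below are assumptions for illustration, not values from the patent.

```python
from collections import namedtuple

Node = namedtuple("Node", ["host", "rack"])
Task = namedtuple("Task", ["preferred_hosts", "preferred_racks"])

NODE_LOCAL_SKIPS = 3   # assumed: skips tolerated before degrading to rack-local
RACK_LOCAL_SKIPS = 6   # assumed: skips tolerated before running on any node

def locality_decision(skip_count, node, task):
    """Decide whether `task` may launch on `node` at the current heartbeat."""
    if node.host in task.preferred_hosts:
        return True                      # node-local: launch immediately
    if skip_count >= RACK_LOCAL_SKIPS:
        return True                      # degraded twice: run anywhere
    if skip_count >= NODE_LOCAL_SKIPS and node.rack in task.preferred_racks:
        return True                      # degraded once: rack-local allowed
    return False                         # skip again; wait for a closer node
```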
CN201710181055.0A 2017-03-24 2017-03-24 Dynamic label matching DLMS scheduling method under Hadoop platform Expired - Fee Related CN107038069B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710181055.0A CN107038069B (en) 2017-03-24 2017-03-24 Dynamic label matching DLMS scheduling method under Hadoop platform

Publications (2)

Publication Number Publication Date
CN107038069A CN107038069A (en) 2017-08-11
CN107038069B true CN107038069B (en) 2020-05-08

Family

ID=59534217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710181055.0A Expired - Fee Related CN107038069B (en) 2017-03-24 2017-03-24 Dynamic label matching DLMS scheduling method under Hadoop platform

Country Status (1)

Country Link
CN (1) CN107038069B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766150A (en) * 2017-09-20 2018-03-06 电子科技大学 A kind of job scheduling algorithm based on hadoop
CN108052443A (en) * 2017-10-30 2018-05-18 北京奇虎科技有限公司 A kind of test assignment dispatching method, device, server and storage medium
CN107832153B (en) * 2017-11-14 2020-12-29 北京科技大学 Hadoop cluster resource self-adaptive allocation method
CN107832134B (en) * 2017-11-24 2021-07-20 平安科技(深圳)有限公司 Multitasking method, application server and storage medium
CN108509280B (en) * 2018-04-23 2022-05-31 南京大学 Distributed computing cluster locality scheduling method based on push model
CN110532085B (en) * 2018-05-23 2022-11-04 阿里巴巴集团控股有限公司 Scheduling method and scheduling server
CN108959580A (en) * 2018-07-06 2018-12-07 深圳市彬讯科技有限公司 A kind of optimization method and system of label data
CN109375992A (en) * 2018-08-17 2019-02-22 华为技术有限公司 A kind of resource regulating method and device
CN109656699A (en) * 2018-12-14 2019-04-19 平安医疗健康管理股份有限公司 Distributed computing method, device, system, equipment and readable storage medium storing program for executing
CN111930493B (en) * 2019-05-13 2023-08-01 中国移动通信集团湖北有限公司 NodeManager state management method and device in cluster and computing equipment
CN110278257A (en) * 2019-06-13 2019-09-24 中信银行股份有限公司 A kind of method of mobilism configuration distributed type assemblies node label
CN111124765A (en) * 2019-12-06 2020-05-08 中盈优创资讯科技有限公司 Big data cluster task scheduling method and system based on node labels
CN112039709B (en) * 2020-09-02 2022-01-25 北京首都在线科技股份有限公司 Resource scheduling method, device, equipment and computer readable storage medium
CN112445925B (en) * 2020-11-24 2022-08-26 浙江大华技术股份有限公司 Clustering archiving method, device, equipment and computer storage medium
CN113590294B (en) * 2021-07-30 2023-11-17 北京睿芯高通量科技有限公司 Self-adaptive and rule-guided distributed scheduling method
CN115904645A (en) * 2021-09-30 2023-04-04 华为技术有限公司 Method, apparatus, device and medium for task scheduling
WO2023056618A1 (en) * 2021-10-09 2023-04-13 国云科技股份有限公司 Cross-cloud platform resource scheduling method and apparatus, terminal device, and storage medium
CN114064294B (en) * 2021-11-29 2022-10-04 郑州轻工业大学 Dynamic resource allocation method and system in mobile edge computing environment
CN114840343A (en) * 2022-05-16 2022-08-02 江苏安超云软件有限公司 Task scheduling method and system based on distributed system
CN117056061B (en) * 2023-10-13 2024-01-09 浙江远算科技有限公司 Cross-supercomputer task scheduling method and system based on container distribution mechanism

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915407A (en) * 2015-06-03 2015-09-16 华中科技大学 Resource scheduling method under Hadoop-based multi-job environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756595B2 (en) * 2011-07-28 2014-06-17 Yahoo! Inc. Method and system for distributed application stack deployment

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104915407A (en) * 2015-06-03 2015-09-16 华中科技大学 Resource scheduling method under Hadoop-based multi-job environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Research on Optimization of Job Scheduling Algorithms for Cloud Computing Platforms"; Xu Peng; China Masters' Theses Full-text Database, Information Science and Technology; 20140815; pp. I138-40 *

Also Published As

Publication number Publication date
CN107038069A (en) 2017-08-11

Similar Documents

Publication Publication Date Title
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
US20170255496A1 (en) Method for scheduling data flow task and apparatus
US8812639B2 (en) Job managing device, job managing method and job managing program
CN114741207B (en) GPU resource scheduling method and system based on multi-dimensional combination parallelism
US9542223B2 (en) Scheduling jobs in a cluster by constructing multiple subclusters based on entry and exit rules
US9092266B2 (en) Scalable scheduling for distributed data processing
WO2011076608A2 (en) Goal oriented performance management of workload utilizing accelerators
US20060195845A1 (en) System and method for scheduling executables
Pakize A comprehensive view of Hadoop MapReduce scheduling algorithms
CN108509280B (en) Distributed computing cluster locality scheduling method based on push model
CN110990154B (en) Big data application optimization method, device and storage medium
Pongsakorn et al. Container rebalancing: Towards proactive linux containers placement optimization in a data center
Ahmed et al. A hybrid and optimized resource scheduling technique using map reduce for larger instruction sets
CN113946431B (en) Resource scheduling method, system, medium and computing device
CN110084507B (en) Scientific workflow scheduling optimization method based on hierarchical perception in cloud computing environment
US20210390405A1 (en) Microservice-based training systems in heterogeneous graphic processor unit (gpu) cluster and operating method thereof
KR101640231B1 (en) Cloud Driving Method for supporting auto-scaled Hadoop Distributed Parallel Processing System
CN113255165A (en) Experimental scheme parallel deduction system based on dynamic task allocation
Li et al. On scheduling of high-throughput scientific workflows under budget constraints in multi-cloud environments
CN115827237A (en) Storm task scheduling method based on cost performance
Khalil et al. Survey of Apache Spark optimized job scheduling in Big Data
CN112783651B (en) Load balancing scheduling method, medium and device for vGPU of cloud platform
CN111522637B (en) Method for scheduling storm task based on cost effectiveness
CN116932156A (en) Task processing method, device and system
Seethalakshmi et al. Job scheduling in big data-a survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200508