CN112379990A - Hadoop-based scheduling method - Google Patents
Hadoop-based scheduling method
- Publication number
- CN112379990A (application CN202011390100.1A)
- Authority
- CN
- China
- Prior art keywords
- task
- container
- hadoop
- node
- fair scheduler
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a Hadoop-based scheduling method, which comprises: the Fair Scheduler schedules tasks normally; when a Node requests a task, whether a task of the requested type exists is judged and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result; and whether the Node requesting the task has been marked is judged and the Container to be scheduled by the Fair Scheduler is selected according to that judgment result. By applying data-locality processing to the Hadoop scheduling process, the invention improves the timeliness of the Hadoop scheduling method, and the method has high reliability.
Description
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to a Hadoop-based scheduling method.
Background
With the development of economic technology and the improvement of living standards, the Internet has been widely applied in people's production and daily life, bringing enormous convenience. As the Internet has developed, the number of connected data terminals and users has grown explosively and the data volume has grown exponentially, which creates great difficulty for traditional data processing. In traditional data processing, data is usually processed on a single terminal in a centralized manner; for big data, such processing takes a considerable amount of time, which is clearly unreasonable. Against this background, cloud computing was proposed; with advantages such as very large scale, virtualization, high reliability, versatility, high scalability, on-demand service and low cost, it has developed at an unprecedented pace.
Meanwhile, different application scenarios place different requirements and emphases on the performance of a Hadoop cluster, so it is necessary to optimize Hadoop performance for each application scenario. In the Hadoop framework, all the tasks into which a job is divided are scheduled by the scheduler, making the scheduler a critical component. At the same time, in a distributed cluster, data transmission within the cluster should be avoided as much as possible and data locality should be improved.
However, existing Hadoop-based scheduling methods offer only mediocre timeliness and cannot meet the demands of ever-increasing data volumes.
Disclosure of Invention
The invention aims to provide a Hadoop-based scheduling method with good timeliness and high reliability.
The invention provides a Hadoop-based scheduling method, which comprises the following steps:
S1, the Fair Scheduler schedules tasks normally;
S2, when a Node requests a task, judging whether a task of the requested type exists, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result;
S3, judging whether the Node requesting the task has been marked, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result.
The Hadoop-based scheduling method further comprises the following step: when a new task request is submitted, all marks are cleared.
In step S2, when a Node requests a task, whether a task of the requested type exists is judged, and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result; specifically, the Container is selected by the following steps:
when a Node requests a task, judging whether a task of the requested type exists:
if a task of the requested type exists, preferentially selecting the Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceeding to the judgment of step S3.
In step S3, whether the Node requesting the task has been marked is judged, and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result; specifically, the Container is selected by the following steps:
when the judgment result of step S2 is that no task of the requested type exists, judging whether the Node requesting the task has been marked:
if the Node requesting the task has not been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application whose transmission time or waiting time is the smallest and less than 1; if no Container has a transmission time or waiting time less than 1, abandoning the task request;
if the Node requesting the task has been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application with the smallest transmission time or waiting time.
The waiting time is specifically calculated by the following formulas:
if all nodes run at the same speed, then:
where v is the running speed of the node, P is the progress, T_e is the elapsed running time, and T_l is the remaining running time;
if the nodes do not all run at the same speed, then:
where f(X, Y) is the remaining time of the task, n is the number of stages into which the executed task is divided, x_i denotes the i-th stage of the task being executed, X = {x_1, x_2, ..., x_i, ..., x_n} denotes the progress P of each stage, and Y represents the T_e of each stage.
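The formulas referenced above appear as images in the source and are not reproduced here. Purely for illustration, under an assumed progress-based model (an assumption, not necessarily the exact expressions used in the patent), the remaining time could be estimated as follows:

```latex
% Illustrative assumption only -- a standard progress-based estimate, not the patent's verified formula.
% Uniform node speed: remaining time T_l from elapsed time T_e and progress P (0 < P <= 1).
T_l = \frac{T_e}{P} - T_e = T_e \cdot \frac{1 - P}{P}
% Non-uniform speeds: sum the per-stage estimates, with x_i the progress and y_i the elapsed time of stage i.
f(X, Y) = \sum_{i=1}^{n} y_i \cdot \frac{1 - x_i}{x_i}
```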
By applying data-locality processing to the Hadoop scheduling process, the Hadoop-based scheduling method of the invention improves timeliness and has high reliability.
Drawings
Fig. 1 is a schematic flow chart of the scheduling method of the present invention.
Fig. 2 is a schematic diagram of the cluster structure used in the embodiments of the scheduling method of the present invention.
Fig. 3 is a diagram of the number of Map tasks not executed locally in the first embodiment of the scheduling method of the present invention.
Fig. 4 is a schematic diagram of the job execution time in the first embodiment of the scheduling method of the present invention.
Fig. 5 is a diagram of the number of Map tasks not executed locally in the second embodiment of the scheduling method of the present invention.
Fig. 6 is a schematic diagram of the job execution time in the second embodiment of the scheduling method of the present invention.
Fig. 7 is a diagram of the number of Map tasks not executed locally in the third embodiment of the scheduling method of the present invention.
Fig. 8 is a schematic diagram of the job execution time in the third embodiment of the scheduling method of the present invention.
Detailed Description
Fig. 1 is a schematic flow chart of the scheduling method of the present invention. The invention provides a Hadoop-based scheduling method comprising the following steps:
S1, the Fair Scheduler schedules tasks normally;
S2, when a Node requests a task, judging whether a task of the requested type exists, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result; specifically, the Container is selected by the following steps:
when a Node requests a task, judging whether a task of the requested type exists:
if a task of the requested type exists, preferentially selecting the Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceeding to the judgment of step S3;
S3, judging whether the Node requesting the task has been marked, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result; specifically, the Container is selected by the following steps:
when the judgment result of step S2 is that no task of the requested type exists, judging whether the Node requesting the task has been marked:
if the Node requesting the task has not been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application whose transmission time or waiting time is the smallest and less than 1; if no Container has a transmission time or waiting time less than 1, abandoning the task request;
if the Node requesting the task has been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application with the smallest transmission time or waiting time;
in a specific implementation, the waiting time is calculated by the following formulas:
if all nodes run at the same speed, then:
where v is the running speed of the node, P is the progress, T_e is the elapsed running time, and T_l is the remaining running time;
if the nodes do not all run at the same speed, then:
where f(X, Y) is the remaining time of the task, n is the number of stages into which the executed task is divided, x_i denotes the i-th stage of the task being executed, X = {x_1, x_2, ..., x_i, ..., x_n} denotes the progress P of each stage, and Y represents the T_e of each stage;
S4, when a new task request is submitted, all marks are cleared.
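Purely as an illustration of how the selection logic of steps S1 to S4 could be organized, a minimal sketch follows. All class, field and method names (LocalityAwareSelector, Container, markedNodes, and so on) are hypothetical and are not part of the Hadoop YARN or FairScheduler API; in particular, the assumption that a Node becomes marked when its request is abandoned is an inference from the description above, not a statement made in the source.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Minimal sketch of the locality-aware container selection described in steps S1-S4.
 * All names are hypothetical; this is not the Hadoop YARN / FairScheduler API.
 */
final class LocalityAwareSelector {

    /** The "less than 1" bound from step S3 (same units as the cost values below). */
    private static final double THRESHOLD = 1.0;

    private final Set<String> markedNodes = new HashSet<>();

    /** Step S4: a newly submitted task request clears all marks. */
    void onNewTaskRequest() {
        markedNodes.clear();
    }

    /**
     * Steps S2/S3: choose the Container the Fair Scheduler should schedule,
     * or return null to abandon the request.
     */
    Container select(String requestingNode, List<Container> candidates) {
        // Step S2: prefer a Container whose data is stored on the requesting Node.
        for (Container c : candidates) {
            if (c.dataNode.equals(requestingNode)) {
                return c;
            }
        }
        // Step S3: nothing local; fall back to the minimum transmission/waiting time.
        Container best = null;
        for (Container c : candidates) {
            if (best == null || c.cost < best.cost) {
                best = c;
            }
        }
        if (best == null) {
            return null; // no candidates at all
        }
        if (markedNodes.contains(requestingNode)) {
            return best; // marked Node: always take the minimum-cost Container
        }
        if (best.cost < THRESHOLD) {
            return best; // unmarked Node: only accept a cost below the bound
        }
        // Assumption (not stated explicitly in the source): abandoning the request
        // marks the Node so that its next request bypasses the threshold.
        markedNodes.add(requestingNode);
        return null;
    }

    /** Candidate container; dataNode is where its input data resides, cost is min(transmission, waiting) time. */
    static final class Container {
        final String dataNode;
        final double cost;

        Container(String dataNode, double cost) {
            this.dataNode = dataNode;
            this.cost = cost;
        }
    }
}
```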
The scheduling method of the present invention is further described below with reference to embodiments:
the algorithm is realized and deployed to a Hadoop experiment platform, a plurality of jobs are simulated to run, and the advantages and disadvantages of the data nature row of the algorithm are compared with those of a Fair Scheduler. And analyzing the experimental result to obtain an experimental conclusion.
First, the Hadoop source code is downloaded, compiled and debugged, with the following steps:
Download and install g++, CMake, zlib1g-dev, Maven, protobuf 2.5, findbugs and openssl, and configure the environment variables.
Decompress the source code and compile it into an Eclipse project in two steps: first switch to hadoop-maven-plugins under the source-code root directory and execute mvn install; after it succeeds, switch to the root directory and execute mvn eclipse:eclipse -DskipTests to generate the Eclipse project, then import the project into Eclipse.
Compile the source code with Maven by executing mvn package -Pdist -DskipTests -Dtar in the root directory.
Then deploy Hadoop.
The deployed Hadoop runs on the Linux operating system, version Ubuntu 14.04.2.
Because the hardware for a large-scale cluster was not available, simulation is used for the experiments. Mumak is used as the simulation software; it is a plug-in based on Hadoop and consists mainly of four components: a simulation engine with a discrete event queue, a ResourceManager that simulates job scheduling, NodeManagers that simulate the task-executing cluster, and the job-submitting component, the Job Client. These are the basic components of a Hadoop cluster: jobs are submitted by the Job Client, and the ResourceManager allocates resources for each job and stores the tasks into which each job is divided in its own internal data structures. A NodeManager acts as a processing node for tasks; when it is idle, it requests a task from the ResourceManager, the request is sent via heartbeat, and the heartbeat generates a series of events that are processed by the scheduler inside the ResourceManager. Mumak requires a system log in order to evaluate the processing time of tasks.
Fig. 2 shows the simulated Hadoop cluster, which is divided into two racks containing 11 nodes; one node is the NameNode of HDFS and the others are DataNodes. The NameNode also acts as the ResourceManager, the other nodes act as NodeManagers, and the hosts are connected by gigabit switches.
The algorithm is mainly based on comparing the waiting time with the transmission time. The waiting time is calculated by the formulas above; the configuration of the remaining transmission time is crucial: if it is set too small, most tasks cannot be executed locally, and if it is set too large, the response time of tasks is prolonged. Because data between racks passes through two cascaded switches, transmitting data between different racks takes longer than transmitting data within the same rack. Table 1 details the measured transfer times of data blocks of different sizes between and within racks.
Table 1 Cluster data-block transmission times
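Since Table 1 itself is not reproduced in this text, the following small example is illustrative only: it estimates intra-rack versus cross-rack transfer times for a data block under assumed effective bandwidths. The bandwidth values are hypothetical placeholders, not the measured values of Table 1.

```java
/**
 * Illustrative only: estimates intra-rack vs. cross-rack transfer time for a data block.
 * The bandwidth values are hypothetical placeholders, not the measured values of Table 1.
 */
public final class TransferTimeEstimator {

    // Assumed effective bandwidths (MB/s) for a gigabit-switched, two-rack cluster.
    private static final double INTRA_RACK_MB_PER_S = 110.0;
    private static final double CROSS_RACK_MB_PER_S = 55.0;

    /** Estimated transfer time in seconds for a block of the given size. */
    public static double transferSeconds(double blockSizeMb, boolean sameRack) {
        double bandwidth = sameRack ? INTRA_RACK_MB_PER_S : CROSS_RACK_MB_PER_S;
        return blockSizeMb / bandwidth;
    }

    public static void main(String[] args) {
        // Example: a 128 MB HDFS block.
        System.out.printf("intra-rack: %.2f s%n", transferSeconds(128, true));
        System.out.printf("cross-rack: %.2f s%n", transferSeconds(128, false));
    }
}
```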
The criteria for evaluating an excellent cluster are multifaceted; the most important include job response time, job processing time, job data locality, cluster load balancing and cluster network load. Here we mainly compare job processing time and data locality between the FairScheduler and the FairScheduler incorporating the improved data-locality algorithm.
There are three jobs in total, each with a different data volume and data block size; the number of tasks of each job is 40, and the number of tasks a node can process simultaneously is the default value, as shown in Table 2 below:
Table 2 Case 1 job data volume and task count
The experimental results for case 1 are shown in Fig. 3 and Fig. 4. As can be seen from the figures, in case 1 the number of tasks not executed locally is reduced by 11% on average for the FairScheduler incorporating the improved data-locality algorithm compared with the plain FairScheduler. Along with the reduction in tasks not executed locally, the overall execution time of the jobs is also reduced, by 10% on average.
To verify whether the results still hold when the number of tasks differs from that of case 1, this embodiment considers case 2, in which the number of tasks is 60 (the only difference from case 1) and the number of tasks a node can process simultaneously remains the default value, as shown in Table 3 below:
Table 3 Case 2 job data volume and task count
The experimental results for case 2 are shown in Fig. 5 and Fig. 6 below. As can be seen from the figures, in case 2 the number of tasks not executed locally is reduced by 14% on average for the FairScheduler incorporating the improved data-locality algorithm compared with the plain FairScheduler, while the execution time is reduced by 9% on average.
This embodiment also considers case 3: three jobs are again used, each with a different data volume and data block size, and the number of tasks a node can process simultaneously is configured as 4. This differs from case 1 in order to verify whether the results still hold when the number of tasks a single node can process changes. Details are shown in Table 4 below:
Table 4 Case 3 job data volume and task count
As can be seen from Fig. 7, in case 3 the number of tasks not executed locally is reduced by 13% on average for the FairScheduler incorporating the improved data-locality algorithm compared with the plain FairScheduler. Meanwhile, as shown in Fig. 8, the execution time of the whole job is reduced by 8% on average.
In summary, the experiments for case 1, case 2 and case 3, in which the jobs have relatively small numbers of tasks, verify that the FairScheduler scheduling algorithm incorporating the improved data-locality algorithm reduces the number of tasks not executed locally and reduces the overall execution time of jobs when the number of tasks per job is small.
Claims (4)
1. A Hadoop-based scheduling method, comprising the following steps:
S1, scheduling tasks normally by a Fair Scheduler;
S2, when a Node requests a task, judging whether a task of the requested type exists, and selecting a Container to be scheduled by the Fair Scheduler according to the judgment result;
S3, judging whether the Node requesting the task has been marked, and selecting a Container to be scheduled by the Fair Scheduler according to the judgment result.
2. The Hadoop-based scheduling method according to claim 1, further comprising the step of: when a new task request is submitted, clearing all marks;
wherein in step S2, when a Node requests a task, whether a task of the requested type exists is judged and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result, specifically by the following steps:
when a Node requests a task, judging whether a task of the requested type exists:
if a task of the requested type exists, preferentially selecting the Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceeding to the judgment of step S3.
3. The Hadoop-based scheduling method according to claim 2, wherein in step S3 whether the Node requesting the task has been marked is judged and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result, specifically by the following steps:
when the judgment result of step S2 is that no task of the requested type exists, judging whether the Node requesting the task has been marked:
if the Node requesting the task has not been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application whose transmission time or waiting time is the smallest and less than 1; if no Container has a transmission time or waiting time less than 1, abandoning the task request;
if the Node requesting the task has been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application with the smallest transmission time or waiting time.
4. The Hadoop-based scheduling method according to claim 3, wherein the waiting time is calculated by the following formulas:
if all nodes run at the same speed, then:
where v is the running speed of the node, P is the progress, T_e is the elapsed running time, and T_l is the remaining running time;
if the nodes do not all run at the same speed, then:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011390100.1A CN112379990A (en) | 2020-12-02 | 2020-12-02 | Hadoop-based scheduling method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112379990A true CN112379990A (en) | 2021-02-19 |
Family
ID=74589561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011390100.1A Pending CN112379990A (en) | 2020-12-02 | 2020-12-02 | Hadoop-based scheduling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112379990A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140310712A1 (en) * | 2013-04-10 | 2014-10-16 | International Business Machines Corporation | Sequential cooperation between map and reduce phases to improve data locality |
CN111045795A (en) * | 2018-10-11 | 2020-04-21 | 浙江宇视科技有限公司 | Resource scheduling method and device |
Non-Patent Citations (4)
Title |
---|
Hadoop official documentation: "HDFS Architecture", hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html *
Hadoop official documentation: "HDFS Architecture", hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html, 6 July 2020 (2020-07-06) *
Tom White: "Hadoop: The Definitive Guide, Fourth Edition", O'Reilly, 17 April 2015, pages 185-192 *
Fu Qingwu: "Research on Data-Aware Scheduling Strategy for MapReduce Jobs", Master's Thesis, Jilin University, no. 10, 15 October 2012 (2012-10-15) *
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210219