CN112379990A - Hadoop-based scheduling method - Google Patents

Hadoop-based scheduling method

Info

Publication number
CN112379990A
Authority
CN
China
Prior art keywords
task
container
hadoop
node
fair scheduler
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011390100.1A
Other languages
Chinese (zh)
Inventor
唐镜雯
常胜
陈俊
唐渊
陈卫民
夏先喜
向明
范方彪
徐高明
邓志祥
周能
邵帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Hunan Electric Power Co ltd Power Transmission Overhaul Branch
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Original Assignee
State Grid Hunan Electric Power Co ltd Power Transmission Overhaul Branch
State Grid Corp of China SGCC
State Grid Hunan Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Hunan Electric Power Co ltd Power Transmission Overhaul Branch, State Grid Corp of China SGCC, State Grid Hunan Electric Power Co Ltd filed Critical State Grid Hunan Electric Power Co ltd Power Transmission Overhaul Branch
Priority to CN202011390100.1A
Publication of CN112379990A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a Hadoop-based scheduling method which comprises: scheduling tasks normally with a Fair scheduler; when a Node requests a task, determining whether a task of the requested type exists and selecting the Container to be scheduled by the Fair scheduler according to the result; and determining whether the requesting Node is marked and selecting the Container to be scheduled by the Fair scheduler according to that result. By applying data-locality processing to the Hadoop scheduling process, the invention improves the timeliness of Hadoop scheduling, and the method is highly reliable.

Description

Hadoop-based scheduling method
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to a Hadoop-based scheduling method.
Background
With economic and technological development and rising living standards, the internet has been widely applied in production and daily life and brings great convenience. As the internet grows, the number of connected data terminals and users has exploded and data volumes have increased exponentially, which creates great difficulty for traditional data processing. In traditional data processing, data is usually processed on a single terminal, a centralized approach; for large volumes of data this takes considerable time, which is clearly unreasonable. Against this background, cloud computing was proposed; with advantages such as very large scale, virtualization, high reliability, generality, high scalability, on-demand service and low cost, it has developed at an unprecedented pace.
Meanwhile, different application scenarios place different requirements and emphases on Hadoop cluster performance, so it is necessary to optimize Hadoop for each application scenario. In the Hadoop framework, all the tasks into which a job is divided are scheduled by the scheduler, which is therefore a critical component. At the same time, in a distributed cluster, data transmission within the cluster should be avoided as much as possible and data locality should be improved.
However, current Hadoop-based scheduling methods offer only mediocre timeliness and cannot keep up with ever-increasing data volumes.
Disclosure of Invention
The invention aims to provide a Hadoop-based scheduling method with good timeliness and high reliability.
The invention provides a Hadoop-based scheduling method, which comprises the following steps:
S1, scheduling tasks normally with a Fair scheduler;
S2, when a Node requests a task, determining whether a task of the requested type exists, and selecting the Container to be scheduled by the Fair scheduler according to the result;
S3, determining whether the Node requesting the task is marked, and selecting the Container to be scheduled by the Fair scheduler according to the result.
The Hadoop-based scheduling method further comprises the following step: when a new task request is submitted, all marks are cleared.
In step S2, when a Node requests a task, whether a task of the requested type exists is determined and the Container to be scheduled by the Fair scheduler is selected according to the result; specifically, the Container is selected as follows:
when a Node requests a task, determine whether a task of the requested type exists:
if a task of the requested type exists, preferentially select a Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceed to the determination of step S3.
In step S3, whether the Node requesting the task is marked is determined and the Container to be scheduled by the Fair scheduler is selected according to the result; specifically, the Container is selected as follows:
when the determination of step S2 finds that no task of the requested type exists, determine whether the requesting Node is marked:
if the requesting Node is not marked, select, from the application, the Container whose transmission time or waiting time is the smallest and is less than 1 as the Container to be scheduled by the Fair scheduler; if no Container has a transmission time or waiting time less than 1, abandon the task request;
if the requesting Node is marked, select the Container with the smallest transmission time or waiting time in the application as the Container to be scheduled by the Fair scheduler.
The waiting time is specifically calculated by adopting the following formula:
if all nodes are running at the same speed, then:
(formula reproduced in the original publication only as an image)
wherein v is the running speed of the node; P is the progress; T_e is the elapsed running time; T_l is the remaining running time;
if the nodes are not all running at the same speed, then:
(formula reproduced in the original publication only as an image)
wherein f(X, Y) is the remaining time of the task; n is the number of stages into which the executing task is divided; x_i denotes the i-th stage of the executing task; X = {x_1, x_2, ..., x_i, ..., x_n} denotes the P of each stage, and Y denotes the T_e of each stage.
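The two formulas above appear in the published text only as images, so their exact form is not recoverable here. As a minimal sketch only, the following Java snippet (Java being the language Hadoop itself is written in) assumes the common progress-based estimate in which a task's remaining time equals its elapsed time T_e scaled by (1 - P)/P, summed over stages when node speeds differ; the class and method names are illustrative and do not come from the patent.

    // Sketch under an ASSUMED formula: remaining time = T_e * (1 - P) / P per task or stage;
    // the patent's actual formulas are published only as images and may differ.
    final class WaitTimeEstimator {

        // Uniform-speed case: P is the progress in (0, 1], elapsedSeconds is T_e.
        static double remainingTime(double elapsedSeconds, double progress) {
            if (progress <= 0.0) {
                return Double.POSITIVE_INFINITY; // no measurable progress yet
            }
            return elapsedSeconds * (1.0 - progress) / progress;
        }

        // Different-speed case: sum the per-stage estimates over the n stages,
        // with stageProgress holding each stage's P and stageElapsed its T_e.
        static double remainingTime(double[] stageProgress, double[] stageElapsed) {
            double total = 0.0;
            for (int i = 0; i < stageProgress.length; i++) {
                total += remainingTime(stageElapsed[i], stageProgress[i]);
            }
            return total;
        }
    }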
By applying data-locality processing to the Hadoop scheduling process, the Hadoop-based scheduling method of the invention improves scheduling timeliness and offers high reliability.
Drawings
Fig. 1 is a schematic method flow diagram of a scheduling method of the present invention.
Fig. 2 is a schematic diagram of a cluster structure of an embodiment of a scheduling method of the present invention.
Fig. 3 is a diagram illustrating the number of Map tasks not executed locally in the first embodiment of the scheduling method of the present invention.
Fig. 4 is a schematic diagram of job execution time according to a first embodiment of the scheduling method of the present invention.
Fig. 5 is a diagram illustrating the number of Map tasks not executed locally in the second embodiment of the scheduling method of the present invention.
Fig. 6 is a schematic diagram of job execution time of the second embodiment of the scheduling method according to the present invention.
Fig. 7 is a diagram illustrating the number of Map tasks not executed locally in the third embodiment of the scheduling method of the present invention.
Fig. 8 is a schematic diagram of job execution time of a third embodiment of the scheduling method of the present invention.
Detailed Description
Fig. 1 is a schematic flow chart of a scheduling method of the present invention: the invention provides a Hadoop-based scheduling method, which comprises the following steps:
S1, scheduling tasks normally with a Fair scheduler;
S2, when a Node requests a task, determining whether a task of the requested type exists, and selecting the Container to be scheduled by the Fair scheduler according to the result; specifically, the Container is selected as follows:
when a Node requests a task, determine whether a task of the requested type exists:
if a task of the requested type exists, preferentially select a Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceed to the determination of step S3;
S3, determining whether the Node requesting the task is marked, and selecting the Container to be scheduled by the Fair scheduler according to the result; specifically, the Container is selected as follows:
when the determination of step S2 finds that no task of the requested type exists, determine whether the requesting Node is marked:
if the requesting Node is not marked, select, from the application, the Container whose transmission time or waiting time is the smallest and is less than 1 as the Container to be scheduled by the Fair scheduler; if no Container has a transmission time or waiting time less than 1, abandon the task request;
if the requesting Node is marked, select the Container with the smallest transmission time or waiting time in the application as the Container to be scheduled by the Fair scheduler;
in specific implementation, the waiting time is calculated by the following formula:
if all nodes are running at the same speed, then:
(formula reproduced in the original publication only as an image)
wherein v is the running speed of the node; P is the progress; T_e is the elapsed running time; T_l is the remaining running time;
if the nodes are not all running at the same speed, then:
(formula reproduced in the original publication only as an image)
wherein f(X, Y) is the remaining time of the task; n is the number of stages into which the executing task is divided; x_i denotes the i-th stage of the executing task; X = {x_1, x_2, ..., x_i, ..., x_n} denotes the P of each stage, and Y denotes the T_e of each stage.
S4, when a new task request is submitted, clearing all marks.
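The selection rule of steps S1 to S4 can be summarised as a small piece of scheduler-side logic. The sketch below is a simplified reading of those steps, not the YARN FairScheduler API: the Node and Container interfaces are hypothetical stand-ins, the threshold of 1 is taken verbatim from the text (its unit is not stated), and, because the text does not say when a Node becomes marked, the sketch assumes a Node is marked once one of its requests is abandoned so that its next request is always served.

    // Hypothetical types standing in for the real scheduler objects.
    interface Node {
        String id();
    }

    interface Container {
        boolean hasLocalDataOn(Node node);   // is this container's input data stored on the node?
        double transmissionTime(Node node);  // estimated time to move the data to the node
        double waitingTime();                // estimated wait, e.g. from the estimate sketched earlier
    }

    final class LocalityAwareSelector {

        private final java.util.Set<String> markedNodes = new java.util.HashSet<>();

        // S4: clear all marks when a new task request is submitted.
        void onNewTaskRequest() {
            markedNodes.clear();
        }

        // S2/S3: pick the Container the Fair scheduler should hand to the requesting node,
        // or return null to abandon this request.
        Container select(Node node, java.util.List<Container> containers) {
            // S2: prefer a Container whose data is already stored on the requesting node.
            for (Container c : containers) {
                if (c.hasLocalDataOn(node)) {
                    return c;
                }
            }
            // S3: no local Container; find the one with the smallest transmission/waiting time.
            Container best = null;
            double bestCost = Double.POSITIVE_INFINITY;
            for (Container c : containers) {
                double cost = Math.min(c.transmissionTime(node), c.waitingTime());
                if (cost < bestCost) {
                    bestCost = cost;
                    best = c;
                }
            }
            if (best == null) {
                return null;                    // nothing to schedule
            }
            if (markedNodes.contains(node.id())) {
                return best;                    // marked node: cheapest Container, no threshold
            }
            if (bestCost < 1.0) {
                return best;                    // unmarked node: accept only below the threshold
            }
            markedNodes.add(node.id());         // assumption: abandoning a request marks the node
            return null;                        // give up this task request
        }
    }

Under these assumptions an unmarked node whose best option costs 1 or more is skipped once and then served on its next request, which is one way the marking could keep locality high without starving any node.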
The scheduling method of the present invention is further described with reference to an embodiment as follows:
the algorithm is realized and deployed to a Hadoop experiment platform, a plurality of jobs are simulated to run, and the advantages and disadvantages of the data nature row of the algorithm are compared with those of a Fair Scheduler. And analyzing the experimental result to obtain an experimental conclusion.
First, the Hadoop source code is downloaded, compiled and debugged, as follows:
Download and install G++, CMake, zlib1g-dev, Maven, protobuf 2.5, findbugs and the openssl development libraries, and configure the environment variables.
Decompress the source code and compile it into an Eclipse project in two steps: first switch to hadoop-maven-plugins under the source root directory and run mvn install; after it succeeds, switch back to the root directory and run mvn eclipse:eclipse -DskipTests to generate the Eclipse project, which is then imported into Eclipse.
The source code is packaged with Maven by running mvn package -Pdist -DskipTests -Dtar in the root directory.
Hadoop is then deployed.
The deployed Hadoop runs on Linux, Ubuntu 14.04.2.
Because hardware for a large-scale cluster was not available, the experiment is carried out by simulation. The simulator is Mumak, a Hadoop-based plug-in that consists of four main components: a simulation engine with a discrete event queue, a ResourceManager that simulates job scheduling, NodeManagers that simulate the task-execution cluster, and a Job Client that submits jobs. Together these provide the basic components of a Hadoop cluster: jobs are submitted by the Job Client, and the ResourceManager allocates resources for each job and stores the tasks into which the job is divided in its own internal data structures. The NodeManager acts as a task-processing node; when it is idle it requests a task from the ResourceManager via heartbeat, and the heartbeat generates a series of events that are handled by the scheduler inside the ResourceManager. Mumak requires a system log to evaluate task processing times.
Fig. 2 shows the simulated Hadoop cluster, which is divided into two racks with 11 nodes; one node is the HDFS NameNode and the others are DataNodes. The NameNode also serves as the ResourceManager, the other nodes serve as NodeManagers, and the hosts are connected by gigabit switches.
The algorithm relies mainly on the comparison between waiting time and transmission time. The waiting time is calculated by the formula above, so the configuration of the remaining transmission time is critical: if it is set too small, most tasks cannot be executed locally, and if it is set too large, task response time increases. Because the two racks are cascaded through two switches, transmitting data between racks takes longer than transmitting it within a rack. Table 1 lists the measured transmission times for data blocks of different sizes within and between racks (a sketch of this estimate is given after Table 1).
Table 1 cluster data block transmission schedule
(Table 1 is published as an image in the original document; its values are not reproduced here.)
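Since the measured values in Table 1 are not available here, the sketch below only illustrates the shape of such a lookup: transmission time estimated from block size and an effective bandwidth that is lower across racks than within one. The bandwidth figures are placeholders, not the values measured on the test cluster.

    final class TransmissionTimeModel {

        // Placeholder effective bandwidths in MB/s (assumed, not measured on the cluster).
        private static final double INTRA_RACK_MB_PER_S = 100.0;
        private static final double INTER_RACK_MB_PER_S = 50.0;

        // Estimated seconds to move a data block of the given size to the requesting node.
        static double transmissionSeconds(double blockSizeMb, boolean sameRack) {
            double bandwidth = sameRack ? INTRA_RACK_MB_PER_S : INTER_RACK_MB_PER_S;
            return blockSizeMb / bandwidth;
        }
    }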
The evaluation criteria for an excellent cluster are multifaceted; the most important include job response time, job processing time, data locality, cluster load balancing and cluster network load. Here we mainly compare job processing time and data locality between the FairScheduler and the FairScheduler incorporating the improved data-locality algorithm.
There are three jobs in total, each with a different data volume and data block size; each job has 40 tasks, and the number of tasks a node can process simultaneously is the default value, as shown in Table 2 below:
TABLE 2 case 1 Job data volume and task volume
(Table 2 is published as an image in the original document; the job data volumes and task counts are not reproduced here.)
The experimental results for case 1 are shown in Fig. 3 and Fig. 4. As the figures show, in case 1 the FairScheduler with the improved data-locality algorithm reduces the number of tasks not executed locally by 11% on average compared with the plain FairScheduler. Along with this reduction, the overall job execution time also falls, by 10% on average.
To verify whether the improvement still holds when the number of tasks differs from case 1, the embodiment considers case 2, in which each job has 60 tasks (the only difference from case 1) and the number of tasks a node can process simultaneously remains the default value, as shown in Table 3 below:
TABLE 3 case 2 Job data volume and task volume
(Table 3 is published as an image in the original document; the job data volumes and task counts are not reproduced here.)
The experimental results for case 2 are shown in Fig. 5 and Fig. 6. As the figures show, in case 2 the FairScheduler with the improved data-locality algorithm reduces the number of tasks not executed locally by 14% on average compared with the plain FairScheduler, while the execution time is reduced by 9% on average.
The embodiment also considers case 3, again with three jobs of different data volumes and data block sizes; in case 3 the number of tasks a node can process simultaneously is configured to be 4, the difference from case 1, in order to verify whether the improvement still holds when the number of tasks a single node can process changes. Details are given in Table 4 below:
table 4 case 3 job data volume and task volume
(Table 4 is published as an image in the original document; the job data volumes and task counts are not reproduced here.)
As can be seen from Fig. 7, in case 3 the FairScheduler with the improved data-locality algorithm reduces the number of tasks not executed locally by 13% on average compared with the plain FairScheduler. As shown in Fig. 8, the overall job execution time is reduced by 8% on average.
Taken together, cases 1, 2 and 3 all involve jobs with relatively small task counts, and the experimental results verify that the FairScheduler scheduling algorithm with the improved data-locality algorithm reduces the number of tasks not executed locally and shortens overall job execution time when the number of tasks per job is small.

Claims (4)

1. A Hadoop-based scheduling method, comprising the following steps:
S1, scheduling tasks normally with a Fair scheduler;
S2, when a Node requests a task, determining whether a task of the requested type exists, and selecting the Container to be scheduled by the Fair scheduler according to the result;
S3, determining whether the Node requesting the task is marked, and selecting the Container to be scheduled by the Fair scheduler according to the result.
2. The Hadoop-based scheduling method according to claim 1, further comprising the step of: when a new task request is submitted, clearing all marks;
wherein in step S2, when a Node requests a task, whether a task of the requested type exists is determined and the Container to be scheduled by the Fair scheduler is selected according to the result; specifically, the Container is selected as follows:
when a Node requests a task, determine whether a task of the requested type exists:
if a task of the requested type exists, preferentially select a Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceed to the determination of step S3.
3. The Hadoop-based scheduling method according to claim 2, wherein in step S3 whether the Node requesting the task is marked is determined and the Container to be scheduled by the Fair scheduler is selected according to the result; specifically, the Container is selected as follows:
when the determination of step S2 finds that no task of the requested type exists, determine whether the requesting Node is marked:
if the requesting Node is not marked, select, from the application, the Container whose transmission time or waiting time is the smallest and is less than 1 as the Container to be scheduled by the Fair scheduler; if no Container has a transmission time or waiting time less than 1, abandon the task request;
if the requesting Node is marked, select the Container with the smallest transmission time or waiting time in the application as the Container to be scheduled by the Fair scheduler.
4. The Hadoop-based scheduling method according to claim 3, wherein the waiting time is calculated by the following equation:
if all nodes are running at the same speed, then:
(formula reproduced in the original publication only as an image)
wherein v is the running speed of the node; P is the progress; T_e is the elapsed running time; T_l is the remaining running time;
if the nodes are not all running at the same speed, then:
(formula reproduced in the original publication only as an image)
wherein f(X, Y) is the remaining time of the task; n is the number of stages into which the executing task is divided; x_i denotes the i-th stage of the executing task; X = {x_1, x_2, ..., x_i, ..., x_n} denotes the P of each stage, and Y denotes the T_e of each stage.
CN202011390100.1A 2020-12-02 2020-12-02 Hadoop-based scheduling method Pending CN112379990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011390100.1A CN112379990A (en) 2020-12-02 2020-12-02 Hadoop-based scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011390100.1A CN112379990A (en) 2020-12-02 2020-12-02 Hadoop-based scheduling method

Publications (1)

Publication Number Publication Date
CN112379990A (en) 2021-02-19

Family

ID=74589561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011390100.1A Pending CN112379990A (en) 2020-12-02 2020-12-02 Hadoop-based scheduling method

Country Status (1)

Country Link
CN (1) CN112379990A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140310712A1 (en) * 2013-04-10 2014-10-16 International Business Machines Corporation Sequential cooperation between map and reduce phases to improve data locality
CN111045795A (en) * 2018-10-11 2020-04-21 浙江宇视科技有限公司 Resource scheduling method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140310712A1 (en) * 2013-04-10 2014-10-16 International Business Machines Corporation Sequential cooperation between map and reduce phases to improve data locality
CN111045795A (en) * 2018-10-11 2020-04-21 浙江宇视科技有限公司 Resource scheduling method and device

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HADOOP official documentation: "HDFS Architecture", hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html, 6 July 2020 (2020-07-06) *
TOM WHITE: "Hadoop: The Definitive Guide, Fourth Edition", O'Reilly, 17 April 2015, pages 185-192 *
FU QINGWU (付庆午): "Research on Data-Aware Scheduling Strategies for MapReduce Jobs" (MapReduce作业的Data-Aware调度策略研究), Master's thesis, Jilin University, no. 10, 15 October 2012 (2012-10-15) *

Similar Documents

Publication Publication Date Title
Taura et al. A heuristic algorithm for mapping communicating tasks on heterogeneous resources
US8150889B1 (en) Parallel processing framework
US8434085B2 (en) Scalable scheduling of tasks in heterogeneous systems
US11816509B2 (en) Workload placement for virtual GPU enabled systems
CN114138486A (en) Containerized micro-service arranging method, system and medium for cloud edge heterogeneous environment
CN111381950A (en) Task scheduling method and system based on multiple copies for edge computing environment
Kreaseck et al. Autonomous protocols for bandwidth-centric scheduling of independent-task applications
CN114610474B (en) Multi-strategy job scheduling method and system under heterogeneous supercomputing environment
CN1845075A (en) Service oriented high-performance grid computing job scheduling method
CN116450355A (en) Multi-cluster model training method, device, equipment and medium
CN117271101B (en) Operator fusion method and device, electronic equipment and storage medium
CN112395736A (en) Parallel simulation job scheduling method of distributed interactive simulation system
CN110928666B (en) Method and system for optimizing task parallelism based on memory in Spark environment
CN112912849B (en) Graph data-based calculation operation scheduling method, system, computer readable medium and equipment
CN116932147A (en) Streaming job processing method and device, electronic equipment and medium
CN112379990A (en) Hadoop-based scheduling method
Fan et al. Associated task scheduling based on dynamic finish time prediction for cloud computing
CN115080207A (en) Task processing method and device based on container cluster
CN110297693B (en) Distributed software task allocation method and system
Paravastu et al. Adaptive load balancing in mapreduce using flubber
Hassan et al. A survey about efficient job scheduling strategies in cloud and large scale environments
Chiang et al. Dynamic Resource Management for Machine Learning Pipeline Workloads
CN112068954B (en) Method and system for scheduling network computing resources
O'Neill et al. Cross resource optimisation of database functionality across heterogeneous processors
CN117579626B (en) Optimization method and system based on distributed realization of edge calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210219