CN112379990A - Hadoop-based scheduling method - Google Patents
Hadoop-based scheduling method
- Publication number
- CN112379990A (application CN202011390100.1A)
- Authority
- CN
- China
- Prior art keywords
- task
- container
- hadoop
- node
- fair scheduler
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a Hadoop-based scheduling method, which comprises: the Fair Scheduler schedules tasks normally; when a Node requests a task, whether a task of the requested type exists is judged and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result; and whether the Node requesting the task has been marked is judged and the Container to be scheduled by the Fair Scheduler is selected according to that judgment result. By applying data-locality processing to the Hadoop scheduling process, the invention improves the timeliness of the Hadoop scheduling method, and the method has high reliability.
Description
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to a Hadoop-based scheduling method.
Background
With the development of economic technology and the improvement of living standards, the Internet has been widely applied in people's production and daily life, bringing enormous convenience. As the Internet has developed, the number of connected data terminals and users has grown explosively and the data volume has grown exponentially, which creates great difficulty for traditional data processing. In traditional data processing, data is usually processed on a single terminal in a centralized manner; for big data, such processing takes a considerable amount of time, which is clearly unreasonable. Against this background, cloud computing was proposed; with advantages such as very large scale, virtualization, high reliability, versatility, high scalability, on-demand service and low cost, it has developed at an unprecedented pace.
Meanwhile, different application scenarios place different requirements and emphases on the performance of a Hadoop cluster, so it is necessary to optimize Hadoop performance for each application scenario. In the Hadoop framework, all the tasks into which a job is divided are scheduled by the scheduler, making the scheduler a critical component. At the same time, in a distributed cluster, data transmission within the cluster should be avoided as much as possible and data locality should be improved.
However, existing Hadoop-based scheduling methods offer only mediocre timeliness and cannot meet the demands of ever-increasing data volumes.
Disclosure of Invention
The invention aims to provide a Hadoop-based scheduling method with good timeliness and high reliability.
The invention provides a Hadoop-based scheduling method, which comprises the following steps:
S1, the Fair Scheduler schedules tasks normally;
S2, when a Node requests a task, judging whether a task of the requested type exists, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result;
S3, judging whether the Node requesting the task has been marked, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result.
The Hadoop-based scheduling method further comprises the following step: when a new task request is submitted, all marks are cleared.
In step S2, when a Node requests a task, whether a task of the requested type exists is judged, and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result; specifically, the Container is selected by the following steps:
when a Node requests a task, judging whether a task of the requested type exists:
if a task of the requested type exists, preferentially selecting the Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceeding to the judgment of step S3.
In step S3, whether the Node requesting the task has been marked is judged, and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result; specifically, the Container is selected by the following steps:
when the judgment result of step S2 is that no task of the requested type exists, judging whether the Node requesting the task has been marked:
if the Node requesting the task has not been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application whose transmission time or waiting time is the smallest and less than 1; if no Container has a transmission time or waiting time less than 1, abandoning the task request;
if the Node requesting the task has been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application with the smallest transmission time or waiting time.
The waiting time is specifically calculated by the following formulas:
if all nodes run at the same speed, then:
where v is the running speed of the node, P is the progress, T_e is the elapsed running time, and T_l is the remaining running time;
if the nodes do not all run at the same speed, then:
where f(X, Y) is the remaining time of the task, n is the number of stages into which the executed task is divided, x_i denotes the i-th stage of the task being executed, X = {x_1, x_2, ..., x_i, ..., x_n} denotes the progress P of each stage, and Y represents the T_e of each stage.
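The formulas referenced above appear as images in the source and are not reproduced here. Purely for illustration, under an assumed progress-based model (an assumption, not necessarily the exact expressions used in the patent), the remaining time could be estimated as follows:

```latex
% Illustrative assumption only -- a standard progress-based estimate, not the patent's verified formula.
% Uniform node speed: remaining time T_l from elapsed time T_e and progress P (0 < P <= 1).
T_l = \frac{T_e}{P} - T_e = T_e \cdot \frac{1 - P}{P}
% Non-uniform speeds: sum the per-stage estimates, with x_i the progress and y_i the elapsed time of stage i.
f(X, Y) = \sum_{i=1}^{n} y_i \cdot \frac{1 - x_i}{x_i}
```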
By applying data-locality processing to the Hadoop scheduling process, the Hadoop-based scheduling method of the invention improves timeliness and has high reliability.
Drawings
Fig. 1 is a schematic flow chart of the scheduling method of the present invention.
Fig. 2 is a schematic diagram of the cluster structure used in the embodiments of the scheduling method of the present invention.
Fig. 3 is a diagram of the number of Map tasks not executed locally in the first embodiment of the scheduling method of the present invention.
Fig. 4 is a schematic diagram of the job execution time in the first embodiment of the scheduling method of the present invention.
Fig. 5 is a diagram of the number of Map tasks not executed locally in the second embodiment of the scheduling method of the present invention.
Fig. 6 is a schematic diagram of the job execution time in the second embodiment of the scheduling method of the present invention.
Fig. 7 is a diagram of the number of Map tasks not executed locally in the third embodiment of the scheduling method of the present invention.
Fig. 8 is a schematic diagram of the job execution time in the third embodiment of the scheduling method of the present invention.
Detailed Description
Fig. 1 is a schematic flow chart of the scheduling method of the present invention. The invention provides a Hadoop-based scheduling method comprising the following steps:
S1, the Fair Scheduler schedules tasks normally;
S2, when a Node requests a task, judging whether a task of the requested type exists, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result; specifically, the Container is selected by the following steps:
when a Node requests a task, judging whether a task of the requested type exists:
if a task of the requested type exists, preferentially selecting the Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceeding to the judgment of step S3;
S3, judging whether the Node requesting the task has been marked, and selecting the Container to be scheduled by the Fair Scheduler according to the judgment result; specifically, the Container is selected by the following steps:
when the judgment result of step S2 is that no task of the requested type exists, judging whether the Node requesting the task has been marked:
if the Node requesting the task has not been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application whose transmission time or waiting time is the smallest and less than 1; if no Container has a transmission time or waiting time less than 1, abandoning the task request;
if the Node requesting the task has been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application with the smallest transmission time or waiting time;
in a specific implementation, the waiting time is calculated by the following formulas:
if all nodes run at the same speed, then:
where v is the running speed of the node, P is the progress, T_e is the elapsed running time, and T_l is the remaining running time;
if the nodes do not all run at the same speed, then:
where f(X, Y) is the remaining time of the task, n is the number of stages into which the executed task is divided, x_i denotes the i-th stage of the task being executed, X = {x_1, x_2, ..., x_i, ..., x_n} denotes the progress P of each stage, and Y represents the T_e of each stage;
S4, when a new task request is submitted, all marks are cleared.
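Purely as an illustration of how the selection logic of steps S1 to S4 could be organized, a minimal sketch follows. All class, field and method names (LocalityAwareSelector, Container, markedNodes, and so on) are hypothetical and are not part of the Hadoop YARN or FairScheduler API; in particular, the assumption that a Node becomes marked when its request is abandoned is an inference from the description above, not a statement made in the source.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Minimal sketch of the locality-aware container selection described in steps S1-S4.
 * All names are hypothetical; this is not the Hadoop YARN / FairScheduler API.
 */
final class LocalityAwareSelector {

    /** The "less than 1" bound from step S3 (same units as the cost values below). */
    private static final double THRESHOLD = 1.0;

    private final Set<String> markedNodes = new HashSet<>();

    /** Step S4: a newly submitted task request clears all marks. */
    void onNewTaskRequest() {
        markedNodes.clear();
    }

    /**
     * Steps S2/S3: choose the Container the Fair Scheduler should schedule,
     * or return null to abandon the request.
     */
    Container select(String requestingNode, List<Container> candidates) {
        // Step S2: prefer a Container whose data is stored on the requesting Node.
        for (Container c : candidates) {
            if (c.dataNode.equals(requestingNode)) {
                return c;
            }
        }
        // Step S3: nothing local; fall back to the minimum transmission/waiting time.
        Container best = null;
        for (Container c : candidates) {
            if (best == null || c.cost < best.cost) {
                best = c;
            }
        }
        if (best == null) {
            return null; // no candidates at all
        }
        if (markedNodes.contains(requestingNode)) {
            return best; // marked Node: always take the minimum-cost Container
        }
        if (best.cost < THRESHOLD) {
            return best; // unmarked Node: only accept a cost below the bound
        }
        // Assumption (not stated explicitly in the source): abandoning the request
        // marks the Node so that its next request bypasses the threshold.
        markedNodes.add(requestingNode);
        return null;
    }

    /** Candidate container; dataNode is where its input data resides, cost is min(transmission, waiting) time. */
    static final class Container {
        final String dataNode;
        final double cost;

        Container(String dataNode, double cost) {
            this.dataNode = dataNode;
            this.cost = cost;
        }
    }
}
```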
The scheduling method of the present invention is further described below with reference to embodiments:
the algorithm is realized and deployed to a Hadoop experiment platform, a plurality of jobs are simulated to run, and the advantages and disadvantages of the data nature row of the algorithm are compared with those of a Fair Scheduler. And analyzing the experimental result to obtain an experimental conclusion.
First, the Hadoop source code is downloaded, compiled and debugged, with the following steps:
Download and install g++, CMake, zlib1g-dev, Maven, protobuf 2.5, findbugs and openssl, and configure the environment variables.
Decompress the source code and compile it into an Eclipse project in two steps: first switch to hadoop-maven-plugins under the source-code root directory and execute mvn install; after it succeeds, switch to the root directory and execute mvn eclipse:eclipse -DskipTests to generate the Eclipse project, then import the project into Eclipse.
Compile the source code with Maven by executing mvn package -Pdist -DskipTests -Dtar in the root directory.
Then deploy Hadoop.
The deployed Hadoop runs on the Linux operating system, version Ubuntu 14.04.2.
Because the hardware for a large-scale cluster was not available, simulation is used for the experiments. Mumak is used as the simulation software; it is a plug-in based on Hadoop and consists mainly of four components: a simulation engine with a discrete event queue, a ResourceManager that simulates job scheduling, NodeManagers that simulate the task-executing cluster, and the job-submitting component, the Job Client. These are the basic components of a Hadoop cluster: jobs are submitted by the Job Client, and the ResourceManager allocates resources for each job and stores the tasks into which each job is divided in its own internal data structures. A NodeManager acts as a processing node for tasks; when it is idle, it requests a task from the ResourceManager, the request is sent via heartbeat, and the heartbeat generates a series of events that are processed by the scheduler inside the ResourceManager. Mumak requires a system log in order to evaluate the processing time of tasks.
Fig. 2 shows the simulated Hadoop cluster, which is divided into two racks containing 11 nodes; one node is the NameNode of HDFS and the others are DataNodes. The NameNode also acts as the ResourceManager, the other nodes act as NodeManagers, and the hosts are connected by gigabit switches.
The algorithm is mainly based on comparing the waiting time with the transmission time. The waiting time is calculated by the formulas above; the configuration of the remaining transmission time is crucial: if it is set too small, most tasks cannot be executed locally, and if it is set too large, the response time of tasks is prolonged. Because data between racks passes through two cascaded switches, transmitting data between different racks takes longer than transmitting data within the same rack. Table 1 details the measured transfer times of data blocks of different sizes between and within racks.
Table 1 Cluster data-block transmission times
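Since Table 1 itself is not reproduced in this text, the following small example is illustrative only: it estimates intra-rack versus cross-rack transfer times for a data block under assumed effective bandwidths. The bandwidth values are hypothetical placeholders, not the measured values of Table 1.

```java
/**
 * Illustrative only: estimates intra-rack vs. cross-rack transfer time for a data block.
 * The bandwidth values are hypothetical placeholders, not the measured values of Table 1.
 */
public final class TransferTimeEstimator {

    // Assumed effective bandwidths (MB/s) for a gigabit-switched, two-rack cluster.
    private static final double INTRA_RACK_MB_PER_S = 110.0;
    private static final double CROSS_RACK_MB_PER_S = 55.0;

    /** Estimated transfer time in seconds for a block of the given size. */
    public static double transferSeconds(double blockSizeMb, boolean sameRack) {
        double bandwidth = sameRack ? INTRA_RACK_MB_PER_S : CROSS_RACK_MB_PER_S;
        return blockSizeMb / bandwidth;
    }

    public static void main(String[] args) {
        // Example: a 128 MB HDFS block.
        System.out.printf("intra-rack: %.2f s%n", transferSeconds(128, true));
        System.out.printf("cross-rack: %.2f s%n", transferSeconds(128, false));
    }
}
```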
The criteria for evaluating an excellent cluster are multifaceted; the most important include job response time, job processing time, job data locality, cluster load balancing and cluster network load. Here we mainly compare job processing time and data locality between the FairScheduler and the FairScheduler incorporating the improved data-locality algorithm.
There are three jobs in total, each with a different data volume and data block size; the number of tasks of each job is 40, and the number of tasks a node can process simultaneously is the default value, as shown in Table 2 below:
Table 2 Case 1 job data volume and task count
The experimental results for case 1 are shown in Fig. 3 and Fig. 4. As can be seen from the figures, in case 1 the number of tasks not executed locally is reduced by 11% on average for the FairScheduler incorporating the improved data-locality algorithm compared with the plain FairScheduler. Along with the reduction in tasks not executed locally, the overall execution time of the jobs is also reduced, by 10% on average.
To verify whether the results still hold when the number of tasks differs from that of case 1, this embodiment considers case 2, in which the number of tasks is 60 (the only difference from case 1) and the number of tasks a node can process simultaneously remains the default value, as shown in Table 3 below:
Table 3 Case 2 job data volume and task count
The experimental results for case 2 are shown in Fig. 5 and Fig. 6 below. As can be seen from the figures, in case 2 the number of tasks not executed locally is reduced by 14% on average for the FairScheduler incorporating the improved data-locality algorithm compared with the plain FairScheduler, while the execution time is reduced by 9% on average.
This embodiment also considers case 3: three jobs are again used, each with a different data volume and data block size, and the number of tasks a node can process simultaneously is configured as 4. This differs from case 1 in order to verify whether the results still hold when the number of tasks a single node can process changes. Details are shown in Table 4 below:
Table 4 Case 3 job data volume and task count
As can be seen from Fig. 7, in case 3 the number of tasks not executed locally is reduced by 13% on average for the FairScheduler incorporating the improved data-locality algorithm compared with the plain FairScheduler. Meanwhile, as shown in Fig. 8, the execution time of the whole job is reduced by 8% on average.
In summary, the experiments for case 1, case 2 and case 3, in which the jobs have relatively small numbers of tasks, verify that the FairScheduler scheduling algorithm incorporating the improved data-locality algorithm reduces the number of tasks not executed locally and reduces the overall execution time of jobs when the number of tasks per job is small.
Claims (4)
1. A Hadoop-based scheduling method, comprising the following steps:
S1, scheduling tasks normally by a Fair Scheduler;
S2, when a Node requests a task, judging whether a task of the requested type exists, and selecting a Container to be scheduled by the Fair Scheduler according to the judgment result;
S3, judging whether the Node requesting the task has been marked, and selecting a Container to be scheduled by the Fair Scheduler according to the judgment result.
2. The Hadoop-based scheduling method according to claim 1, further comprising the step of: when a new task request is submitted, clearing all marks;
wherein in step S2, when a Node requests a task, whether a task of the requested type exists is judged and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result, specifically by the following steps:
when a Node requests a task, judging whether a task of the requested type exists:
if a task of the requested type exists, preferentially selecting the Container whose data is stored on the Node that sent the request;
if no task of the requested type exists, proceeding to the judgment of step S3.
3. The Hadoop-based scheduling method according to claim 2, wherein in step S3 whether the Node requesting the task has been marked is judged and the Container to be scheduled by the Fair Scheduler is selected according to the judgment result, specifically by the following steps:
when the judgment result of step S2 is that no task of the requested type exists, judging whether the Node requesting the task has been marked:
if the Node requesting the task has not been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application whose transmission time or waiting time is the smallest and less than 1; if no Container has a transmission time or waiting time less than 1, abandoning the task request;
if the Node requesting the task has been marked, selecting, as the Container to be scheduled by the Fair Scheduler, the Container in the application with the smallest transmission time or waiting time.
4. The Hadoop-based scheduling method according to claim 3, wherein the waiting time is calculated by the following formulas:
if all nodes run at the same speed, then:
where v is the running speed of the node, P is the progress, T_e is the elapsed running time, and T_l is the remaining running time;
if the nodes do not all run at the same speed, then:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011390100.1A CN112379990A (en) | 2020-12-02 | 2020-12-02 | Hadoop-based scheduling method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112379990A true CN112379990A (en) | 2021-02-19 |
Family
ID=74589561
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011390100.1A Pending CN112379990A (en) | 2020-12-02 | 2020-12-02 | Hadoop-based scheduling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112379990A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140310712A1 (en) * | 2013-04-10 | 2014-10-16 | International Business Machines Corporation | Sequential cooperation between map and reduce phases to improve data locality |
CN111045795A (en) * | 2018-10-11 | 2020-04-21 | 浙江宇视科技有限公司 | Resource scheduling method and device |
Non-Patent Citations (4)
Title |
---|
Hadoop official documentation: "HDFS Architecture", hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html *
Hadoop official documentation: "HDFS Architecture", hadoop.apache.org/docs/r3.3.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html, 6 July 2020 (2020-07-06) *
Tom White: "Hadoop: The Definitive Guide, Fourth Edition", O'Reilly, 17 April 2015, pages 185-192 *
Fu Qingwu: "Research on Data-Aware Scheduling Strategy for MapReduce Jobs", Master's Thesis, Jilin University, no. 10, 15 October 2012 (2012-10-15) *
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210219