CN108536528A

CN108536528A - Using the extensive network job scheduling method of perception

Info

Publication number: CN108536528A
Application number: CN201810245680.1A
Authority: CN
Inventors: 唐小勇; 李肯立; 刘楚波
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2018-03-23
Filing date: 2018-03-23
Publication date: 2018-09-14

Abstract

The invention discloses a kind of extensive network job scheduling methods of application perception.This method is mainly made of the following steps：The first step, user submit interactive interface to submit operation by grid work.Second step, system predict Activity Calculation amount according to operational feature information.Third walks, the large-scale distributed grid node current state information of system queries.4th step, job scheduler accordingly can calculate nodes according to the lookup of operation application demand.5th step, distribution operation to corresponding calculate node.6th step judges in job queue whether all operations dispatch and finishes, do not dispatch and finish, and recycles and executes the 4th step and the 5th step, otherwise waits for next dispatching point and execute this method again.Using this method, grid system resource utilization rate can be effectively improved and meet user demand by being compared compared with the existing job scheduling method based on first-come-first-served policy.

Description

Application-aware Large-Scale Grid Job Scheduling Method

技术领域technical field

本发明属于计算机软件以及大规模并行分布式处理系统资源管理与任务调度技术领域，涉及一种应用感知的大规模网格作业调度方法。The invention belongs to the technical field of computer software and large-scale parallel distributed processing system resource management and task scheduling, and relates to an application-aware large-scale grid job scheduling method.

背景技术Background technique

中国国家网格是聚合了高性能计算和事务处理能力的新一代信息基础设施试验床。该网格通过资源共享、协同工作和服务机制，能有效支持科学研究、资源环境、先进制造和信息服务等应用。其主要立足于已有的高性能计算环境基础，重点研究高性能计算环境的应用服务优化关键技术，进一步完善资源建设机制，建立具有新型运行机制和丰富应用资源的、实用型的高性能计算应用服务环境和应用领域社区，降低高性能计算应用成本，全面提升高性能计算应用服务水平。The China National Grid is a new-generation information infrastructure test bed that aggregates high-performance computing and transaction processing capabilities. Through resource sharing, collaborative work and service mechanisms, the grid can effectively support applications such as scientific research, resource environment, advanced manufacturing and information services. It is mainly based on the existing high-performance computing environment foundation, focusing on research on key technologies for application service optimization in high-performance computing environments, further improving the resource construction mechanism, and establishing practical high-performance computing applications with new operating mechanisms and rich application resources Serve the environment and application community, reduce the cost of high-performance computing applications, and comprehensively improve the service level of high-performance computing applications.

目前，中国国家网格已经聚合北方主结点、南方主结点、国家超算中心和普通结点，形成了大规模分布式计算网格。其中，国家超级计算无锡中心拥有世界上首台峰值运算性能超过每秒十亿亿次浮点运算能力的超级计算机——“神威·太湖之光”，运算系统全面采用了国产“申威26010”众核处理器，也是我国第一台全部采用国产处理器构建的世界排名第一的超级计算机。国家超级计算广州中心配置天河二号系统，一期峰值计算速度每秒5.49亿亿次，持续计算速度每秒3.39亿亿次，能效比每瓦特19亿次双精度浮点运算。At present, the China National Grid has aggregated the main nodes in the north, the main nodes in the south, the national supercomputing center and ordinary nodes, forming a large-scale distributed computing grid. Among them, the National Supercomputing Wuxi Center has the world's first supercomputer with a peak computing performance exceeding one billion floating point operations per second - "Sunway Taihu Light". The computing system fully adopts the domestic "Sunway 26010" The many-core processor is also my country's first world-ranked supercomputer built entirely with domestically produced processors. The National Supercomputing Guangzhou Center is equipped with the Tianhe-2 system. The peak computing speed of the first phase is 549 petaflops, the continuous computing speed is 339 petaflops, and the energy efficiency ratio is 1.9 billion double-precision floating-point operations per watt.

国家超级计算天津中心装备有峰值性能达每秒4700万亿次的“天河一号”高效能计算机系统，该系统在2010的HPC TOP500排名中位列世界第一。另外装备有计算性能达到百万亿次的天河－天腾(TH-1)系统；包含128个Intel-EX5675CPU的天河－天翔系统；以及包含96个CPU的天河－天驰系统。这些超算中心都具有相当惊人的计算与存储能力，而普通结点能进行大规模计算，如清华大学部署的浪潮“天梭”高效能计算机系统，系统通用处理器的计算能力为每秒104万亿次，GPU加速部件的计算能力为每秒64万亿次；中科院深圳先进技术研究院采用曙光5000A集群系统，具有通用计算能力每秒10万亿次，专用计算能力每秒200万亿次，存储能力500TB，内部数据交换能力达2GB/s，整个系统资源可用率超过90％；甘肃省计算中心配备有计算峰值每秒40万亿次的高性能科学与工程计算集群，容量35TB的网络分布式存储系统。香港大学拥有包括香港大学计算机系和计算机中心两个组织的多个集群，通用计算能力峰值每秒23.45万亿次，专用计算能力每秒7.7万亿次。The National Supercomputing Tianjin Center is equipped with the "Tianhe-1" high-efficiency computer system with a peak performance of 4,700 trillion operations per second, which ranked first in the world in the 2010 HPC TOP500 ranking. In addition, it is equipped with Tianhe-Tianteng (TH-1) system with exascale computing performance; Tianhe-Tianxiang system with 128 Intel-EX5675 CPUs; and Tianhe-Tianchi system with 96 CPUs. These supercomputing centers all have quite astonishing computing and storage capabilities, and ordinary nodes can perform large-scale computing, such as the Inspur "Tianshuo" high-performance computer system deployed by Tsinghua University, the system's general-purpose processor has a computing capability of 1.04 million per second The calculation capability of the GPU acceleration component is 64 trillion times per second; the Shenzhen Institute of Advanced Technology of the Chinese Academy of Sciences adopts the Sugon 5000A cluster system, which has a general computing capacity of 10 trillion times per second and a special computing capacity of 200 trillion times per second. The storage capacity is 500TB, the internal data exchange capacity is up to 2GB/s, and the resource availability rate of the entire system exceeds 90%. The Gansu Provincial Computing Center is equipped with a high-performance scientific and engineering computing cluster with a peak computing value of 40 trillion times per second, and a network distribution capacity of 35TB storage system. The University of Hong Kong has multiple clusters including two organizations, the Department of Computer Science and the Computer Center of the University of Hong Kong, with a peak general computing capacity of 23.45 trillion operations per second and a dedicated computing capacity of 7.7 trillion operations per second.

然而，如何充分有效的利用这些资源，提高用户作业完成效率和降低用户成本，其核心是用户作业调度。作业调度的主要目标是在满足用户时间期限、成本等约束关系条件下，将可用户作业依据调度策略确定作业分配方案和作业执行顺序，以满足用户和系统要求。目前，中国国家网格上的作业调度仅采用简单的先来先服务策略，效率相对较低，因而已影响到国家网格的有效应用。本发明针对此问题，试图对计算节点能支持的应用、用户应用和计算节点资源现状进行综合平衡决策，以期实现系统作业的高效调度，从而为国家网格及其他大规模分布式系统提供有效的技术支持。However, the core of how to make full and effective use of these resources, improve the efficiency of user job completion and reduce user costs is user job scheduling. The main goal of job scheduling is to determine the job allocation plan and job execution order of user jobs according to the scheduling strategy to meet user and system requirements under the constraints of user time limit and cost. At present, the job scheduling on China's national grid only adopts a simple first-come, first-served strategy, which has relatively low efficiency, which has affected the effective application of the national grid. To solve this problem, the present invention attempts to make a comprehensive balance decision on the applications supported by computing nodes, user applications, and the status quo of computing node resources, in order to achieve efficient scheduling of system operations, thereby providing effective information for national grids and other large-scale distributed systems. Technical Support.

发明内容Contents of the invention

本发明针对国家网格及其他大规模分布式系统由于资源的异构性、地域分布性和用户应用多样性导致的效率低下问题，提出应用感知的大规模网格作业调度方法。为解决上述问题，本发明所采用的技术方案为：The invention proposes an application-aware large-scale grid job scheduling method aiming at the problem of inefficiency caused by resource heterogeneity, regional distribution and user application diversity in the national grid and other large-scale distributed systems. In order to solve the above problems, the technical solution adopted in the present invention is:

一种应用感知的大规模网格作业调度方法，其特征在于，包括如下步骤：An application-aware large-scale grid job scheduling method, characterized in that it includes the following steps:

步骤1：用户向网格系统提交作业，网格系统将作业信息存入相应的数据库网格作业表中，然后将此作业根据处理情况插入基于多线程共享机制的作业队列，所述的作业队列包括就绪队列、正在运行队列和结果反馈队列，所述的就绪队列用于排列等待处理的作业，所述的正在运行队列用于排列调度到某个计算节点运行的作业，所述的结果反馈队列用于排列运行结束的作业并且在队列中保存从节点返回的结果；Step 1: The user submits a job to the grid system, and the grid system stores the job information in the corresponding database grid job table, and then inserts the job into the job queue based on the multi-thread sharing mechanism according to the processing situation. It includes a ready queue, a running queue and a result feedback queue. The ready queue is used to queue jobs waiting to be processed, the running queue is used to queue jobs scheduled to run on a certain computing node, and the result feedback queue It is used to queue the finished jobs and save the results returned from the nodes in the queue;

步骤2：根据预先建立的用户作业运行时间模型，对用户提交的作业进行运行时间预测；Step 2: Predict the running time of the job submitted by the user according to the pre-established user job running time model;

步骤3：实时查询大规模分布式网格节点当前状态信息，并存入数据库网格节点资源信息表；Step 3: Query the current status information of large-scale distributed grid nodes in real time, and store them in the database grid node resource information table;

步骤4：依据作业的应用需求查找可用于作业计算的网格节点即可计算节点；Step 4: Find the grid nodes that can be used for job calculation according to the application requirements of the job to calculate the node;

步骤5：在步骤4找到的可计算节点集上查找满足用户作业运行时间期限的节点作为可调度节点集，然后从可调度节点集中选择资源利用率最低的节点，作为执行作业的网格节点；Step 5: On the computational node set found in step 4, find the nodes that meet the running time limit of the user's job as the schedulable node set, and then select the node with the lowest resource utilization rate from the schedulable node set as the grid node for executing the job;

步骤6：分配作业到步骤5中得到的网格节点；Step 6: assign jobs to the grid nodes obtained in step 5;

步骤7：判断作业队列中所有作业是否调度完毕，如调度完毕则等待下一个调度点；否则，从作业队列中取作业，回到步骤4循环执行。Step 7: Determine whether all jobs in the job queue have been scheduled. If the job is scheduled, wait for the next scheduling point; otherwise, take the job from the job queue and return to step 4 for cyclic execution.

所述的一种应用感知的大规模网格作业调度方法，所述的步骤1中，所述的作业信息包括用户ID、作业ID、应用软件需求、版本号、License、节点数、CPU数、众核数、运行时间、作业数据量和期望完成时间。In the above-described application-aware large-scale grid job scheduling method, in the step 1, the job information includes user ID, job ID, application software requirements, version number, license, number of nodes, number of CPUs, Number of cores, running time, job data volume, and expected completion time.

所述的一种应用感知的大规模网格作业调度方法，所述的步骤2中，所述的用户作业运行时间模型是基于系统历史作业在各计算节点运行的历史数据库，描述为：<Job_i,Time_i,j>；其中Job_i的应用特征包括数据量Jd(Job_i)、作业规模Js(Job_i)；并通过以下公式计算待调度作业Job与历史作业Job_i在应用特征上的相近度ρ_i In the above-mentioned application-aware large-scale grid job scheduling method, in the step 2, the user job running time model is based on the historical database of system historical jobs running on each computing node, described as: <Job _i ,Time _i,j >; where the application characteristics of Job _i include data volume Jd(Job _i ), job scale Js(Job _i ); and calculate the application characteristics of the job to be scheduled Job and the historical job Job _i by the following formula Similarity ρ _i

ρ_i＝|Jd(job_i)-Jd(job)|+|Js(job_i)-Js(job)|ρ _i ＝|Jd(job _i )-Jd(job)|+|Js(job _i )-Js(job)|

取相近度ρ_i最小的历史作业运行时间Time_i,j作为作业在各网格计算节点的预测运行时间。Take the historical job running time Time _i,j with the smallest similarity ρ _i as the predicted running time of the job on each grid computing node.

所述的一种应用感知的大规模网格作业调度方法，所述的步骤3中，所述的大规模分布式网格节点当前状态信息包括在线运行作业量、资源利用率、节点可支持的硬件及软件信息。In the above-mentioned application-aware large-scale grid job scheduling method, in the step 3, the current status information of the large-scale distributed grid nodes includes online running job volume, resource utilization rate, node support Hardware and software information.

所述的一种应用感知的大规模网格作业调度方法，所述的步骤5包括以下过程：In the described method for large-scale grid job scheduling with application awareness, the step 5 includes the following process:

步骤5.1：依据实时节点资源信息，初始化可计算节点集作业开始时间数据结构job_node_starttime；Step 5.1: According to the real-time node resource information, initialize the data structure job_node_starttime of the computable node set job start time;

步骤5.2：计算用户作业历史提交运行时间与实际运行时间的偏离值θ，如下式所示：Step 5.2: Calculate the deviation value θ between the running time submitted by the user's job history and the actual running time, as shown in the following formula:

其中，分别为用户提交的第i个作业提交运行时间和实际运行时间，m为该用户在网格系统提交的作业数；in, The running time and actual running time of the i-th job submitted by the user are respectively, and m is the number of jobs submitted by the user in the grid system;

步骤5.3：如果偏离值θ<0.2，作业调度决策所采用的作业运行时间为用户提交的作业信息中的运行时间，否则采用步骤2系统预测的作业运行时间为调度决策的依据；Step 5.3: If the deviation value θ<0.2, the job running time used in the job scheduling decision is the running time in the job information submitted by the user, otherwise, the job running time predicted by the system in step 2 is used as the basis for the scheduling decision;

步骤5.4：利用步骤5.3所获得的作业运行时间与作业在节点的可开始执行时间，依次计算可计算节点集中各节点作业完成时间job_node_endtime；Step 5.4: Using the job running time obtained in step 5.3 and the start execution time of the job on the node, sequentially calculate the job completion time job_node_endtime of each node in the computable node set;

步骤5.5：依据作业在各可计算节点上完成时间判断其是否满足作业运行时间期限，如满足把此节点放入可调度节点集；Step 5.5: According to the completion time of the job on each computable node, judge whether it meets the job running time limit, if so, put this node into the schedulable node set;

步骤5.6：从可调度节点集中选择资源利用率最低的节点，把作业调度到此网格节点。Step 5.6: Select the node with the lowest resource utilization rate from the schedulable node set, and schedule the job to this grid node.

所述的一种应用感知的大规模网格作业调度方法，所述的步骤6中，分配作业到相应计算节点，包括系统向计算节点提出作业计算请求、作业数据传输、作业执行状态查询和系统调度信息反馈给用户。In the above-mentioned application-aware large-scale grid job scheduling method, in step 6, assigning jobs to corresponding computing nodes includes the system submitting job computing requests to computing nodes, job data transmission, job execution status query and system Scheduling information is fed back to the user.

本发明的技术效果在于，采用本发明提供的方法，较之现有基于先来先服务策略的作业调度方法相比能有效提高网格系统资源利用率和满足用户需求。The technical effect of the present invention is that, compared with the existing job scheduling method based on the first-come-first-serve policy, the method provided by the present invention can effectively improve the resource utilization rate of the grid system and meet user needs.

下面结合附图对本发明作进一步说明。The present invention will be further described below in conjunction with accompanying drawing.

附图说明Description of drawings

图1是本发明提供的应用感知的大规模网格作业调度方法流程图；Fig. 1 is a flow chart of an application-aware large-scale grid job scheduling method provided by the present invention;

图2是本发明实施提供的大规模网格全局调度器体系结构图；Fig. 2 is a large-scale grid global scheduler architecture diagram provided by the implementation of the present invention;

图3是本发明实施提供的用户作业交互界面图；Fig. 3 is a user operation interface diagram provided by the implementation of the present invention;

图4是作业状态转换图。Figure 4 is a diagram of job state transitions.

具体实施方式Detailed ways

本发明提出了一种应用感知的大规模网格作业调度方法，其流程图如图1所示。该方法基于大规模中国国家网格，以用户作业应用特性来出发点，提出满足作业期限及应用需求的大规模分布式作业调度方法，能有效提高中国国家网格的作业调度效率。The present invention proposes an application-aware large-scale grid job scheduling method, the flowchart of which is shown in FIG. 1 . This method is based on the large-scale China National Grid, starting from the user's job application characteristics, and proposes a large-scale distributed job scheduling method that meets the job deadline and application requirements, which can effectively improve the job scheduling efficiency of the China National Grid.

本发明通过下述技术方案实现：The present invention realizes through following technical scheme:

本实施例面向大规模分布式计算系统中国国家网格，在全局作业调度器中把作业调度到各计算节点，以期达到整合大规模计算节点的空闲资源、充分利用并提高中国国家网格服务质量。本实施例的开发基于Linux平台，采用MySql数据库。其体系结构如图2所示，全局作业调度器主要由调度决策模块、作业分配模块、信息采集模块和通信模块。其中调度决策模块集成调度算法，对作业与相应计算节点调度进行决策；作业执行模块根据调度决策模块所选择的计算节点，自动生成提交当前作业到该计算中心指令；信息收集模块负责收集各计算节点资源状况(如，空闲服务器数目、CPU核数、内存使用等)，同时还能收集作业完成状况信息，如作业正在运行、预计完成时间等，能实时把这些信息反馈给用户；通信模块负责用户主程序与各计算节点进行具体通信，如将作业执行模块生成的指令发送给相应计算节点，从而在该节点提交并执行作业。This embodiment is oriented to the large-scale distributed computing system China National Grid, and schedules jobs to each computing node in the global job scheduler, in order to achieve the integration of idle resources of large-scale computing nodes, make full use of and improve the service quality of China National Grid . The development of this embodiment is based on the Linux platform, using the MySql database. Its architecture is shown in Figure 2. The global job scheduler is mainly composed of scheduling decision-making module, job assignment module, information collection module and communication module. Among them, the scheduling decision-making module integrates scheduling algorithms to make decisions on the scheduling of jobs and corresponding computing nodes; the job execution module automatically generates instructions to submit the current job to the computing center according to the computing nodes selected by the scheduling decision-making module; the information collection module is responsible for collecting information from each computing node Resource status (such as the number of idle servers, number of CPU cores, memory usage, etc.), and can also collect job completion status information, such as the job is running, the estimated completion time, etc., and can feedback this information to the user in real time; the communication module is responsible for the user The main program communicates with each computing node, such as sending the instructions generated by the job execution module to the corresponding computing node, so that the job is submitted and executed on the node.

用户首先通过网格系统如图3所示的作业交互界面提交作业。主要作业信息有用户ID、作业ID、应用软件需求、版本号、License、节点数、CPU数、众核数、运行时间、作业数据量和期望完成时间等。当用户点点击作业提交按钮后，本实施例首先把用户作业信息存入MySql数据库网格作业表中，此表的主要属性有：jobId－作业号、jobName－作业名称、software_type应用软件需求、software_system软件环境、resource_type资源类型、resource_value－相应类型下的资源值、cpu－用户所需的cpu、runtime－用户估计的作业运行时间、deadline－用户希望在这个时间之前完成作业、budget－用户预算、priority－作业优先级、memory－作业所需内存、disk－作业所需磁盘大小。然后，把此作业插入基于多线程共享机制的作业队列。作业队列是本专利用来管理用户提交作业的一种技术，当作业准备调度到某个计算节点运行时，作业要从“就绪队列”转移到“正在运行队列”；当作业运行结束时，作业要就从正在运行队列转移到结果反馈队列，并且在队列中保存从节点返回的结果。同一时间可能会有多个用户提交作业，也就是对同一队列进行操作，所以要保证队列操作的互斥性。同时，也要保持一定的并发性。其作业状态转换图如图4所示。The user first submits the job through the job interaction interface shown in Figure 3 of the grid system. The main job information includes user ID, job ID, application software requirements, version number, license, number of nodes, number of CPUs, number of cores, running time, job data volume, and expected completion time. When the user clicks the job submission button, this embodiment first stores the user job information in the MySql database grid job table. The main attributes of this table are: jobId-job number, jobName-job name, software_type application software requirements, software_system Software environment, resource_type resource type, resource_value—the resource value under the corresponding type, cpu—the cpu required by the user, runtime—the estimated running time of the job by the user, deadline—the user wants to complete the job before this time, budget—the user budget, priority - job priority, memory - memory required by the job, disk - disk size required by the job. Then, insert this job into the job queue based on the multi-thread sharing mechanism. Job queue is a technology used by this patent to manage jobs submitted by users. When a job is scheduled to run on a certain computing node, the job will be transferred from the "ready queue" to the "running queue"; when the job finishes running, the job If necessary, transfer from the running queue to the result feedback queue, and save the result returned from the node in the queue. Multiple users may submit jobs at the same time, that is, operate on the same queue, so the mutual exclusion of queue operations must be guaranteed. At the same time, a certain amount of concurrency must be maintained. Its operation status transition diagram is shown in Figure 4.

本实施例第二步对用户提交的作业进行运行时间预测。本实施例建立用户作业在各计算节点运行的历史数据库。主要采集了作业参数，如用户号/用户名、作业号、作业使用CPU数目、作业所在队列、作业工作目录、作业命令、作业提交时间、作业开始运行时间、作业运行结束时间和作业退出原因；网格计算节点参数，如集群、主机名称、时间：在什么时候采集的数据、CPU占用率、CPU微架构数据、CPU浮点计算能力、CPU指令执行速度、末级缓存命中率、内存占用率、内存读写带宽、Infiniband网络使用情况、以太网使用情况和磁盘/NFS使用情况。本实施例针对所有作业以及分用户、分应用的作业统计了运行时间的分布，发现大量作业为短时作业。同时对于作业数足够多的统计结果显示，其作业时长分布为幂律分布。其次，用户提交作业的时间具有成团聚集的性质，而且距离时间较长的两团作业，其运行时间的差距也较大。最后，本专利研究用户行为习惯，结合应用特征和用户作业特征，建立用户的特定应用运行时间模型，从而利用用户作业信息与此模型获得作业运行时间的预测值。本实施例所指的用户作业运行时间模型是基于系统历史作业在各计算节点运行的历史数据库，描述为：<Job_i,Time_i,j>；其中Job_i的应用特征包括数据量Jd(Job_i)、作业规模Js(Job_i)；并通过以下公式计算待调度作业Job与历史作业Job_i在应用特征上的相近度ρ_i The second step of this embodiment predicts the running time of the job submitted by the user. In this embodiment, a historical database of user jobs running on each computing node is established. Mainly collect job parameters, such as user number/user name, job number, number of CPUs used by the job, job queue, job working directory, job command, job submission time, job start time, job end time, and job exit reason; Grid computing node parameters, such as cluster, host name, time: when to collect data, CPU occupancy rate, CPU microarchitecture data, CPU floating-point computing capability, CPU instruction execution speed, last-level cache hit rate, memory occupancy rate , memory read and write bandwidth, Infiniband network usage, Ethernet usage, and disk/NFS usage. In this embodiment, the running time distribution of all jobs and jobs by users and applications is counted, and it is found that a large number of jobs are short-term jobs. At the same time, the statistical results of a sufficient number of jobs show that the distribution of job duration is a power-law distribution. Secondly, the time for users to submit jobs has the nature of clustering, and there is a large gap between the running time of two jobs with a long distance. Finally, this patent studies user behavior habits, combines application characteristics and user job characteristics, and establishes a user-specific application running time model, thereby using user job information and this model to obtain a predicted value of job running time. The user job running time model referred to in this embodiment is based on the historical database of system historical jobs running on each computing node, described as: <Job _i ,Time _i,j >; wherein the application characteristics of Job _i include the data volume Jd(Job _i ), the job scale Js(Job _i ); and calculate the similarity ρ _i between the job to be scheduled and the historical job _i in terms of application characteristics by the following formula

本实施例第三步实时查询大规模分布式网格节点当前状态信息，如在线运行作业量、资源利用率、节点可支持的硬件及软件信息等，并将其存入数据库网格节点资源信息表。此表的主要字段node_Id－节点Id、node_name－节点名称、resource_type－资源类型(表示对应节点拥有的资源类型)、resource_software(表明用户可以使用的应用软件)、wait_job_num－目前在等待运行的作业数、waittime－用户大概需要等多长时间才能执行、cpu－目前空闲的cpu核数、memory－目前可用内存状态、disk－目前剩余的磁盘容量、max_runtime－允许用户作业运行的最长时间、predict_runtime－预测的作业运行时间、net_delay－网络延迟(跨中心调度的时候需要考虑网络延迟问题)、cpu_usage－cpu使用率、memory_usage－内存使用率。此部分实现采用C/S模型，客户端负责实时收集各网格节点当前状态信息，服务器负责接收信息并存入Mysql数据库节点资源信息表中。其通信采用Linux平台下socket套接字，利用select函数来监听套接字，并且建立连接来实现通信。通信协议主要采用面向连接的TCP协议，这样能够确保信息的安全性。在并发访问上主要采用多线程复用方式。The third step of this embodiment is to query the current state information of large-scale distributed grid nodes in real time, such as online operation workload, resource utilization rate, hardware and software information supported by nodes, etc., and store them in the database grid node resource information surface. The main fields of this table are node_Id—node ID, node_name—node name, resource_type—resource type (indicating the type of resource owned by the corresponding node), resource_software (indicating the application software that the user can use), wait_job_num—the number of jobs currently waiting to run, waittime - How long does the user need to wait to execute, cpu - the current number of idle CPU cores, memory - the current available memory status, disk - the current remaining disk capacity, max_runtime - the maximum time allowed for user jobs to run, predict_runtime - prediction The job running time, net_delay-network delay (network delay needs to be considered when cross-center scheduling), cpu_usage-cpu usage, memory_usage-memory usage. The implementation of this part adopts the C/S model. The client is responsible for collecting the current status information of each grid node in real time, and the server is responsible for receiving the information and storing it in the node resource information table of the Mysql database. Its communication adopts the socket socket under the Linux platform, uses the select function to monitor the socket, and establishes a connection to realize the communication. The communication protocol mainly adopts the connection-oriented TCP protocol, which can ensure the security of information. Multi-thread multiplexing is mainly used for concurrent access.

第四步依据作业的应用需求查找相应的网格计算节点。网格作业表中software_type和software_system是作业运行所需的应用软件和支撑软件环境需求。本实施例将利用这些信息在节点资源中查找满足要求的节点资源集，其实现中用到了如下数据结构和队列操作函数。当前需要调度的作业信息，从当前网格节点队列中筛选出符合应用需求的队列，并将该队列名称和网格节点存储到一个数组中。关键函数为intsw_info_read(void*job_info,void*sw_match[])，其中函数参数为一个作业指针，一个结构体sw_match_info类型的指针数组。该数组存储符合软件要求的网格队列的sw_match_info结构体指针。返回值为符合软件要求的网格队列的数量。在此函数中，给定常量GN_NUM，QUEUE_NUM，SOFT_NUM(网格节点个数，队列个数，软件个数)的乘积来生成候选的网格队列个数。该函数调用完成之后得到符合软件匹配的网格队列网格ID和队列名称信息。其次，根据当前得到的sw_mathch数组，把与该数组网格ID和队列名称的队列放入链表当中。关键函数为void queue_info_read(List*l,void*sw_match[])。其中函数参数中的List*l，为存储符合软件匹配的队列链表。然后，根据当前作业信息，若存储软件匹配的队列信息链表队列满足硬件需求，则将该队列信息放入硬件匹配队列链表中，如不满足则直接从软件匹配链表中删除该队列。这样得到经过硬件筛选后的节点队列链表。关键函数为intqueShift(List*sw_list,List*hw_list,void*data)，第二个参数最终得到的硬件匹配网格队列的链表指针，返回值为经过硬件匹配之后网格队列个数。以此实现作业可计算节点集查找。The fourth step is to find the corresponding grid computing nodes according to the application requirements of the job. The software_type and software_system in the grid job table are the application software and supporting software environment requirements required for job running. In this embodiment, these information will be used to search for a set of node resources that meet the requirements in the node resources, and the following data structures and queue operation functions are used in its implementation. For the current job information that needs to be scheduled, select the queue that meets the application requirements from the current grid node queue, and store the queue name and grid node in an array. The key function is intsw_info_read(void*job_info, void*sw_match[]), where the function parameter is a job pointer and a pointer array of structure sw_match_info type. This array stores the sw_match_info structure pointers of grid queues that meet software requirements. The return value is the number of grid queues that meet the software requirements. In this function, the product of constants GN_NUM, QUEUE_NUM, SOFT_NUM (the number of grid nodes, the number of queues, and the number of software) is given to generate the number of candidate grid queues. After the function call is completed, the grid queue grid ID and queue name information matching the software will be obtained. Secondly, according to the currently obtained sw_mathch array, put the queue with the array grid ID and queue name into the linked list. The key function is void queue_info_read(List*l, void*sw_match[]). Among them, List*l in the function parameter is a queue chain list for storing conforming software matching. Then, according to the current job information, if the queue information linked list for storing software matching meets the hardware requirements, put the queue information into the hardware matching queue linked list, and delete the queue directly from the software matching linked list if not. In this way, the node queue linked list after hardware screening is obtained. The key function is intqueShift(List*sw_list,List*hw_list,void*data), the second parameter finally gets the linked list pointer of the hardware matching grid queue, and the return value is the number of grid queues after hardware matching. This implements job computable node set lookups.

第五步作业调度器在可计算节点集基础上实现满足用户作业运行时间期限下的作业调度。本实施例先利用节点资源实时信息建立可计算节点集作业开始时间数据结构job_node_starttime，然后计算用户作业历史提交运行时间与实际运行时间的偏离值θ。具体计算过程如下：首先依据实时节点资源信息，初始化可计算节点集作业开始时间数据结构job_node_starttime；然后计算用户作业历史提交运行时间与实际运行时间的偏离值θ，如下式所示：The fifth step is that the job scheduler implements job scheduling that meets the user's job running time limit on the basis of the computable node set. In this embodiment, the real-time information of node resources is used to establish the data structure job_node_starttime, which can calculate the start time of the node set job, and then calculate the deviation value θ between the running time submitted by the user's job history and the actual running time. The specific calculation process is as follows: first, based on the real-time node resource information, initialize the data structure job_node_starttime of the computable node set job start time; then calculate the deviation value θ between the running time submitted by the user’s job history and the actual running time, as shown in the following formula:

其中，分别为用户提交的第i个作业提交运行时间和实际运行时间，m为该用户在网格系统提交的作业数；本实施例设定其在某节点的偏离值θ<0.2时，作业调度决策所用的时间为用户提交的作业信息中的运行时间，否则采用本专利第二步预测的作业运行时间做为调度依据。如果该用户是第一次提交作业，则以本专利预测时间为作业调度时间。基于此调度时间计算方法与作业在节点的可开始执行时间，本专利依次计算可计算节点集中各节点作业完成时间job_node_endtime。然后，依据作业在各可计算节点上完成时间判断其是否满足作业运行时间期限，如满足把此节点放入可调度节点集。最后，本专利从可调度节点集中选择资源利用率最低的节点，把作业调度到此网格节点。in, are the submitted running time and actual running time of the i-th job submitted by the user, m is the number of jobs submitted by the user in the grid system; in this embodiment, when the deviation value θ<0.2 of a certain node, the job scheduling decision The time used is the running time in the job information submitted by the user, otherwise the job running time predicted in the second step of this patent is used as the scheduling basis. If the user is submitting a job for the first time, the predicted time of this patent is used as the job scheduling time. Based on the scheduling time calculation method and the start execution time of the job at the node, this patent calculates the job completion time job_node_endtime of each node in the computable node set sequentially. Then, according to the completion time of the job on each computable node, it is judged whether it meets the job running time limit, and if so, put this node into the schedulable node set. Finally, this patent selects the node with the lowest resource utilization rate from the set of schedulable nodes, and schedules jobs to this grid node.

第六步系统负责分配作业到相应计算节点，包括作业计算请求、作业数据传输、作业执行状态查询、调度信息反馈等。然后，判断作业队列中所有作业是否调度完毕，如调度完毕则等待下一个调度点。否则，完毕则从作业队列中取作业，再重复第四步、五步和六步以实现作业调度。The sixth step is that the system is responsible for assigning jobs to corresponding computing nodes, including job computing requests, job data transmission, job execution status query, and scheduling information feedback. Then, it is judged whether all the jobs in the job queue have been scheduled, and if the scheduling is completed, the next scheduling point is waited for. Otherwise, after completion, take the job from the job queue, and then repeat the fourth, fifth and sixth steps to realize job scheduling.

Claims

1. An application-aware large-scale grid job scheduling method is characterized by comprising the following steps:

step 1: a user submits a job to a grid system, the grid system stores job information into a corresponding database grid job table, and then the job is inserted into a job queue based on a multithreading sharing mechanism according to a processing condition, wherein the job queue comprises a ready queue, a running queue and a result feedback queue, the ready queue is used for arranging and scheduling jobs to be processed, the running queue is used for arranging and scheduling jobs to be run at a certain computing node, and the result feedback queue is used for arranging jobs after running and storing results returned from the nodes in the queue;

step 2: according to a pre-established user operation running time model, carrying out running time prediction on the operation submitted by a user;

and step 3: inquiring the current state information of the large-scale distributed grid nodes in real time, and storing the current state information into a grid node resource information table of a database;

and 4, step 4: searching a grid node which can be used for operation calculation according to the application requirement of the operation, namely a calculation node;

and 5: searching nodes meeting the operation time limit of the user operation on the calculable node set found in the step 4 as a schedulable node set, and then selecting the node with the lowest resource utilization rate from the schedulable node set as a grid node for executing the operation;

step 6: distributing the operation to the grid nodes obtained in the step 5;

and 7: judging whether all the jobs in the job queue are scheduled or not, and waiting for the next scheduling point if the scheduling is finished; otherwise, taking the job from the job queue and returning to the step 4 for circular execution.

2. The method according to claim 1, wherein in step 1, the job information includes user ID, job ID, application software requirement, version number, License, number of nodes, number of CPUs, number of many cores, running time, job data amount, and expected completion time.

3. The application-aware large-scale grid job scheduling method according to claim 1, wherein in step 2, the user job running time model is a historical database based on system historical jobs running at each computing node, and is described as:<Job_i,Time_i,j>(ii) a Wherein Job_iThe application characteristics of (1) include the data volume Jd (Job)_i) Job size Js (Job)_i) (ii) a And pass throughThe Job Job to be scheduled and the historical Job Job are calculated by the following formula_iProximity p on application features_i

ρ_i＝|Jd(job_i)-Jd(job)|+|Js(job_i)-Js(job)|

Taking the proximity rho_iMinimum historical job run Time_i,jAs the predicted run time of the job at each grid compute node.

4. The method according to claim 1, wherein in step 3, the current state information of the large-scale distributed grid nodes includes on-line operation workload, resource utilization rate, and node-supportable hardware and software information.

5. The application-aware large-scale grid job scheduling method according to claim 1, wherein the step 5 comprises the following steps:

step 5.1: initializing a data structure jobnode _ start time capable of calculating the node set operation start time according to the real-time node resource information;

step 5.2: calculating the deviation value theta of the user job history submission running time and the actual running time as shown in the following formula:

wherein,submitting the operation time and the actual operation time for the ith job submitted by the user respectively, wherein m is the number of jobs submitted by the user in the grid system;

step 5.3: if the deviation value theta is less than 0.2, the operation running time adopted by the operation scheduling decision is the running time in the operation information submitted by the user, otherwise, the operation running time predicted by the system in the step 2 is adopted as the basis of the scheduling decision;

step 5.4: sequentially calculating job completion time jobnode end time of each node in the calculable node set by using the job running time obtained in the step 5.3 and the start execution time of the job on the node;

step 5.5: judging whether the operation time limit is met or not according to the completion time of the operation on each calculable node, if so, putting the node into a schedulable node set;

step 5.6: and selecting the node with the lowest resource utilization rate from the schedulable node set, and scheduling the job to the grid node.

6. The method according to claim 1, wherein the step 6 of distributing the job to the corresponding computing node comprises the step of the system sending job calculation request, job data transmission, job execution status query and system scheduling information to the computing node for feedback to the user.