WO2020173553A1 - Decentralized job scheduling in a topology agnostic environment - Google Patents

Decentralized job scheduling in a topology agnostic environment Download PDF

Info

Publication number
WO2020173553A1
Authority
WO
WIPO (PCT)
Prior art keywords
job
compute
node
scheduling
network
Prior art date
Application number
PCT/EP2019/054753
Other languages
French (fr)
Inventor
Sanidhya AGGARWAL
Gowri LATHARANI NATARAJA
Varsha RAVEENDRAN
Varun VELAMURI
Original Assignee
Siemens Aktiengesellschaft
Priority date
Filing date
Publication date
Application filed by Siemens Aktiengesellschaft filed Critical Siemens Aktiengesellschaft
Priority to PCT/EP2019/054753 priority Critical patent/WO2020173553A1/en
Publication of WO2020173553A1 publication Critical patent/WO2020173553A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1012 Server selection for load balancing based on compliance of requirements or conditions with available server resources
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1029 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers using data related to the state of servers by a load balancer
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1036 Load balancing of requests to servers for services different from user content provisioning, e.g. load balancing across domain name servers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/12 Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks

Definitions

  • Various embodiments of the invention relate to providing a system and a method for decentralized scheduling of jobs in a topology agnostic environment.
  • a threshold of computing technology generation has been reached, the objective of which is autonomous self-controlled machines that are driven by software processes endowed with artificial intelligence that evolve through insights from big data.
  • About eight billion devices were connected to the internet in the year two thousand and sixteen, which are expected to increase to one trillion by the year two thousand and thirty. Consequently, organizations at present have a challenge of sustaining the technological and market viability of their products by adapting to, and growing with, rapidly advancing computing technology. With such a distributed set-up of compute nodes, it is imperative to have an efficient scheduling algorithm that can schedule user's jobs on the cluster of nodes.
  • a local scheduler is responsible for scheduling of jobs and managing resources at a single compute node.
  • a grid scheduler, also known as a meta-scheduler, is a global scheduler responsible for selecting an appropriate node and mapping jobs onto the selected site or domain.
  • Some of the desired properties and features typically required in a grid to provide users with a computing environment comprise heterogeneity, that is, a number of resources that are varied in nature; scalability, that is, tolerance to handle a large number of nodes without any performance degradation; adaptability, that is, the ability to reschedule jobs in cases of unexpected computational aborts, frequent hardware or software faults, etc.; and security, that is, protection of the participating resources from any malicious manipulations or interventions.
  • Widely adopted methodologies for job scheduling include centralized scheduling, where meta-information about all resources is maintained and all jobs are received centrally. Based on the queue of jobs submitted and the meta-information about all the resources, the centralized scheduling is then performed.
  • the centralized scheme is not scalable because of the amount of information required to be maintained. This scheme also does not facilitate the use of priority schemes. This can adversely affect local jobs while benefiting remote jobs.
  • Hierarchical scheduling involves sharing the scheduling process between the grid scheduler and the local scheduler. All jobs are submitted to the grid scheduler and the grid scheduler resides on a single compute node. However, unlike the centralized scheme, jobs are not maintained in the grid scheduler queue until dispatch time. Instead, at submission time, the grid scheduler sends the job to the node at which the earliest start time is expected for it. Each site maintains a local queue from which it schedules jobs for execution. Different nodes may use different scheduling policies. However, once a job is submitted to a local scheduler, the grid scheduler has no further direct influence on the scheduling of the job, and the job cannot be moved to another site even if the load at the other site becomes lower at some time in the future.
  • Distributed scheduling involves a grid scheduler at every node receiving jobs on the node where the job originates.
  • the grid schedulers query each other periodically to collect instantaneous load information.
  • in case any of the other nodes has a lower load, the job is transferred to the site with the lowest load.
  • the distributed scheme is more scalable than the hierarchical and centralized schemes.
  • however, a prior know-how of the topology amongst the nodes is required.
  • the nodes used herein refer to various devices that are employed in a network.
  • in a fast-changing topology of nodes, none of the afore-mentioned scheduling schemes offers a robust and reliable solution.
  • the scheduling client disclosed herein is employed on a compute node deployable in a network of compute nodes.
  • compute node refers to any device having a computational capability, for example, a field device such as a programmable logic controller (PLC), a human machine interface (HMI), an intelligent electronic device (IED), etc., in communication with one or more compute nodes over a network.
  • the compute node in one aspect, is an edge device such as an Internet of Things (IoT) device, a RaspberryPi, a server, a client, etc.
  • the term "network" refers to a communication network, for example, a wired network, a wireless network, a communication network, or a network formed from any combination of these networks.
  • the network may be a private network or a public network comprising clusters of compute nodes therewithin.
  • the compute nodes may be clustered based on one or more parameters such as geography, the application for which the compute nodes are employed, the computational capability of the compute nodes, etc.
  • the network is a cloud computing environment.
  • “cloud computing environment” refers to a processing environment comprising configurable compute nodes, for example, servers, storage, applications, services, etc., and data distributed over the network.
  • the cloud computing environment provides on-demand network access to a shared pool of the compute nodes.
  • the scheduling client is an application deployable on a compute node.
  • the scheduling client is an application accessible by the compute node as a service via the network.
  • the scheduling client is dedicated per compute node.
  • the scheduling client schedules a job in the network.
  • the term "job" refers to a task to be performed by a compute node.
  • the job comprises processing data received from a sensor with which the compute node such as a PLC is communicating.
  • the job thus comprises data associated with one or more of instructions, a resource limit required for execution of the instructions, and meta-data of the job.
  • the scheduling client comprises a job execution module, a peer discovery module, and a consensus module.
  • the scheduling client also comprises a user interface for rendering and receiving a job published onto the network.
  • the scheduling client also comprises a network interface for establishing an interface into the network.
  • the scheduling client also comprises databases, for example, a node database and a scheduling database.
  • databases refer to a structured query language (SQL) data store or a not only SQL (NoSQL) data store.
  • the databases can be a location on a file system that is directly accessible.
  • the database is configured as a cloud-based database implemented in a cloud computing environment, where computing resources are delivered as a service over the network.
  • the node database comprises peer node data.
  • peer node data refers to data associated with compute nodes in the network such as identifiers of each of the compute nodes, addresses of each of the compute nodes, etc.
  • the job execution module of the scheduling client uses the peer node data for broadcasting a job onto the network of the compute nodes across clusters.
  • the peer discovery module of the scheduling client uses the peer node data to rejoin the network after a hiatus, such as a restart of the compute node, etc.
  • the peer discovery module of the scheduling client identifies compute nodes that are part of a cluster and updates the peer node data accordingly.
  • the scheduling client comprises a health monitoring module for communicating the health of the compute node on which the scheduling client is employed.
  • the health monitoring module updates the peer node data with healthy compute nodes, thereby eliminating availability of unhealthy compute nodes for scheduling of jobs.
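To make the role of the health monitoring module concrete, the following is a minimal Python sketch, assuming an illustrative HealthMonitor class, a plain dictionary as the node database, and a simple heartbeat-timeout mechanism; these names and values are assumptions for illustration and not prescribed by the disclosure.

```python
import time


class HealthMonitor:
    """Tracks peer heartbeats and prunes unhealthy nodes from the peer node data."""

    def __init__(self, node_db, timeout_s=30.0):
        self.node_db = node_db          # dict: node_id -> {"address": ..., "last_seen": ...}
        self.timeout_s = timeout_s      # assumed freshness window (illustrative)

    def record_heartbeat(self, node_id, address):
        # Called whenever a health broadcast from a peer is received.
        self.node_db[node_id] = {"address": address, "last_seen": time.time()}

    def healthy_peers(self):
        # Only peers seen within the timeout window are eligible for scheduling.
        now = time.time()
        return {nid: info for nid, info in self.node_db.items()
                if now - info["last_seen"] <= self.timeout_s}

    def prune(self):
        # Drop stale entries so unhealthy nodes are not offered jobs.
        alive = self.healthy_peers()
        self.node_db.clear()
        self.node_db.update(alive)


node_db = {}
monitor = HealthMonitor(node_db, timeout_s=30.0)
monitor.record_heartbeat("plc-17", "10.0.0.17")
monitor.prune()
print(monitor.healthy_peers())
```

Peers that have not broadcast a heartbeat within the assumed timeout window are removed, so only healthy compute nodes remain available for scheduling.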
  • the node database further comprises a list of seed compute nodes, which are compute nodes used by every new compute node joining the network for the first time.
  • the scheduling database comprises a job pool storing incoming jobs in the network which are yet to be scheduled, that is, assigned to a compute node for execution.
  • the job pool advantageously stores the time of receipt of a job onto the network.
  • the scheduling database further comprises a resource pool storing resource availability data.
  • resource availability data refers to data associated with computing resources available for execution of a job at a compute node on which the scheduling client is employed.
  • the resource availability data comprises, for example, a list of compute nodes in the network along with their respective available resources.
  • the resource pool is advantageously configured as a heap-shaped data structure having a compute node with maximum available resources stored on top of the heap.
  • the resource pool also comprises job allocation data, for example, data associated with a job being scheduled in a network or in a cluster of compute nodes in the network, and the scheduled job completing execution at the assigned compute node.
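A heap-shaped resource pool of this kind can be sketched as follows. This is an illustrative Python sketch using the standard heapq module (a min-heap storing negated capacities so the richest node surfaces first); the class and method names are assumptions, and stale-entry handling is omitted for brevity.

```python
import heapq


class ResourcePool:
    """Max-heap of (available_resources, node_id); the richest node sits on top."""

    def __init__(self):
        self._heap = []                         # heapq is a min-heap, so capacities are negated

    def update(self, node_id, available):
        heapq.heappush(self._heap, (-available, node_id))

    def peek_best(self):
        # Node with maximum available resources (top of the heap); assumes a non-empty pool.
        available, node_id = self._heap[0]
        return node_id, -available

    def pop_best(self):
        available, node_id = heapq.heappop(self._heap)
        return node_id, -available


pool = ResourcePool()
pool.update("hmi-02", 4)
pool.update("plc-09", 16)
print(pool.peek_best())   # ('plc-09', 16)
```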
  • when a scheduling client notices that a job is assigned to a compute node which has become unhealthy in the course of time, the job execution module of the compute node broadcasts the job onto the network as if broadcasted for the first time.
  • the scheduling clients at each of the compute nodes eliminate duplication of entries of such jobs broadcasted into the respective job pools of their scheduling databases.
  • the job execution module of the scheduling client receives a job for execution at the compute node.
  • the job execution module comprises a transaction module configured to generate a transaction based on the job.
  • the transaction comprises the job, resource utilization limit for the job, and metadata associated with the job, for example, an identifier of the job.
  • the job execution module stores the transaction in the scheduling database, that is, in the job pool of the scheduling database.
  • each compute node comprises an ever-growing immutable record of jobs received in a cluster or in a network in which the compute node is participating.
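The transaction structure and the append-only job pool described above might look roughly like the sketch below. The Transaction fields mirror the job, resource utilization limit, and metadata named in the disclosure, while the class names, UUID identifiers, and in-memory storage are illustrative assumptions, not the claimed implementation.

```python
import time
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    """One job wrapped as an immutable transaction."""
    instructions: str                               # the job itself
    resource_limit: int                             # resource utilization limit for the job
    metadata: dict = field(default_factory=dict)    # tags, type, originating node, etc.
    job_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    received_at: float = field(default_factory=time.time)


class JobPool:
    """Append-only record of transactions; duplicate broadcasts are ignored."""

    def __init__(self):
        self._records = []
        self._seen = set()

    def add(self, tx: Transaction) -> bool:
        if tx.job_id in self._seen:
            return False                            # duplicate broadcast, keep entries unique
        self._seen.add(tx.job_id)
        self._records.append(tx)                    # records are only ever appended
        return True

    def pending(self):
        return list(self._records)


pool = JobPool()
tx = Transaction("process sensor batch 42", resource_limit=2, metadata={"type": "analytics"})
pool.add(tx)
pool.add(tx)                                        # second broadcast is dropped
print(len(pool.pending()))                          # 1
```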
  • the peer discovery module receives the peer node data from the node database.
  • the job execution module is in communication with the peer discovery module and uses the identifiers of the compute nodes in the peer node data to obtain resource availability data of each of the compute nodes in the network by accessing the scheduling database.
  • the job execution module determines an executing compute node from the compute nodes in the network for executing the job based on the resource availability data.
  • in one aspect, the job execution module retrieves a compute node stored at the top of the heap structure in the resource pool of the scheduling database, for execution of the job.
  • in another aspect, the job execution module retrieves a compute node from the resource pool of the scheduling database based on the amount of available resources of a compute node and the amount of resources required for execution of the job.
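The second aspect, matching a job's required resources against the available resources of the compute nodes, can be illustrated with a short sketch. The function name and the dictionary representation of the resource availability data are assumptions.

```python
def select_executing_node(resource_availability, required):
    """Pick the node with the most headroom that still covers the job's requirement.

    resource_availability: dict mapping node_id -> available resource units
    required: resource units the job needs
    Returns a node_id, or None if no node currently fits.
    """
    candidates = [(avail, node) for node, avail in resource_availability.items()
                  if avail >= required]
    if not candidates:
        return None
    best_avail, best_node = max(candidates)        # largest availability wins
    return best_node


availability = {"plc-09": 16, "hmi-02": 4, "ied-21": 8}
print(select_executing_node(availability, required=6))   # 'plc-09'
```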
  • the consensus module participates in a consensus with the compute nodes of the network for updating the resource availability data stored in the scheduling database.
  • the consensus module is responsible for participating in consensus with all compute nodes periodically and accordingly updating the resource pool in the scheduling databases of all compute nodes across clusters.
  • the consensus happens irrespective of whether there is a job received at a compute node for scheduling. That is, via consensus it is ensured that the resource availability data remains updated across clusters of the network for any job to be scheduled at any point at any compute node.
  • the consensus module participates in a consensus periodically, for example, after every ten seconds, thereby ensuring that the scheduling databases at each compute node remain updated after the job scheduling has happened; thus, the executing compute node selection is performed in a decentralized manner yet remains highly accurate.
  • the consensus module comprises a stake identification module obtaining a stake from each of the compute nodes based on the scheduling history of the compute node.
  • the stake identification module employs one or more algorithms comprising, for example, a proof-of-stake algorithm in which a winner is decided based on the stake that a compute node holds.
  • the stake is determined based on the number of times a compute node has participated in scheduling of jobs, the number of jobs that a compute node has executed, the last time a compute node scheduled a job, etc.
  • the stake identification module selects the compute node with the highest stake. This winning compute node is responsible for updating the resource pool of the scheduling databases across clusters of the compute nodes, ensuring the heap structure accuracy is consistently maintained in all compute nodes.
  • if a compute node has not participated in consensus for a significant amount of time, the compute node, while participating in consensus after the hiatus, retrieves updated data from the resource pool and the job pool of the scheduling database.
  • the compute node achieves this by comparing the current timestamp with the timestamp of the last update made to the resource pool.
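A toy version of the stake-based winner selection and the post-hiatus catch-up check is sketched below. The stake weights, the ten-second hiatus threshold borrowed from the example above, and all function names are illustrative assumptions rather than the claimed algorithm.

```python
import time


def stake(history):
    """Toy stake score computed from a node's scheduling history (weights are illustrative)."""
    recency = 1.0 / (1.0 + time.time() - history["last_scheduled_at"])
    return (2.0 * history["jobs_scheduled"]
            + 1.0 * history["jobs_executed"]
            + 10.0 * recency)


def elect_winner(histories):
    """The node with the highest stake updates the shared resource pool."""
    return max(histories, key=lambda node_id: stake(histories[node_id]))


def needs_catch_up(last_pool_update_at, hiatus_s=10.0):
    """A node that missed consensus rounds refreshes its pools before voting again."""
    return time.time() - last_pool_update_at > hiatus_s


now = time.time()
histories = {
    "plc-09": {"jobs_scheduled": 12, "jobs_executed": 30, "last_scheduled_at": now - 5},
    "hmi-02": {"jobs_scheduled": 3, "jobs_executed": 7, "last_scheduled_at": now - 120},
}
print(elect_winner(histories))          # 'plc-09'
print(needs_catch_up(now - 42))         # True
```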
  • the job execution module processes the job at the executing compute node, that is, assigns the executing compute node to execute the scheduled job.
  • the job execution module updates the databases including the node database and the scheduling database, for example, by adding data associated with the compute node that won the consensus, by removing the job from the job pool, and by updating the resource pool with the job allocation details.
  • also disclosed herein is a compute node deployable in an automation environment comprising the aforementioned scheduling client.
  • the compute node is in communication with one or more compute nodes forming the aforementioned network.
  • Each of the compute nodes processes a job broadcasted on the network.
  • the compute nodes comprise, for example, PLCs or HMIs in an industrial automation environment, IEDs in a substation automation environment, etc.
  • the compute node is a seed node.
  • seed node refers to a compute node used by other compute nodes while joining a network for the first time and/or after crashes, repairs, maintenance, etc.
  • at least one seed node is present in a network.
  • the aforementioned scheduling client on each compute node is pre-configured with a list of seed nodes.
  • the list comprises, for example, IP addresses and/or DNS names of the seed nodes that can be used by the scheduling client to connect into a network.
  • the compute node registers itself with the seed node and requests a list of compute nodes already registered with the seed node. This list is then stored by the scheduling client employed on the compute node as peer node data in the node database.
  • a compute node may use the peer node data while rejoining the network instead of approaching the seed node, thereby reducing the computational load on the seed node.
  • the scheduling client on the compute node broadcasts its resource availability to all the compute nodes on the network using the peer node data.
  • the scheduling clients on other compute nodes listen to these broadcasts and update their peer node data accordingly.
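The seed-node registration and the later reuse of peer node data could be sketched as follows. SeedNode, PeerDiscovery, and the in-memory registry are assumed names and simplifications; a real deployment would exchange these messages over the network interface rather than in-process.

```python
class SeedNode:
    """Keeps a registry of compute nodes that have joined the network at least once."""

    def __init__(self):
        self.registry = {}                     # node_id -> address

    def register(self, node_id, address):
        self.registry[node_id] = address
        return dict(self.registry)             # the returned list becomes the peer node data


class PeerDiscovery:
    """On first join, a node registers with a seed; later joins reuse stored peer node data."""

    def __init__(self, seed_addresses):
        self.seed_addresses = seed_addresses   # pre-configured IPs/DNS names of seed nodes
        self.peer_node_data = {}

    def join(self, node_id, address, seed: SeedNode):
        if not self.peer_node_data:            # first join (or data lost): ask the seed node
            self.peer_node_data = seed.register(node_id, address)
        return self.peer_node_data             # re-joins reuse this, sparing the seed node


seed = SeedNode()
discovery = PeerDiscovery(seed_addresses=["10.0.0.1"])
print(discovery.join("plc-09", "10.0.0.9", seed))
```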
  • Also disclosed herein is a method for scheduling a job.
  • the method employs a scheduling client deployable on a compute node in a network of compute nodes.
  • the method comprises receiving a job for execution at the compute node.
  • the method receives the job via a user interface of the scheduling client.
  • a user using the compute node defines a job using the user interface of the scheduling client.
  • the method comprises obtaining resource availability data from each of the compute nodes in the network. For obtaining the resource availability data, the method comprises receiving peer node data from the node database, and obtaining a compute capability of each of the compute nodes from the scheduling database.
  • the method comprises determining an executing compute node from the compute nodes in the network for executing the job based on the resource availability data.
  • the method comprises selecting a compute node as the executing compute node based on a compute capability required for execution of the job.
  • the method comprises selecting a compute node as an executing compute node based on the highest available compute capability of a compute node.
  • the method comprises processing the job at the executing compute node, that is, assigning the job to the executing compute node for execution.
  • the method comprises participating in a consensus with the compute nodes of the network for updating the resource availability data.
  • the method obtains a stake from each of the compute nodes based on the scheduling history of the compute node and determines a winning compute node.
  • This winning compute node is a compute node which has the maximum stake, that is, the maximum relevant data associated with resource availability for a given job.
  • the scheduling client at this winning compute node then updates the resource pool of its scheduling database with the resource availability data, which is replicated across the scheduling databases of the scheduling clients at all compute nodes as immutable records.
  • the method comprises updating the scheduling database and the node database as and when data associated with the job and the compute nodes changes.
  • the data comprises, for example, allocation of a job to an executing compute node, completion of execution of a job, a compute node that isn't healthy for participating in a consensus and/or execution of a job, etc.
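Putting the claimed method steps together, a highly simplified, single-process sketch might look as follows. The consensus step is omitted here (see the consensus sketches elsewhere in this description), and the function name and dictionary-based data structures are assumptions, not the disclosed implementation.

```python
def schedule_job(job, required, node_db, resource_pool, job_pool):
    """One pass of the method under the simplifying assumptions above.

    job          : opaque job payload
    required     : resource units the job needs
    node_db      : node_id -> address (peer node data, healthy nodes only)
    resource_pool: node_id -> available resource units
    job_pool     : list of jobs still waiting to be scheduled
    """
    job_pool.append(job)                                           # receive the job
    candidates = {n: resource_pool[n] for n in node_db             # obtain resource
                  if n in resource_pool}                           # availability data
    executing = max(candidates, key=candidates.get, default=None)  # determine executing node
    if executing is None or candidates[executing] < required:
        return None                                                # nothing fits yet
    job_pool.remove(job)                                           # process the job and
    resource_pool[executing] -= required                           # update the databases
    return executing


node_db = {"plc-09": "10.0.0.9", "hmi-02": "10.0.0.2"}
resource_pool = {"plc-09": 16, "hmi-02": 4}
print(schedule_job("analyse vibration data", 6, node_db, resource_pool, []))   # 'plc-09'
```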
  • also disclosed herein is a computer program product comprising machine-readable instructions stored therein, that when executed by one or more processors, cause the one or more processors to perform the aforementioned method for scheduling a job.
  • the scheduling client, the method, the computer program product, and the compute node disclosed herein enable real-time decentralized scheduling of jobs without a requirement of prior knowledge of the network topology, as the scheduling databases at every compute node are kept in sync via a process of consensus, thereby eliminating the need for a central server or for homogeneous compute nodes and thereby providing fault-tolerant, distributed, and secure job scheduling.
  • FIG 1 schematically illustrates a network of compute nodes.
  • FIG 2A schematically illustrates an architecture of a compute node illustrated in FIG 1 employing a scheduling client for scheduling a job.
  • FIG 2B schematically illustrates a scheduling client illustrated in FIG 2A for scheduling a job.
  • FIG 3 is a flowchart illustrating a method performed by the scheduling client illustrated in FIGS 2A-2B for scheduling a job.
  • FIG 1 schematically illustrates a network 101 of compute nodes 102.
  • These compute nodes 102 represent an industrial plant environment.
  • the network 101 includes clusters 101A, 101B, 101C of compute nodes 102 separated, for example, based on geography, compute capabilities, etc.
  • the compute nodes 102 are industrial devices supporting Internet of Things (IoT).
  • the cluster 101A has HMIs 102 in communication with one another.
  • the cluster 101B has PLCs 102 from a section A of an industrial plant in communication with one another.
  • the cluster 101C has PLCs 102 from a section B of the industrial plant in communication with one another.
  • Each of the compute nodes 102 is operably communicating with a transaction pool 103 having transactions 104 stored therein.
  • the transaction pool 103 has transactions 104 incoming into the industrial plant. Thus, all compute nodes 102 have an equal right to access each transaction 104.
  • the transaction 104 comprises a job defined for example by a plant operator, a resource limit required for execution of the job, meta-data such as tags, a type, etc., of the job, the compute node 102 on which the job was defined, and a unique identifier for the job.
  • FIG 2A schematically illustrates an architecture of a compute node 102 illustrated in FIG 1 employing a scheduling client 200 for scheduling a job.
  • the compute node 102 is programmable using a high-level computer programming language.
  • the compute node 102 may be implemented using programmed and purposeful hardware.
  • the compute node 102 comprises a processor 102A, a non-transitory computer readable storage medium such as a memory unit 102B for storing programs and data, an input/output (I/O) controller 102C, a network interface 102D, a data bus 102E, a display unit 102F, input devices 102G, a fixed media drive 102H such as a hard drive, a removable media drive 102I for receiving removable media, output devices 102J, and a scheduling client 200.
  • the processor 102A refers to any one of microprocessors, central processing unit (CPU) devices, finite state machines, microcontrollers, digital signal processors, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), etc., or any combination thereof, capable of executing computer programs or a series of commands, instructions, or state transitions.
  • the processor 102A may also be implemented as a processor set comprising, for example, a general-purpose microprocessor and a math or graphics co-processor.
  • the processor 102A is selected, for example, from the Intel® processors, Advanced Micro Devices (AMD®) processors, International Business Machines (IBM®) processors, etc.
  • the compute node 102 disclosed herein is not limited to employing a processor 102A.
  • the compute node 102 may also employ a controller or a microcontroller.
  • the processor 102A executes the modules, for example, 201, 201A, 202, 202A, 203, 204, 205, etc., of the scheduling client 200.
  • the memory unit 102B is used for storing programs, applications, and data.
  • the modules 201-205, the databases 206 and 207, of the scheduling client 200 are stored in the memory unit 102B of the compute node 102.
  • the memory unit 102B is, for example, a random-access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by the processor 102A.
  • the memory unit 102B also stores temporary variables and other intermediate information used during execution of the instructions by the processor 102A.
  • the compute node 102 further comprises a read only memory (ROM) or another type of static storage device that stores static information and instructions for the processor 102A.
  • the I/O controller 102C controls input actions and output actions performed by the scheduling client 200.
  • the network interface 102D enables connection of the compute node 102 to the network 101.
  • the scheduling client 200 of the compute node 102 connects to the network 101 via the network interface 102D.
  • the network interface 102D is provided as an interface card also referred to as a line card.
  • the network interface 102D comprises, for example, interfaces using serial protocols, interfaces using parallel protocols, and Ethernet communication interfaces, interfaces based on wireless communications technology such as satellite technology, radio frequency (RF) technology, near field communication, etc.
  • the data bus 102E permits communications between the modules, for example, 201-205, etc., of the scheduling client 200.
  • the display unit 102F, via the user interface 205, displays information, for example, that of a job, with the help of user interface elements such as text fields, buttons, windows, etc., for allowing a user to define a job.
  • the display unit 102F comprises, for example, a liquid crystal display, a plasma display, an organic light emitting diode (OLED) based display, etc.
  • the input devices 102G are used for inputting data into the compute node 102.
  • the input devices 102G are, for example, a keyboard such as an alphanumeric keyboard, a touch sensitive display device, and/or any device capable of sensing a tactile input.
  • Computer applications and programs are used for operating the compute node 102.
  • the programs are loaded onto the fixed media drive 102H and into the memory unit 102B of the compute node 102 via the removable media drive 102I.
  • the computer applications and programs may be loaded directly via the network 101.
  • the output devices 102J output the results of operations performed by the compute node 102.
  • a compute node 102, which is a PLC, outputs a time series command for opening and closing of valves.
  • the processor 102A executes an operating system, for example, the Linux ® operating system, the Unix ® operating system, any version of the Microsoft ® Windows ® operating system, the Mac OS of Apple Inc., the IBM ® OS/2, etc.
  • the compute node 102 employs the operating system for performing multiple tasks.
  • the operating system is responsible for management and coordination of activities and sharing of resources of the compute node 102.
  • the operating system further manages security of the compute node 102, peripheral devices connected to the compute node 102, and network connections.
  • the operating system employed on the compute node 102 recognizes, for example, inputs provided by the users using one of the input devices 102G, the output display, files, and directories stored locally on the fixed media drive 102H.
  • the operating system on the compute node 102 executes different programs using the processor 102A.
  • the processor 102A and the operating system together define a computer platform for which application programs in high level programming languages are written.
  • the processor 102A of the compute node 102 retrieves instructions defined by the modules 201-205 of the scheduling client 200, for performing respective functions disclosed in the detailed description of FIG 2B.
  • the processor 102A retrieves instructions for executing the modules, for example, 201-205, of the scheduling client 200 from the memory unit 102B.
  • a program counter determines the location of the instructions in the memory unit 102B.
  • the program counter stores a number that identifies the current position in the program of each of the modules, for example, 201-205, etc., of the scheduling client 200.
  • the instructions fetched by the processor 102A from the memory unit 102B, after being processed, are decoded.
  • the instructions are stored in an instruction register in the processor 102A. After processing and decoding, the processor 102A executes the instructions, thereby performing one or more processes defined by those instructions.
  • the instructions stored in the instruction register are examined to determine the operations to be performed.
  • the processor 102A then performs the specified operations.
  • the operations comprise arithmetic operations and logic operations.
  • the operating system performs multiple routines for performing several tasks required to assign the input devices 102G, the output devices 102J, and memory for execution of the modules, for example, 201-205, etc.
  • the tasks performed by the operating system comprise, for example, assigning memory to the modules, for example, 201-205, and to data used by the modules, for example, 201-205, moving data between the memory unit 102B and disk units, and handling input/output operations.
  • the operating system performs the tasks on request by the operations and after performing the tasks, the operating system transfers the execution control back to the processor 102A.
  • the processor 102A continues the execution to obtain one or more outputs.
  • the outputs of the execution of the modules, for example, 201-205, etc., of the scheduling client 200.
  • the scheduling client 200 is run locally on the compute node 102; however, the scope of the present invention is not limited to the scheduling client 200 being run locally on the compute node 102 via the operating system and the processor 102A, but may be extended to run remotely over the network 101 by employing a web browser and a remote server, a mobile phone, or other electronic devices.
  • One or more portions of the scheduling client 200 may be distributed across one or more compute nodes 102 coupled to the network 101.
  • the computer program product comprises a first computer program code for receiving a job for execution at the compute node 102; a second computer program code for obtaining resource availability data from each of the compute nodes 102 in the network 101; a third computer program code for determining an executing compute node 102 from the compute nodes 102 in the network 101 for executing the job based on the resource availability data; a fourth computer program code for participating in a consensus with the compute nodes 102 of the network 101, for updating the resource availability data; and a fifth computer program code for processing the job at the executing compute node 102.
  • the second computer program code comprises a sixth computer program code for receiving peer node data from a node database 206 of the scheduling client 200 and a seventh computer program code for obtaining a compute capability of each of the compute nodes 102.
  • the third computer program code comprises an eighth computer program code for selecting a compute node 102 with a compute capability required for execution of the job.
  • the fourth computer program code comprises a ninth computer program code for obtaining a stake from each of the compute nodes 102 based on the scheduling history of the compute node 102.
  • a single piece of computer program code comprising computer executable instructions performs one or more steps of the method according to the present invention, for scheduling a job in a network.
  • the computer program codes comprising computer executable instructions are embodied on the non-transitory computer readable storage medium.
  • the processor 102A of the compute node 102 retrieves these computer executable instructions and executes them. When the computer executable instructions are executed by the processor 102A, the computer executable instructions cause the processor 102A to perform the steps of the method.
  • FIG 2B schematically illustrates a scheduling client 200 illustrated in FIG 2A for scheduling a job.
  • the scheduling client 200 is installed on and accessible by a compute node 102 illustrated in FIG 1, for scheduling a job.
  • the scheduling client 200 is in communication with scheduling clients 200 installed on other compute nodes 102, via a network 101.
  • the scheduling client 200 is downloadable and usable on the compute node 102, or is configured as a web-based platform, for example, a website hosted on a server or a network of servers, or is implemented in the cloud computing environment as a cloud computing-based platform implemented as a service for scheduling a job.
  • the scheduling client 200 comprises a non-transitory computer readable storage medium and at least one processor communicatively coupled to the non-transitory computer readable storage medium.
  • non-transitory computer readable storage medium refers to all computer readable media, for example, non-volatile media, volatile media, and transmission media, except for a transitory, propagating signal.
  • the non-transitory computer readable storage medium is configured to store computer program instructions defined by modules, for example, 201, 202, 203, 204, 205, etc.
  • the processor is configured to execute the defined computer program instructions.
  • the scheduling client 200 comprises a user interface 205.
  • a user using the compute node 102 can access the scheduling client 200 via the user interface 205.
  • the scheduling client 200 comprises a job execution module 201, a consensus module 202, a peer discovery module 203, a health monitoring module 204, a user interface 205, a node database 206, and a scheduling database 207.
  • the job execution module 201 receives a job for execution at a compute node 102.
  • the job execution module 201 comprises a transaction module 201A.
  • the transaction module 201A generates a transaction 104 based on the job.
  • the transaction module 201A stores each transaction 104 in a job pool of the scheduling database 207.
  • the job execution module 201 is in communication with the peer discovery module 203.
  • the peer discovery module 203 receives peer node data from the node database 206 or from the compute nodes 102 that are healthy.
  • the health monitoring module 204 of the scheduling client 200 periodically broadcasts the health of the compute node 102 on the network 101.
  • the peer node data comprises identifiers for each of the healthy compute nodes 102 of the network 101.
  • the job execution module 201 establishes communication with each of the compute nodes 102 using the peer node data and obtains their respective resource availability data.
  • the job execution module 201 determines an executing compute node 102 from the compute nodes 102 in the network 101 for executing the job based on the resource availability data.
  • the consensus module 202 participates in a consensus with the compute nodes 102 of the network 101, for updating resource availability data post scheduling of the job.
  • the consensus module 202 comprises a stake identification module 202A.
  • the stake identification module 202A obtains a stake from each of the compute nodes 102 based on the scheduling history of the compute node 102.
  • the consensus module 202 thus selects a winning compute node 102 that has the maximum stake in the network, for example, a compute node that has scheduled the maximum number of jobs. All the participating compute nodes 102 update their respective resource pools in their respective scheduling databases with the data being broadcasted by the winning compute node 102, thereby ensuring that the resource availability data is updated.
  • the job execution module 201 processes the job at the executing compute node 102.
  • the job execution module 201, after execution of the job, removes the corresponding transaction 104 from the job pool of the scheduling database 207, updates a resource pool of the scheduling database 207 with the resource availability of the compute node 102, etc.
  • FIG 3 is a flowchart illustrating a method 300 performed by the scheduling client 200 illustrated in FIGS 2A-2B for scheduling a job.
  • the method 300 employs the scheduling client 200 illustrated in FIGS 2A-2B for scheduling a job received at a compute node 102.
  • the job execution module 201 receives a job at the compute node 102 on which the scheduling client 200 is employed.
  • the transaction module 201A of the job execution module 201 generates a transaction 104 using data associated with the job and stores the transaction 104 in a job pool of the scheduling database 207.
  • the peer discovery module 203 of the scheduling client 200 obtains peer node data from the node database 206 of the scheduling client 200.
  • the job execution module 201 broadcasts the job to all the compute nodes 102 listed in the peer node data.
  • the scheduling clients 200 on each of these compute nodes 102 are actively listening to the broadcasted jobs as long as the memory of their job pools is not exhausted.
  • Each of the scheduling clients 200 replicates the step 301A and stores the transactions 104 in their respective job pools within the scheduling databases 207.
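The broadcast and replication described here can be illustrated with the following sketch; the Node class, the fixed pool capacity, and the in-process broadcast loop are assumptions standing in for real network messaging between scheduling clients.

```python
class Node:
    """Minimal peer that listens for broadcast jobs while its job pool has room."""

    def __init__(self, node_id, pool_capacity=100):
        self.node_id = node_id
        self.pool_capacity = pool_capacity       # assumed memory limit of the job pool
        self.job_pool = {}                       # job_id -> transaction

    def on_broadcast(self, tx):
        if len(self.job_pool) >= self.pool_capacity:
            return                               # pool memory exhausted, stop accepting jobs
        self.job_pool.setdefault(tx["job_id"], tx)   # setdefault avoids duplicate entries


def broadcast(tx, peers):
    """Replicate a transaction into the job pool of every peer listed in the peer node data."""
    for peer in peers:
        peer.on_broadcast(tx)


peers = [Node("plc-09"), Node("hmi-02"), Node("ied-21")]
broadcast({"job_id": "j-001", "instructions": "aggregate sensor data"}, peers)
print([len(p.job_pool) for p in peers])          # [1, 1, 1]
```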
  • the job execution module 201 obtains resource availability data from each of the compute nodes 102 listed in the peer node data.
  • the health monitoring module 204 in the background periodically keeps the peer node data updated with healthy compute nodes 102.
  • the job execution module 201 accesses the resource pool in the scheduling database 207.
  • the resource pool stores resource availability data comprising data associated with computing resources available for execution of a job at a compute node 102, that is, a list of compute nodes 102 in the network 101 along with their respective available resources.
  • the resource pool is advantageously configured as a heap structure having a compute node 102 with maximum available resources stored on top of the heap.
  • the resource pool also comprises job allocation data, for example, data associated with a job being scheduled in a network 101 or in a cluster 101A, 101B, or 101C, of the network 101, and the scheduled job completing execution at an assigned compute node 102.
  • the job execution module 201 obtains the compute node 102 having maximum resource availability from the resource pool.
  • the job execution module 201 selects this compute node 102 as the executing compute node 102.
  • the job execution module 201 updates the job allocation data stored in the resource pool of the scheduling database 207 with the executing compute node 102, as well as the resource availability data in view of the job scheduled.
  • the consensus module 202 of the executing compute node 102 participates in a consensus with the consensus modules 202 of the other compute nodes 102 on the network 101.
  • the stake identification module 202A of the consensus module 202 obtains a stake from each of the participating compute nodes 102 based on one or more parameters associated with the scheduling history of the compute nodes 102, such as a number of times a compute node 102 has participated in scheduling of jobs, a number of jobs that a compute node 102 has scheduled, the last time a compute node 102 scheduled a job, etc.
  • an executing compute node 102 that has won the consensus broadcasts its data on the network.
  • the consensus modules 202 of the scheduling clients 200 on all participating compute nodes 102 update their respective scheduling databases 207 with a copy of the data published by the compute node 102 that won the consensus.
  • the resource availability data in the resource pool in the scheduling database of the scheduling client at every compute node is updated for next job to be scheduled.
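One consensus round as described in these steps, where the highest-stake node publishes its resource view and every participant replicates it, might be sketched as follows; the stake values and dictionary-based state are illustrative assumptions.

```python
def consensus_round(nodes):
    """One consensus round: the highest-stake node's resource view is copied everywhere.

    nodes: node_id -> {"stake": float, "resource_pool": {node_id: available}}
    """
    winner = max(nodes, key=lambda nid: nodes[nid]["stake"])
    published = dict(nodes[winner]["resource_pool"])     # winner broadcasts its data
    for state in nodes.values():
        state["resource_pool"] = dict(published)         # every participant replicates it
    return winner


nodes = {
    "plc-09": {"stake": 42.0, "resource_pool": {"plc-09": 10, "hmi-02": 4}},
    "hmi-02": {"stake": 7.5,  "resource_pool": {"plc-09": 16, "hmi-02": 4}},
}
print(consensus_round(nodes))                            # 'plc-09'
print(nodes["hmi-02"]["resource_pool"])                  # {'plc-09': 10, 'hmi-02': 4}
```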
  • the job execution module 201 processes, that is, assigns the executing compute node 102 for execution of the scheduled job.
  • the job execution module 201 updates the scheduling database 207 by removing the transaction from the job pool and updating the resource availability of the compute node in the resource pool of the scheduling database 207.
  • where databases are described, such as the node database 206 or the scheduling database 207, it will be understood by one of ordinary skill in the art that (i) alternative database structures to those described may be readily employed, and (ii) other memory structures besides databases may be readily employed. Any illustrations or descriptions of any sample databases disclosed herein are illustrative arrangements for stored representations of information. Any number of other arrangements may be employed besides those suggested by tables illustrated in the drawings or elsewhere. Similarly, any illustrated entries of the databases represent exemplary information only; one of ordinary skill in the art will understand that the number and content of the entries can be different from those disclosed herein.
  • databases may, in a known manner, be stored locally or remotely from a device that accesses data in such a database.
  • the databases may be integrated to communicate with each other for enabling simultaneous updates of data linked across the databases, when there are any updates to the data in one of the databases.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A scheduling client (200), employed on a compute node (102) deployable in a network (101) of compute nodes (102), for scheduling a job in the network (101) is provided. The scheduling client (200) comprises a job execution module (201) and a consensus module (202). The job execution module (201) receives a job for execution at the compute node (102), obtains resource availability data from each of the compute nodes (102) in the network (101), determines an executing compute node (102) from the compute nodes (102) in the network (101) for executing the job based on the resource availability data, and processes the job at the executing compute node (102). The consensus module (202) participates in a consensus with the compute nodes (102) of the network (101), for updating the resource availability data.

Description

Decentralized job scheduling in a topology agnostic environment
TECHNICAL FIELD
Various embodiments of the invention relate to providing a system and a method for decentralized scheduling of jobs in a topology agnostic environment.
BACKGROUND
In recent times, computer science has made exponential advances and is now a pervasive technology base in many spheres. A threshold of computing technology generation has been reached, the objective of which is autonomous self-controlled machines that are driven by software processes endowed with artificial intelligence that evolve through insights from big data. About eight billion devices were connected to the internet in the year two thousand and sixteen, which are expected to increase to one trillion by the year two thousand and thirty. Consequently, organizations at present have a challenge of sustaining the technological and market viability of their products by adapting to, and growing with, rapidly advancing computing technology. With such a distributed set-up of compute nodes, it is imperative to have an efficient scheduling algorithm that can schedule user's jobs on the cluster of nodes.
Typically, in industrial applications such as network management systems designed for field devices, the aforementioned centralization will soon become a bottleneck and new approaches will have to be devised. Conventionally, de-centralized algorithms are known to scale well compared to centralized algorithms and the architecture may have to be re-designed to accommodate the large number of field devices. Typically, data analytics jobs are compute intensive and a single edge device is not completely capable of performing all the tasks alone. Therefore, a framework that is independent of several compute nodes participating in a system, and that can support the distribution of jobs locally, is required.
Conventionally, two kinds of schedulers are known: a local scheduler and a grid scheduler. A local scheduler is responsible for scheduling of jobs and managing resources at a single compute node. A grid scheduler, also known as a meta-scheduler, is a global scheduler responsible for selecting an appropriate node and mapping jobs onto the selected site or domain. Some of the desired properties and features typically required in a grid to provide users with a computing environment comprise heterogeneity, that is, a number of resources that are varied in nature; scalability, that is, tolerance to handle a large number of nodes without any performance degradation; adaptability, that is, the ability to reschedule jobs in cases of unexpected computational aborts, frequent hardware or software faults, etc.; and security, that is, protection of the participating resources from any malicious manipulations or interventions. Widely adopted methodologies for job scheduling include centralized scheduling, where meta-information about all resources is maintained and all jobs are received centrally. Based on the queue of jobs submitted and the meta-information about all the resources, the centralized scheduling is then performed. The centralized scheme is not scalable because of the amount of information required to be maintained. This scheme also does not facilitate the use of priority schemes. This can adversely affect local jobs while benefiting remote jobs. Hierarchical scheduling involves sharing the scheduling process between the grid scheduler and the local scheduler. All jobs are submitted to the grid scheduler and the grid scheduler resides on a single compute node. However, unlike the centralized scheme, jobs are not maintained in the grid scheduler queue until dispatch time. Instead, at submission time, the grid scheduler sends the job to the node at which the earliest start time is expected for it. Each site maintains a local queue from which it schedules jobs for execution. Different nodes may use different scheduling policies. However, once a job is submitted to a local scheduler, the grid scheduler has no further direct influence on the scheduling of the job, and the job cannot be moved to another site even if the load at the other site becomes lower at some time in the future.
Distributed scheduling involves a grid scheduler at every node receiving jobs on the node where the job originates. The grid schedulers query each other periodically to collect instantaneous load information. In case any of the other nodes has a lower load, the job is transferred to the site with the lowest load. Since all jobs are submitted locally, the distributed scheme is more scalable than the hierarchical and centralized schemes. However, in such scheduling also, a prior know-how of the topology amongst the nodes is required. The nodes used herein refer to various devices that are employed in a network. Moreover, in a fast-changing topology of nodes, none of the afore-mentioned scheduling schemes offers a robust and reliable solution.
SUMMARY
Therefore, it is an object to provide topology agnostic decentralized job scheduling.
The scheduling client and the respective method, adapted to schedule a job in a network of compute nodes, disclosed herein achieve the aforementioned object. The scheduling client disclosed herein is employed on a compute node deployable in a network of compute nodes. As used herein, "compute node" refers to any device having a computational capability, for example, a field device such as a programmable logic controller (PLC), a human machine interface (HMI), an intelligent electronic device (IED), etc., in communication with one or more compute nodes over a network. The compute node, in one aspect, is an edge device such as an Internet of Things (IoT) device, a RaspberryPi, a server, a client, etc. As used herein, the term "network" refers to a communication network, for example, a wired network, a wireless network, a communication network, or a network formed from any combination of these networks. Moreover, the network may be a private network or a public network comprising clusters of compute nodes therewithin. The compute nodes may be clustered based on one or more parameters such as geography, the application for which the compute nodes are employed, the computational capability of the compute nodes, etc. Furthermore, the network is a cloud computing environment. As used herein, "cloud computing environment" refers to a processing environment comprising configurable compute nodes, for example, servers, storage, applications, services, etc., and data distributed over the network. The cloud computing environment provides on-demand network access to a shared pool of the compute nodes. In one aspect of the present invention, the scheduling client is an application deployable on a compute node. In another aspect, the scheduling client is an application accessible by the compute node as a service via the network. In yet another aspect, the scheduling client is dedicated per compute node.
The scheduling client schedules a job in the network. As used herein, the term "job" refers to a task to be performed by a compute node. For example, the job comprises processing data received from a sensor with which the compute node such as a PLC is communicating. The job thus comprises data associated with one or more of instructions, a resource limit required for execution of the instructions, and meta-data of the job. The scheduling client comprises a job execution module, a peer discovery module, and a consensus module. The scheduling client also comprises a user interface for rendering and receiving a job published onto the network. The scheduling client also comprises a network interface for establishing an interface into the network.
The scheduling client also comprises databases, for example, a node database and a scheduling database. As used herein, "databases" refer to a structured query language (SQL) data store or a not only SQL (NoSQL) data store. In one aspect, the databases can be a location on a file system that is directly accessible. In another aspect, the database is configured as a cloud-based database implemented in a cloud computing environment, where computing resources are delivered as a service over the network.
The node database comprises peer node data. As used herein, "peer node data" refers to data associated with compute nodes in the network, such as identifiers of each of the compute nodes, addresses of each of the compute nodes, etc. The job execution module of the scheduling client uses the peer node data for broadcasting a job onto the network of the compute nodes across clusters. In another aspect, the peer discovery module of the scheduling client uses the peer node data to rejoin the network after a hiatus, such as a restart of the compute node, etc. In yet another aspect, the peer discovery module of the scheduling client identifies compute nodes that are part of a cluster and updates the peer node data accordingly. In yet another aspect, the scheduling client comprises a health monitoring module for communicating the health of the compute node on which the scheduling client is employed. Advantageously, the health monitoring module updates the peer node data with healthy compute nodes, thereby eliminating availability of unhealthy compute nodes for scheduling of jobs. The node database further comprises a list of seed compute nodes, which are compute nodes used by every new compute node joining the network for the first time.
The scheduling database comprises a job pool storing incoming jobs in the network which are yet to be scheduled, that is, assigned to a compute node for execution. The job pool advantageously stores the time of receipt of a job onto the network. The scheduling database further comprises a resource pool storing resource availability data. As used herein, "resource availability data" refers to data associated with computing resources available for execution of a job at a compute node on which the scheduling client is employed. The resource availability data comprises, for example, a list of compute nodes in the network along with their respective available resources. The resource pool is advantageously configured as a heap-shaped data structure having a compute node with maximum available resources stored on top of the heap. The resource pool also comprises job allocation data, for example, data associated with a job being scheduled in a network or in a cluster of compute nodes in the network, and the scheduled job completing execution at the assigned compute node. Advantageously, when a scheduling client queries the status of an unfinished job by accessing the scheduling database and notices a job being assigned to a compute node which has become unhealthy during the course of time, then the job execution module of the compute node broadcasts the job onto the network as if broadcasted for the first time. Advantageously, the scheduling clients at each of the compute nodes eliminate duplication of entries of such jobs broadcasted into the respective job pools of their scheduling databases.
The job execution module of the scheduling client receives a job for execution at the compute node. The job execution module comprises a transaction module configured to generate a transaction based on the job. The transaction comprises the job, resource utilization limit for the job, and metadata associated with the job, for example, an identifier of the job. The job execution module stores the transaction in the scheduling database, that is, in the job pool of the scheduling database. Thus, advantageously, each compute node comprises an ever-growing immutable record of jobs received in a cluster or in a network in which the compute node is participating.
The peer discovery module receives the peer node data from the node database. The job execution module is in communication with the peer discovery module and uses the identifiers of the compute nodes in the peer node data to obtain resource availability data of each of the compute nodes in the network by accessing the scheduling database. The job execution module determines an executing compute node from the compute nodes in the network for executing the job based on the resource availability data. In one aspect, the job execution module retrieves the compute node stored at the top of the heap structure in the resource pool of the scheduling database for execution of the job. In another aspect, the job execution module retrieves a compute node from the resource pool of the scheduling database based on the amount of available resources of a compute node and the amount of resources required for execution of the job.
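Both selection aspects can be sketched as follows, taking a snapshot of the resource pool as a mapping from node identifiers to available resources; the tie-breaking rule of picking the smallest node that still covers the requirement is an assumption of this example, not part of the disclosure.

    from typing import Optional

    def select_executing_node(availability: dict[str, float],
                              required: Optional[float] = None) -> Optional[str]:
        """availability maps node identifiers to their available resources,
        i.e. a snapshot of the resource pool."""
        if not availability:
            return None
        if required is None:
            # First aspect: the node with the highest availability,
            # i.e. the node stored on top of the heap.
            return max(availability, key=availability.get)
        # Second aspect: match the job's requirement; picking the smallest
        # node that still covers it leaves larger nodes free for heavier jobs.
        candidates = {n: a for n, a in availability.items() if a >= required}
        if not candidates:
            return None
        return min(candidates, key=candidates.get)

    # Example: "node-B" is selected for a job needing 1.5 resource units.
    print(select_executing_node({"node-A": 4.0, "node-B": 2.0, "node-C": 1.0},
                                required=1.5))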
The consensus module participates in a consensus with the compute nodes of the network for updating the resource availability data stored in the scheduling database. The consensus module is responsible for participating in consensus with all compute nodes periodically and accordingly updating the resource pool in the scheduling databases of all compute nodes across clusters. Advantageously, the consensus happens irrespective of whether a job has been received at a compute node for scheduling. That is, via consensus it is ensured that the resource availability data remains updated across clusters of the network for any job to be scheduled at any point at any compute node. Therefore, the consensus module participates in a consensus periodically, for example, every ten seconds, thereby ensuring that the scheduling databases at each compute node remain updated after the job scheduling has happened, and thus the executing compute node selection is performed in a decentralized manner while remaining highly accurate. In one aspect, the consensus module comprises a stake identification module obtaining a stake from each of the compute nodes based on the scheduling history of the compute node. The stake identification module employs one or more algorithms comprising, for example, a proof-of-stake algorithm in which a winner is decided based on the stake that a compute node holds. The stake is determined based on a number of times a compute node has participated in scheduling of jobs, a number of jobs that a compute node has executed, the last time a compute node scheduled a job, etc. The stake identification module selects the compute node with the highest stake. This winning compute node is responsible for updating the resource pool of the scheduling databases across clusters of the compute nodes, ensuring that the heap structure is consistently maintained in all compute nodes. In another aspect, the consensus module comprises one or more modules employing algorithms comprising, for example, proof-of-work, proof-of-activity, etc., to identify a winning compute node responsible for updating the resource pool, thereby implicitly generating a consensus on the executing node being selected among the compute nodes.
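One way the stake computation and winner selection could look is sketched below; the disclosure names the contributing factors but not their weighting, so the weighting and example values used here are purely assumptions.

    from dataclasses import dataclass

    @dataclass
    class SchedulingHistory:
        node_id: str
        times_participated: int             # times the node took part in scheduling
        jobs_executed: int                  # jobs the node has executed
        seconds_since_last_schedule: float  # recency of its last scheduled job

    def stake(history: SchedulingHistory) -> float:
        # Illustrative weighting only; the disclosure lists the factors but
        # does not prescribe how they are combined.
        recency = 1.0 / (1.0 + history.seconds_since_last_schedule)
        return history.times_participated + history.jobs_executed + recency

    def elect_winner(histories: list[SchedulingHistory]) -> str:
        # The winning node updates the resource pool of the scheduling
        # databases across all clusters.
        return max(histories, key=stake).node_id

    winner = elect_winner([
        SchedulingHistory("plc-03", times_participated=12, jobs_executed=7,
                          seconds_since_last_schedule=40.0),
        SchedulingHistory("hmi-01", times_participated=9, jobs_executed=3,
                          seconds_since_last_schedule=5.0),
    ])  # -> "plc-03"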
Advantageously, if a compute node has not participated in consensus for a significant amount of time, for example, for more than ten seconds, then the compute node, while participating in consensus after the hiatus, retrieves updated data from the resource pool and the job pool of the scheduling database. The compute node achieves this by comparing the current timestamp with the timestamp of the last update made to the resource pool.
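This timestamp comparison reduces to a simple check, sketched below under the assumption of a ten-second consensus period:

    import time
    from typing import Optional

    CONSENSUS_PERIOD = 10.0  # seconds, matching the example above

    def needs_resync(last_resource_pool_update: float,
                     now: Optional[float] = None) -> bool:
        """True if the node missed at least one consensus round and should
        pull fresh copies of the resource pool and job pool before voting."""
        now = time.time() if now is None else now
        return now - last_resource_pool_update > CONSENSUS_PERIOD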
The job execution module processes the job at the executing compute node, that is, assigns the executing compute node to execute the scheduled job. The job execution module updates the databases including the node database and the scheduling database, for example, by adding data associated with the compute node that won the consensus, by removing the job from the job pool, and by updating the resource pool with the job allocation details.
Also disclosed herein is a compute node deployable in an automation environment comprising the aforementioned scheduling client. The compute node is in communication with one or more compute nodes forming the aforementioned network. Each of the compute nodes processes a job broadcasted on the network. The compute nodes comprise, for example, PLCs or HMIs in an industrial automation environment, IEDs in a substation automation environment, etc. In one aspect, the compute node is a seed node. As used herein, "seed node" refers to a compute node used by other compute nodes while joining the network for the first time and/or after crashes, repairs, maintenance, etc. Advantageously, at least one seed node is present in a network. Multiple seed nodes may exist to achieve reliability and redundancy in the network. Advantageously, the aforementioned scheduling client on each compute node is pre-configured with a list of seed nodes. The list comprises, for example, IP addresses and/or DNS names of the seed nodes that can be used by the scheduling client to connect into a network. On connecting with a seed node, the compute node registers itself with the seed node and requests a list of compute nodes already registered with the seed node. This list is then stored by the scheduling client employed on the compute node as peer node data in the node database. In one aspect, a compute node may use the peer node data while rejoining the network instead of approaching the seed node, thereby reducing the computational load of the seed node. Once registered, the scheduling client on the compute node broadcasts its resource availability to all the compute nodes on the network using the peer node data. The scheduling clients on other compute nodes listen to these broadcasts and update their peer node data accordingly.
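The join flow via a seed node could look roughly as follows; the HTTP endpoints /register and /peers, the seed node addresses, and the use of the requests library are hypothetical choices for this sketch only, as the disclosure does not fix a particular protocol.

    import requests

    SEED_NODES = ["http://10.0.0.1:5000", "http://seed-2.plant.local:5000"]

    def join_network(own_id: str, own_address: str) -> list[dict]:
        """Register with the first reachable seed node and fetch the peer list."""
        for seed in SEED_NODES:
            try:
                requests.post(f"{seed}/register",
                              json={"node_id": own_id, "address": own_address},
                              timeout=2)
                peers = requests.get(f"{seed}/peers", timeout=2).json()
                return peers  # stored as peer node data in the node database
            except requests.RequestException:
                continue  # try the next seed node for redundancy
        return []  # fall back to previously stored peer node data, if any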
Also disclosed herein is a method for scheduling a job.
The method employs a scheduling client deployable on a compute node in a network of compute nodes. The method comprises receiving a job for execution at the compute node. The method receives the job via a user interface of the scheduling client. In one aspect, a user using the compute node defines a job using the user interface of the scheduling client. The method comprises obtaining resource availability data from each of the compute nodes in the network. For obtaining the resource availability data, the method comprises receiving peer node data from the node database, and obtaining a compute capability of each of the compute nodes from the scheduling database.
The method comprises determining an executing compute node from the compute nodes in the network for executing the job based on the resource availability data. The method comprises selecting a compute node as the executing compute node based on a compute capability required for execution of the job. Advantageously, to save time in matching the compute capabilities of the compute nodes and the requirement of the job, the method comprises selecting a compute node as the executing compute node based on the highest available compute capability of a compute node. The method comprises processing the job at the executing compute node, that is, assigning the job to the executing compute node for execution.
In order to ensure that the executing compute node selection is valid every single time, the method comprises participating in a consensus with the compute nodes of the network for updating the resource availability data. The method obtains a stake from each of the compute nodes based on the scheduling history of the compute node and determines a winning compute node. This winning compute node is the compute node which has the maximum stake, that is, the maximum relevant data associated with resource availability for a given job. The scheduling client at this winning compute node then updates the resource pool of its scheduling database with the resource availability data, which is replicated across the scheduling databases of the scheduling clients at all compute nodes as immutable records. The method comprises updating the scheduling database and the node database as and when data associated with the job and the compute nodes changes. The data comprises, for example, allocation of a job to an executing compute node, completion of execution of a job, a compute node that is not healthy enough to participate in a consensus and/or execute a job, etc.

Also disclosed herein is a computer program product comprising machine-readable instructions stored therein, that when executed by one or more processors, cause the one or more processors to perform the aforementioned method for scheduling a job.
Thus, the scheduling client, the method, the computer program product, and the compute node disclosed herein enable real-time decentralized scheduling of jobs without requiring prior knowledge of the network topology, as the scheduling databases at every compute node are kept in sync via a process of consensus, thereby eliminating the need for a central server or for homogeneous compute nodes and thereby providing fault-tolerant, distributed, and secure job scheduling.
The above summary is merely intended to give a short overview over some features of some embodiments and implementations and is not to be construed as limiting. Other embodiments may comprise other features than the ones explained above.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other elements, features, steps and characteristics of the present disclosure will be more apparent from the following detailed description of embodiments with reference to the following figures:
FIG 1 schematically illustrates a network of compute nodes.
FIG 2A schematically illustrates an architecture of a compute node illustrated in FIG 1 employing a scheduling client for scheduling a job.
FIG 2B schematically illustrates a scheduling client illustrated in FIG 2A for scheduling a job.
FIG 3 is a flowchart illustrating a method performed by the scheduling client illustrated in FIGS 2A-2B for scheduling a job.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following, embodiments of the invention will be described in detail with reference to the accompanying drawings. It is to be understood that the following description of embodiments is not to be taken in a limiting sense. The scope of the invention is not intended to be limited by the embodiments described hereinafter or by the drawings, which are taken to be illustrative only.
The drawings are to be regarded as schematic representations, and elements illustrated in the drawings are not necessarily shown to scale. Rather, the various elements are represented such that their function and general purpose become apparent to a person skilled in the art. Any connection or coupling between functional blocks, devices, components, or other physical or functional units shown in the drawings or described herein may also be implemented by an indirect connection or coupling. A coupling between components may also be established over a wireless connection. Functional blocks may be implemented in hardware, firmware, software, or a combination thereof.
FIG 1 schematically illustrates a network 101 of compute nodes 102. These compute nodes 102 represent an industrial plant environment. The network 101 includes clusters 101A, 101B, 101C of compute nodes 102 separated, for example, based on geography, compute capabilities, etc. The compute nodes 102 are industrial devices supporting the Internet of Things (IoT). The cluster 101A has HMIs 102 in communication with one another. The cluster 101B has PLCs 102 from a section A of an industrial plant in communication with one another. The cluster 101C has PLCs 102 from a section B of the industrial plant in communication with one another. Each of the compute nodes 102 is operably communicating with a transaction pool 103 having transactions 104 stored therein. The transaction pool 103 has transactions 104 incoming into the industrial plant. Thus, all compute nodes 102 have an equal right to access each transaction 104. The transaction 104 comprises a job defined, for example, by a plant operator, a resource limit required for execution of the job, metadata such as tags, a type, etc., of the job, the compute node 102 on which the job was defined, and a unique identifier for the job.
FIG 2A schematically illustrates an architecture of a compute node 102 illustrated in FIG 1 employing a scheduling client 200 for scheduling a job. The compute node 102 is programmable using a high-level computer programming language. The compute node 102 may be implemented using programmed and purposeful hardware. As illustrated in FIG 2A, the compute node 102 comprises a processor 102A, a non-transitory computer readable storage medium such as a memory unit 102B for storing programs and data, an input/output (I/O) controller 102C, a network interface 102D, a data bus 102E, a display unit 102F, input devices 102G, a fixed media drive 102H such as a hard drive, a removable media drive 102I for receiving removable media, output devices 102J, and a scheduling client 200.
The processor 102A refers to any one of microprocessors, central processing unit (CPU) devices, finite state machines, microcontrollers, digital signal processors, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), etc., or any combination thereof, capable of executing computer programs or a series of commands, instructions, or state transitions. The processor 102A may also be implemented as a processor set comprising, for example, a general-purpose microprocessor and a math or graphics co-processor. The processor 102A is selected, for example, from the Intel® processors, Advanced Micro Devices (AMD®) processors, International Business Machines (IBM®) processors, etc. The compute node 102 disclosed herein is not limited to employing a processor 102A. The compute node 102 may also employ a controller or a microcontroller. The processor 102A executes the modules, for example, 201, 201A, 202, 202A, 203, 204, 205, etc., of the scheduling client 200.
The memory unit 102B is used for storing programs, applications, and data. For example, the modules 201-205 and the databases 206 and 207 of the scheduling client 200 are stored in the memory unit 102B of the compute node 102. The memory unit 102B is, for example, a random-access memory (RAM) or another type of dynamic storage device that stores information and instructions for execution by the processor 102A. The memory unit 102B also stores temporary variables and other intermediate information used during execution of the instructions by the processor 102A. The compute node 102 further comprises a read only memory (ROM) or another type of static storage device that stores static information and instructions for the processor 102A. The I/O controller 102C controls input actions and output actions performed by the scheduling client 200.
The network interface 102D enables connection of the compute node 102 to the network 101. For example, the scheduling client 200 of the compute node 102 connects to the network 101 via the network interface 102D. In an embodiment, the network interface 102D is provided as an interface card also referred to as a line card. The network interface 102D comprises, for example, interfaces using serial protocols, interfaces using parallel protocols, Ethernet communication interfaces, and interfaces based on wireless communications technology such as satellite technology, radio frequency (RF) technology, near field communication, etc. The data bus 102E permits communications between the modules, for example, 201-205, etc., of the scheduling client 200.

The display unit 102F, via the user interface 205, displays information, for example that of a job, with the help of user interface elements such as text fields, buttons, windows, etc., for allowing a user to define a job. The display unit 102F comprises, for example, a liquid crystal display, a plasma display, an organic light emitting diode (OLED) based display, etc. The input devices 102G are used for inputting data into the compute node 102. The input devices 102G are, for example, a keyboard such as an alphanumeric keyboard, a touch sensitive display device, and/or any device capable of sensing a tactile input.
Computer applications and programs are used for operating the compute node 102. The programs are loaded onto the fixed media drive 102H and into the memory unit 102B of the compute node 102 via the removable media drive 102I. In an embodiment, the computer applications and programs may be loaded directly via the network 101. The output devices 102J output the results of operations performed by the compute node 102. For example, a compute node 102 which is a PLC outputs a time series command for opening and closing of valves.
The processor 102A executes an operating system, for example, the Linux® operating system, the Unix® operating system, any version of the Microsoft® Windows® operating system, the Mac OS of Apple Inc., the IBM® OS/2, etc. The compute node 102 employs the operating system for performing multiple tasks. The operating system is responsible for management and coordination of activities and sharing of resources of the compute node 102. The operating system further manages security of the compute node 102, peripheral devices connected to the compute node 102, and network connections. The operating system employed on the compute node 102 recognizes, for example, inputs provided by the users using one of the input devices 102G, the output display, and files and directories stored locally on the fixed media drive 102H. The operating system on the compute node 102 executes different programs using the processor 102A. The processor 102A and the operating system together define a computer platform for which application programs in high level programming languages are written.
The processor 102A of the compute node 102 retrieves instructions defined by the modules 201-205 of the scheduling client 200 for performing respective functions disclosed in the detailed description of FIG 2B. The processor 102A retrieves instructions for executing the modules, for example, 201-205, of the scheduling client 200 from the memory unit 102B. A program counter determines the location of the instructions in the memory unit 102B. The program counter stores a number that identifies the current position in the program of each of the modules, for example, 201-205, etc., of the scheduling client 200. The instructions fetched by the processor 102A from the memory unit 102B are processed and decoded. The instructions are stored in an instruction register in the processor 102A. After processing and decoding, the processor 102A executes the instructions, thereby performing one or more processes defined by those instructions.
At the time of execution, the instructions stored in the instruction register are examined to determine the operations to be performed. The processor 102A then performs the specified operations. The operations comprise arithmetic operations and logic operations. The operating system performs multiple routines for performing the several tasks required to assign the input devices 102G, the output devices 102J, and memory for execution of the modules, for example, 201-205, etc. The tasks performed by the operating system comprise, for example, assigning memory to the modules, for example, 201-205, and to the data used by the modules, moving data between the memory unit 102B and disk units, and handling input/output operations. The operating system performs the tasks on request by the operations and, after performing the tasks, transfers the execution control back to the processor 102A. The processor 102A continues the execution to obtain one or more outputs. The outputs of the execution of the modules, for example, 201-204, etc., are displayed to the user on the user interface 205.
For purposes of illustration, the detailed description refers to the scheduling client 200 being run locally on the compute node 102; however, the scope of the present invention is not limited to the scheduling client 200 being run locally on the compute node 102 via the operating system and the processor 102A, but may be extended to the scheduling client 200 being run remotely over the network 101 by employing a web browser and a remote server, a mobile phone, or other electronic devices. One or more portions of the scheduling client 200 may be distributed across one or more compute nodes 102 coupled to the network 101.
Disclosed herein is also a computer program product comprising a non-transitory computer readable storage medium that stores computer program codes comprising instructions executable by at least one processor 102A for scheduling a job, as disclosed in the present invention. The computer program product comprises a first computer program code for receiving a job for execution at the compute node 102; a second computer program code for obtaining resource availability data from each of the compute nodes 102 in the network 101; a third computer program code for determining an executing compute node 102 from the compute nodes 102 in the network 101 for executing the job based on the resource availability data; a fourth computer program code for participating in a consensus with the compute nodes 102 of the network 101, for updating the resource availability data; and a fifth computer program code for processing the job at the executing compute node 102.
The second computer program code comprises a sixth computer program code for receiving peer node data from a node database 206 of the scheduling client 200 and a seventh computer program code for obtaining a compute capability of each of the compute nodes 102. The third computer program code comprises an eighth computer program code for selecting a compute node 102 with a compute capability required for execution of the job. The fourth computer program code comprises a ninth computer program code for obtaining a stake from each of the compute nodes 102 based on the scheduling history of the compute node 102.
In an embodiment, a single piece of computer program code comprising computer executable instructions performs one or more steps of the method according to the present invention for scheduling a job in a network. The computer program codes comprising computer executable instructions are embodied on the non-transitory computer readable storage medium. The processor 102A of the compute node 102 retrieves these computer executable instructions and executes them. When the computer executable instructions are executed by the processor 102A, the computer executable instructions cause the processor 102A to perform the steps of the method.
FIG 2B schematically illustrates the scheduling client 200 illustrated in FIG 2A for scheduling a job. The scheduling client 200 is installed on and accessible by a compute node 102 illustrated in FIG 1 for scheduling a job. The scheduling client 200 is in communication with the scheduling clients 200 installed on other compute nodes 102 via the network 101. The scheduling client 200 is downloadable and usable on the compute node 102, or is configured as a web-based platform, for example, a website hosted on a server or a network of servers, or is implemented in the cloud computing environment as a cloud computing-based platform implemented as a service for scheduling a job.
The scheduling client 200 comprises a non-transitory computer readable storage medium and at least one processor communicatively coupled to the non-transitory computer readable storage medium. As used herein, "non-transitory computer readable storage medium" refers to all computer readable media, for example, non-volatile media, volatile media, and transmission media, except for a transitory, propagating signal. The non-transitory computer readable storage medium is configured to store computer program instructions defined by modules, for example, 201, 202, 203, 204, 205, etc. The processor is configured to execute the defined computer program instructions. As illustrated in FIG 2B, the scheduling client 200 comprises a user interface 205. A user using the compute node 102 can access the scheduling client 200 via the user interface 205. The scheduling client 200 comprises a job execution module 201, a consensus module 202, a peer discovery module 203, a health monitoring module 204, a user interface 205, a node database 206, and a scheduling database 207.

The job execution module 201 receives a job for execution at a compute node 102. The job execution module 201 comprises a transaction module 201A. The transaction module 201A generates a transaction 104 based on the job. The transaction module 201A stores each transaction 104 in a job pool of the scheduling database 207. The job execution module 201 is in communication with the peer discovery module 203. The peer discovery module 203 receives peer node data from the node database 206 or from the compute nodes 102 that are healthy. The health monitoring module 204 of the scheduling client 200 periodically broadcasts the health of the compute node 102 on the network 101. The peer node data comprises identifiers for each of the healthy compute nodes 102 of the network 101. The job execution module 201 establishes communication with each of the compute nodes 102 using the peer node data and obtains their respective resource availability data. The job execution module 201 determines an executing compute node 102 from the compute nodes 102 in the network 101 for executing the job based on the resource availability data. The consensus module 202 participates in a consensus with the compute nodes 102 of the network 101 for updating the resource availability data post scheduling of the job. The consensus module 202 comprises a stake identification module 202A. The stake identification module 202A obtains a stake from each of the compute nodes 102 based on the scheduling history of the compute node 102. The consensus module 202 thus selects a winning compute node 102 that has the maximum stake in the network, for example, a compute node that has scheduled the maximum number of jobs. All the participating compute nodes 102 update their respective resource pools in their respective scheduling databases with the data being broadcasted by the winning compute node 102, thereby ensuring that the resource availability data is kept up to date. The job execution module 201 processes the job at the executing compute node 102. After execution of the job, the job execution module 201 removes the corresponding transaction 104 from the job pool of the scheduling database 207, updates the resource pool of the scheduling database 207 with the resource availability of the compute node 102, etc.
FIG 3 is a flowchart illustrating a method 300 performed by the scheduling client 200 illustrated in FIGS 2A-2B for scheduling a job. The method 300 employs the scheduling client 200 illustrated in FIGS 2A-2B for scheduling a job received at a compute node 102. At step 301, the job execution module 201 receives a job at the compute node 102 on which the scheduling client 200 is employed. At step 301A, the transaction module 201A of the job execution module 201 generates a transaction 104 using data associated with the job and stores the transaction 104 in a job pool of the scheduling database 207. At step 301B, the peer discovery module 203 of the scheduling client 200 obtains peer node data from the node database 206 of the scheduling client 200. At step 301C, the job execution module 201 broadcasts the job to all the compute nodes 102 listed in the peer node data. The scheduling clients 200 on each of these compute nodes 102 are actively listening to the broadcasted jobs as long as the memory of their job pools is not exhausted. Each of the scheduling clients 200 replicates the step 301A and stores the transactions 104 in their respective job pools within the scheduling databases 207.
At step 302, the job execution module 201 obtains resource availability data from each of the compute nodes 102 listed in the peer node data. The health monitoring module 204, in the background, periodically keeps the peer node data updated with healthy compute nodes 102. At step 302A, the job execution module 201 accesses the resource pool in the scheduling database 207. The resource pool stores resource availability data comprising data associated with computing resources available for execution of a job at a compute node 102, that is, a list of compute nodes 102 in the network 101 along with their respective available resources. The resource pool is advantageously configured as a heap structure having the compute node 102 with maximum available resources stored on top of the heap. The resource pool also comprises job allocation data, for example, data associated with a job being scheduled in the network 101 or in a cluster 101A, 101B, or 101C of the network 101, and with the scheduled job completing execution at an assigned compute node 102. At step 302B, the job execution module 201 obtains the compute node 102 having maximum resource availability from the resource pool. At step 303, the job execution module 201 selects this compute node 102 as the executing compute node 102. At step 303A, the job execution module 201 updates the job allocation data stored in the resource pool of the scheduling database 207 with the executing compute node 102, as well as the resource availability data in view of the job scheduled.
At step 304, the consensus module 202 of the executing compute node 102 participates in a consensus with the consensus modules 202 of the other compute nodes 102 on the network 101. At step 304A, the stake identification module 202A of the consensus module 202 obtains a stake from each of the participating compute nodes 102 based on one or more parameters associated with the scheduling history of the compute nodes 102, such as the number of times a compute node 102 has participated in scheduling of jobs, the number of jobs that a compute node 102 has scheduled, the last time a compute node 102 scheduled a job, etc. At step 304B, the compute node 102 that has won the consensus broadcasts its data on the network. At step 304C, the consensus modules 202 of the scheduling clients 200 on all participating compute nodes 102 update their respective scheduling databases 207 with a copy of the data published by the compute node 102 that won the consensus. Thus, the resource availability data in the resource pool in the scheduling database of the scheduling client at every compute node is updated for the next job to be scheduled. At step 305, the job execution module 201 processes the job, that is, assigns the executing compute node 102 for execution of the scheduled job. At step 305A, the job execution module 201 updates the scheduling database 207 by removing the transaction from the job pool and updating the resource availability of the compute node in the resource pool of the scheduling database 207.
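For orientation only, the flow of steps 301 through 305A can be condensed into the skeleton below; it reuses make_transaction and select_executing_node from the earlier sketches, and the injected objects (scheduling_db, node_db, network) together with their helper methods (add, broadcast, snapshot, allocate, run_consensus, apply, dispatch, remove) are assumed placeholders for the modules and databases of the scheduling client rather than a prescribed interface.

    def schedule_job(job_payload, resource_limit, scheduling_db, node_db, network):
        """Condensed, assumption-laden view of the method 300."""
        tx = make_transaction(job_payload, resource_limit)      # step 301A
        scheduling_db.job_pool.add(tx)                          # step 301A
        peers = node_db.healthy_peers()                         # step 301B
        network.broadcast(tx, peers)                            # step 301C
        availability = scheduling_db.resource_pool.snapshot()   # steps 302-302B
        executor = select_executing_node(availability)          # step 303
        scheduling_db.resource_pool.allocate(executor, tx)      # step 303A
        winner_state = network.run_consensus()                  # steps 304-304B
        scheduling_db.apply(winner_state)                       # step 304C
        network.dispatch(tx, executor)                          # step 305
        scheduling_db.job_pool.remove(tx.job_id)                # step 305A
        return executor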
Where databases are described, such as the node database 206 or the scheduling database 207, it will be understood by one of ordinary skill in the art that (i) alternative database structures to those described may be readily employed, and (ii) other memory structures besides databases may be readily employed. Any illustrations or descriptions of any sample databases disclosed herein are illustrative arrangements for stored representations of information. Any number of other arrangements may be employed besides those suggested by tables illustrated in the drawings or elsewhere. Similarly, any illustrated entries of the databases represent exemplary information only; one of ordinary skill in the art will understand that the number and content of the entries can be different from those disclosed herein. Further, despite any depiction of the databases as tables, other formats including relational databases, object-based models, and/or distributed databases may be used to store and manipulate the data types disclosed herein. Likewise, object methods or behaviors of a database can be used to implement various processes such as those disclosed herein. In addition, the databases may, in a known manner, be stored locally or remotely from a device that accesses data in such a database. In embodiments where there are multiple databases in the system, the databases may be integrated to communicate with each other for enabling simultaneous updates of data linked across the databases, when there are any updates to the data in one of the databases.
The foregoing examples have been provided merely for the purpose of explanation and are in no way to be construed as limiting of the present invention disclosed herein. While the invention has been described with reference to various embodiments, it is understood that the words which have been used herein are words of description and illustration, rather than words of limitation. Further, although the invention has been described herein with reference to particular means, materials, and embodiments, the invention is not intended to be limited to the particulars disclosed herein; rather, the invention extends to all functionally equivalent structures, methods and uses, such as are within the scope of the appended claims. Those skilled in the art, having the benefit of the teachings of this specification, may effect numerous modifications thereto, and changes may be made without departing from the scope and spirit of the invention in its aspects.
List of reference numerals

101 network
101A, 101B, 101C clusters
102 compute node/executing compute node
103 transaction pool
104 transaction
200 scheduling client
201 job execution module
201A transaction module
202 consensus module
202A stake identification module
203 peer discovery module
204 health monitoring module
205 user interface
206 node database
207 scheduling database
102A processor
102B memory unit
102C I/O controller
102D network interface
102E data bus
102F display unit
102G input devices
102H fixed media drive
102I removable media drive
102J output devices

Claims
1. A scheduling client (200), employed on a compute node (102) deployable in a network (101) of compute nodes (102), for scheduling a job in the network (101), and comprising:
- a job execution module (201) configured to receive a job for execution at the compute node (102);
- the job execution module (201) further configured to obtain resource availability data from each of the compute nodes (102) in the network (101);
- the job execution module (201) further configured to determine an executing compute node (102) from the compute nodes (102) in the network (101) for executing the job based on the resource availability data;
- a consensus module (202) configured to participate in a consensus with the compute nodes (102) of the network (101), for updating the resource availability data; and
- the job execution module (201) further configured to process the job at the executing compute node (102).
2. The scheduling client (200) according to the claim 1, wherein the job execution module (201) is in communication with a peer discovery module (203) of the scheduling client (200), and wherein the peer discovery module (203) is configured to receive peer node data from a node database (206), wherein the peer node data comprises identifiers for each of the compute nodes (102) of the network (101).
3. The scheduling client (200) according to the claim 1, wherein the job execution module (201) comprises a transaction module (201A) configured to generate a transaction based on the job, wherein the transaction comprises the job, resource utilization limit for the job, and metadata associated with the job.
4. The scheduling client (200) according to the claim 3, wherein the job execution module (201) is further configured to store the transaction in a scheduling database (207).
5. The scheduling client (200) according to any one of the claims 1 and 5, wherein the consensus module (202) comprises a stake identification module (202A) configured to obtain a stake from each of the compute nodes (102) based on scheduling history of the compute node (102).
6. A method (300) for scheduling a job, the method employing a scheduling client (200) deployable on a compute node (102) in a network (101) of compute nodes (102), and comprising:
- receiving a job for execution at the compute node (102);
- obtaining resource availability data from each of the compute nodes (102) in the network (101);
- determining an executing compute node (102) from the compute nodes (102) in the network (101) for executing the job based on the resource availability data;
- participating in a consensus with the compute nodes (102) of the network (101), for updating the resource availability data; and
- processing the job at the executing compute node (102).
7. The method (300) according to the claim 6, wherein obtaining the resource availability data comprises receiving peer node data from a node database (206), wherein the peer node data comprises identifiers for each of the compute nodes (102) of the network (101).
8. The method (300) according to the claim 6, wherein determining the executing compute node (102) comprises selecting a compute node (102) with a compute capability required for execution of the job.
9. The method (300) according to the claim 6, wherein participating in the consensus comprises obtaining a stake from each of the compute nodes (102) based on scheduling history of the compute node (102).
10. A computer program product comprising machine-readable instructions stored therein, that when executed by one or more processors (102A), cause the one or more processors (102A) to perform a method (300) according to the claims 6-9.
11. A compute node (102) deployable in an automation environment comprising a scheduling client (200) according to the claims 1-5.
12. The compute node (102) according to the claim 11, wherein the compute node (102) is in communication with one or more compute nodes (102) forming a network (101) therebetween, and wherein each of the compute nodes (102) participates in a consensus with the compute nodes (102) of the network (101).
PCT/EP2019/054753 2019-02-26 2019-02-26 Decentralized job scheduling in a topology agnostic environment WO2020173553A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/054753 WO2020173553A1 (en) 2019-02-26 2019-02-26 Decentralized job scheduling in a topology agnostic environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/054753 WO2020173553A1 (en) 2019-02-26 2019-02-26 Decentralized job scheduling in a topology agnostic environment

Publications (1)

Publication Number Publication Date
WO2020173553A1 true WO2020173553A1 (en) 2020-09-03

Family

ID=65598653

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/054753 WO2020173553A1 (en) 2019-02-26 2019-02-26 Decentralized job scheduling in a topology agnostic environment

Country Status (1)

Country Link
WO (1) WO2020173553A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500521A (en) * 2020-11-13 2022-05-13 中国移动通信有限公司研究院 Computing power scheduling method, device, scheduling equipment, system and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170279719A1 (en) * 2014-06-06 2017-09-28 Google Inc. Tournament scheduling

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170279719A1 (en) * 2014-06-06 2017-09-28 Google Inc. Tournament scheduling

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DR ARATI BALIGA: "Understanding Blockchain Consensus Models", 30 April 2017 (2017-04-30), XP055516322, Retrieved from the Internet <URL:https://pdfs.semanticscholar.org/da8a/37b10bc1521a4d3de925d7ebc44bb606d740.pdf> [retrieved on 20181017] *
GIL DAGAN: "The Actual Networking behind the Ethereum Network: How It Works", 16 August 2018 (2018-08-16), XP055626633, Retrieved from the Internet <URL:https://medium.com/orbs-network/the-actual-networking-behind-the-ethereum-network-how-it-works-6e147ca36b45> [retrieved on 20190926] *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114500521A (en) * 2020-11-13 2022-05-13 中国移动通信有限公司研究院 Computing power scheduling method, device, scheduling equipment, system and storage medium

Similar Documents

Publication Publication Date Title
CN111108479B (en) Autonomic multi-tenant database cloud service framework
CN104937546B (en) Method and apparatus for performing machine according to scheduling is restarted and restarting
CN100418057C (en) Method and system for grid-enabled virtual machines with distributed management of applications
CN1906580B (en) Method and system for a grid-enabled virtual machine with movable objects
US8832130B2 (en) System and method for implementing on demand cloud database
CN101533417B (en) A method and system for realizing ETL scheduling
CN102165419B (en) Computer system, method, and computer program for managing batch job
US8200620B2 (en) Managing service processes
CN111274001B (en) Micro-service management platform
CN105830049A (en) Automated experimentation platform
JP2018133084A (en) System for optimizing distribution of processing of automated process
CN109871384B (en) Method, system, equipment and storage medium for container migration based on PaaS platform
Wang et al. Pigeon: An effective distributed, hierarchical datacenter job scheduler
CN101715001A (en) Method for controlling execution of grid task
US20140351409A1 (en) Monitoring client information in a shared environment
AU2020223693B2 (en) Rule-based action triggering in a provider network
CN113645300A (en) Node intelligent scheduling method and system based on Kubernetes cluster
US20190250958A1 (en) Remotely managing execution of jobs in a cluster computing framework
Mohamed et al. Hadoop-MapReduce job scheduling algorithms survey
CN105303290A (en) Managing data in a manufacturing execution system
Carrión Kubernetes as a standard container orchestrator-a bibliometric analysis
WO2020173553A1 (en) Decentralized job scheduling in a topology agnostic environment
CN112714022A (en) Control processing method and device for multiple clusters and computer equipment
Liu et al. KubFBS: A fine‐grained and balance‐aware scheduling system for deep learning tasks based on kubernetes
CN113010290A (en) Task management method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19708085

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19708085

Country of ref document: EP

Kind code of ref document: A1