METHOD, SYSTEM AND COMPUTER PROGRAM FOR SCHEDULING EXECUTION
OF JOBS DRIVEN BY EVENTS
Technical Field
The present invention relates to the data processing field. More specifically, the present invention relates to the scheduling of the execution of jobs in a data processing system.
Background
Workload schedulers (or simply schedulers) are commonly used to control the execution of large quantities of jobs in a data processing system. An example of a commercial scheduler is the "IBM Tivoli Workload Scheduler" (TWS) by IBM Corporation.
The jobs consist of any sort of work units that can be executed in the system. For example, as described in US-A-7,150,037, the scheduler is used to control the downloading of configuration files to network devices (in a network configuration management system). Each configuration file is generated dynamically by evaluating corresponding policies, which are formed by one or more rules; each rule includes conditions (for determining how to identify the corresponding network devices in an infrastructure database), actions (for determining how to set desired configuration parameters) and verifications (for determining how to interpret any discrepancy between the rules and the actual configurations of the corresponding network devices during the verification of a network configuration).
Typically, the scheduler controls the execution of the jobs on multiple workstations from a central scheduling server; the workstation for each job may be either defined statically or selected dynamically when the job is submitted for execution (among all the available ones having the required characteristics). The latter solution allows implementing systems that are easily scalable and highly reliable; moreover, workload balancing techniques may be exploited to optimize the distribution of the jobs on the workstations.
The submission of the jobs is controlled according to a predefined workload plan (or simply plan). The plan establishes a flow of execution of the jobs based on temporal constraints (i.e., date and/or time); in addition, the execution of the jobs may also be conditioned on specific dependencies (such as the completion of preceding jobs). However, the schedulers are completely ineffective in controlling the execution of jobs that are not defined in the plan. This is a problem when the need to execute a job is not known a priori (for example, because it is triggered by the occurrence of a specific event). The general idea of controlling the execution of tasks either on a scheduled basis or in response to the occurrence of events has already been proposed in a very specific application, as described in US-A-7,146,350. Particularly, this document discloses a system for auditing an Information Technology (IT) infrastructure of an enterprise. For this purpose, a server of the system controls the execution of
static assessments or dynamic assessments (including sequences of steps defined in corresponding policies) of particular resources of the IT infrastructure. The assessments may be triggered by exploiting a scheduler as usual to provide year, date and time-of-day information; alternatively, the same assessments may also be triggered by predefined events detected on nodes of the system. For this purpose, each node must monitor all the possible events of interest; the information so obtained is then collected on the server from the different nodes. However, this brings about an overhead on the nodes (for detecting the events) and on the server (for collecting them); moreover, the large amount of information transmitted from the nodes to the server for collecting the events involves a significant increase in network traffic.
Summary
In its general terms, the present disclosure is aimed at supporting the scheduling of jobs either according to a plan or in response to events. Particularly, different aspects of the present invention provide a solution as set out in the independent claims. Advantageous embodiments of the invention are described in the dependent claims.
More specifically, an aspect of the invention proposes a method for scheduling execution of jobs on target entities
(such as workstations) of a data processing system - under the control of a scheduling entity of the system (such as a scheduling server). The method starts with the step of providing a plan, which defines a flow of execution of a set of jobs. The method continues by submitting each job for execution on a selected target entity according to the plan. A set of rules is also provided; each rule defines an action to be executed on an action target entity in response to an event on an event target entity. The method then includes the step of determining the events that are defined for each event target entity in the rules. Each event target entity is then enabled to detect the corresponding events. The execution of each action on the corresponding action target entity is now triggered in response to the detection of the corresponding event.
For example, the actions may consist of further jobs that are not even defined in the plan.
In a suggested implementation, each (event) workstation is enabled to detect the corresponding events by deploying a configuration structure for one or more detection modules running on it.
As a further improvement, the deployment of the configuration structure is prevented when it is equal to a previous version thereof that is already available on the workstation.
For this purpose, it is possible to compare digest values of the two versions of the configuration structure.
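By way of non-limiting illustration only, the digest comparison may be sketched as follows (a minimal Python sketch; the CRC-32 function merely stands in for any digest value, and all names are illustrative):

```python
import zlib

def needs_deployment(new_archive: bytes, old_digest) -> bool:
    """Deploy the configuration archive only when its digest differs
    from the digest of the version already available on the
    workstation; CRC-32 stands in for any digest value here."""
    return old_digest is None or zlib.crc32(new_archive) != old_digest
```

Transmitting only the short digest (rather than the whole archive) keeps the comparison cheap; the full archive is downloaded only on a mismatch.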
In an embodiment of the invention, the server receives the notification of each event from each (event) workstation, and then submits the corresponding action for execution on the relevant (action) workstation.
A way to further improve the solution is to monitor the rules, so as to perform the operations described above only in response to any change thereof.
As a further enhancement, those operations are restricted to the (event) workstations that are impacted by the changed rules.
Another aspect of the invention proposes a computer program for performing the above-described method.
A different aspect of the invention proposes a corresponding system.
Brief Description of the Drawings
The invention itself, as well as further features and the advantages thereof, will be best understood with reference to the following detailed description, given purely by way of a non-restrictive indication, to be read in conjunction with the accompanying drawings, in which:
FIG.1 is a schematic block diagram of a data processing system in which the solution according to an embodiment of the invention may be applied;
FIG.2 shows the functional blocks of an exemplary computer of the system;
FIG.3 illustrates the main software components that can be used to implement the solution according to an embodiment of the invention, and
FIGs.4A-4B show a diagram describing the flow of activities relating to an implementation of the solution according to an embodiment of the invention.
Detailed Description
With reference in particular to FIG.1, a data processing system 100 with distributed architecture is illustrated. The system 100 includes a scheduling server (or simply server) 105, which is used to control the execution of jobs in the system 100; typically, the jobs consist of batch (i.e., non-interactive) applications - such as payroll or cost-analysis programs. The jobs are executed under the control of the server 105 on a plurality of target workstations (or simply workstations) 110. For this purpose, the server 105 and the workstations 110 communicate through a network 115 (for example, a LAN).
Moving to FIG.2, a generic computer of the above-described system (server or workstation) is denoted with 200. The computer 200 is formed by several units that are connected in parallel to a system bus 205 (with a structure that is suitably scaled according to the actual function of the computer 200 in the system). In detail, one or more microprocessors (µP) 210 control operation of the computer 200; a RAM 215 is directly used as a working memory by the microprocessors 210, and a ROM 220 stores basic code for a bootstrap of the computer 200. Several peripheral units are clustered around a local bus 225 (by means of respective interfaces). Particularly, a mass memory consists of one or more hard-disks 230 and drives 235 for reading CD-ROMs 240. Moreover, the computer 200 includes input units 245 (for example, a keyboard and a mouse), and output units 250 (for example, a monitor and a printer). An adapter 255 is used to connect the computer 200 to the network (not shown in the figure). A bridge unit 260 interfaces the system bus 205 with the local bus 225. Each microprocessor 210 and the bridge unit 260 can operate as master agents requesting an access to the system bus 205 for transmitting information. An arbiter 265 manages the granting of the access with mutual exclusion to the system bus 205.
Considering now FIG.3, the main software components that can be used to implement the solution according to an embodiment of the invention are denoted as a whole with the reference 300. The information (programs and data) is typically stored on the hard-disk and loaded (at least partially) into the working memory of each computer when the programs are running, together with an operating system and other application programs (not shown in the figure). The programs are initially installed onto the hard-disk, for example, from CD-ROM.
In detail, the server 105 runs a scheduler 305 (for example, the above-mentioned TWS).
The scheduler 305 includes a configurator 310 (such as the "Composer" of the TWS), which is used to maintain a workload database 315 (written in a suitable control language - for example, XML-based). The workload database 315 contains a definition of all the workstations that are available to the scheduler 305; for example, each workstation is defined by information for accessing it (such as name, address, and the like), together with its physical/logical characteristics (such as processing power, memory size, operating system, and the like). The workload database 315 also includes a descriptor of each job. The job descriptor specifies the programs to be invoked (with their arguments and environmental variables). Moreover, the job descriptor indicates the workstations on which the job may be executed - either statically (by their names) or dynamically (by their characteristics). The job descriptor then provides temporal constraints for the execution of the job (such as its run-cycle - like every day, week or month - an earliest time or a latest time for its starting, or a maximum allowable duration). Optionally, the job descriptor specifies dependencies of the job (i.e., conditions that must be met before the job can start); exemplary dependencies are sequence constraints (such as the successful completion of other jobs), or enabling constraints (such as the entering of a response to a prompt by an operator). Generally, the jobs are organized into streams; each job stream consists of an ordered sequence of (logically related) jobs, which should be run as a single work unit respecting predefined dependencies. For the sake of simplicity, the term job will be used hereinafter to denote either a single job or a job stream. The workload database 315 also stores statistical information relating to previous executions of the jobs (such as a log of their durations, from which a corresponding estimated duration may be inferred).
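Purely by way of illustration, the role of the temporal constraints and dependencies recorded in a job descriptor may be sketched as follows (a hypothetical, much-simplified Python model; the field names are not those of the actual control language):

```python
from dataclasses import dataclass, field

@dataclass
class JobDescriptor:
    # Hypothetical, much-simplified job descriptor; the actual control
    # language (XML-based) carries many more attributes.
    name: str
    earliest: int = 0                # earliest start (minutes into the day)
    latest: int = 24 * 60            # latest start
    depends_on: list = field(default_factory=list)  # sequence constraints

def is_ready(job: JobDescriptor, now: int, completed: set) -> bool:
    """A job may start once its time window is open and every job it
    depends on has completed successfully."""
    return (job.earliest <= now <= job.latest
            and all(dep in completed for dep in job.depends_on))
```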
A planner 320 (such as the "Master Domain Manager" of the TWS) is used to create a plan, whose definition is stored in a control file 325 (such as the "Symphony" file of the TWS). The plan specifies the flow of execution of a batch of jobs in a specific production period (typically, one day), together with the definition of the required workstations. A new plan is generally created automatically before every production period. For this purpose, the planner 320 processes the information available in the workload database 315 so as to select the jobs to be run and to arrange them in the desired sequence (according to their expected duration, temporal
constraints, and dependencies) . The planner 320 creates the plan by adding the jobs to be executed (for the next production period) and by removing the preexisting jobs (of the previous production period) that have been completed; in addition, the jobs of the previous production period that did not complete successfully or that are still running or waiting to be run can be maintained in the plan (for their execution during the next production period) .
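The carry-forward behavior described above may be sketched, under the assumption that jobs are identified simply by name, as follows (illustrative Python only):

```python
def build_plan(previous_plan, completed, next_period_jobs):
    """Create the new plan: drop the jobs of the previous production
    period that have completed, carry forward the unfinished ones, and
    append the jobs selected for the next period."""
    carried = [job for job in previous_plan if job not in completed]
    return carried + [job for job in next_period_jobs if job not in carried]
```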
A handler 330 (such as the "Batchman" process of the TWS) starts the plan at the beginning of every production period. The handler 330 submits each job for execution as soon as possible; for this purpose, the handler 330 selects a workstation - among the available ones - having the required characteristics (typically, according to information provided by a load balancer - not shown in the figure).
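The dynamic selection performed by the handler may be sketched as follows (a minimal Python sketch assuming a simple least-loaded balancing policy; the dictionary fields are illustrative):

```python
def select_workstation(workstations, required, load):
    """Pick, among the available workstations offering the required
    characteristics, the least loaded one (a deliberately simple
    balancing policy; the dictionary fields are illustrative)."""
    candidates = [ws for ws in workstations
                  if ws["available"] and required <= ws["features"]]
    return min(candidates, key=lambda ws: load[ws["name"]], default=None)
```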
The actual execution of the jobs is managed by a corresponding executor module 335 (such as the "Jobman" process of the TWS); for this purpose, the executor 335 interfaces with an execution agent 340 running on each workstation 110 (only one shown in the figure).
The agent 340 enforces the execution of each job in response to a corresponding command received from the executor 335, and returns feedback information relating to the result of its execution (for example, whether the job has been completed successfully, its actual duration, and the like). The feedback information of all the executed jobs is passed by the executor 335 to the handler 330, which enters it into the control file 325 (so as to provide a real-time picture of the current state of all the jobs of the plan). At the end of the production period, the planner 320 accesses the control file 325 to update the statistical information relating to the executed jobs in the workload database 315.
In the solution according to an embodiment of the present invention, as described in detail in the following, the scheduler 305 also supports the execution of jobs (or more generally, any other actions) in response to corresponding events. For this purpose, each workstation is enabled to detect only the events of interest - i.e., the ones whose occurrence on the workstation triggers the execution of a corresponding action (for example, by deploying customized configuration files selectively).
In this way, the scheduler can control the execution of any actions, even when the need for their execution is not known a priori; particularly, this allows submitting jobs that are not defined in the plan. In any case, the desired result is achieved with minimal overhead on the workstations and the server; moreover, no significant increase in network traffic is brought about.
More specifically, in the implementation illustrated in the figure, an editor 345 is used to maintain a rule repository 350 (preferably secured by an authentication/authorization mechanism to control any update thereof). Each rule in the repository 350 defines an action to be executed on a corresponding (action) workstation in response to the detection of an event on a corresponding (event) workstation. A number of different events may be supported; for example, the events may consist of the entering of an error condition for a job, the shutdown of a workstation, the creation or deletion of a file, and the like. Typically, the actions consist of the submission of a job for its execution; in this respect, it is emphasized that the rule can specify any job, even if it is not included in the plan. However, other actions may be supported - for example, an e-mail notification to a user, the turn-on of a workstation, and the like. The events may be detected and the actions may be executed on any computer of the system; for example, the events relating to the change of status of the jobs are detected by the server itself (in this case operating as a workstation as well); moreover, the actions consisting of the submissions of the jobs may be executed on workstations that are defined either statically or dynamically (according to required characteristics).
A set of plug-in modules (or simply plug-ins) is provided for detecting the events and for executing the actions (different from the submission of the jobs); an example of (event) plug-in may be a file scanner, whereas an example of (action) plug-in may be an e-mail sender. The rule repository 350 is accessed by the planner 320 (so as to add the information required for the detection of the events and the execution of the corresponding actions into the control file 325). An event plug-in database 355 associates each event with the corresponding event plug-in for its detection. A monitor 360 processes the rules in the repository 350 (for example, whenever a change is detected). More specifically, the monitor 360 determines the events that are defined for each workstation in the rules. The monitor 360 then creates a configuration file for each event plug-in associated with these events (as indicated in the event plug-in database 355); the configuration file sets configuration parameters of the event plug-in that enable it to detect the desired event(s). The configuration files of each workstation are then combined into a single configuration archive (for example, in a compressed form). The monitor 360 saves all the configuration archives so obtained into a corresponding repository 365. At the same time, the monitor 360 calculates a Cyclic Redundancy Code (CRC) of each configuration archive (by applying a 16- or 32-bit polynomial to it). A configuration table 370 is used to associate each workstation with the corresponding configuration archive and its CRC (under the control of the monitor 360). A deployer 375 transmits each CRC to the corresponding workstation (as indicated in the configuration table 370); for this purpose, the deployer 375 retrieves the required information from the definition of the workstations in the control file 325.
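The processing performed by the monitor 360 may be sketched as follows (illustrative Python only; rules are modeled as (event, event workstation, action, action workstation) tuples, and a ZIP file stands in for the compressed configuration archive):

```python
import io
import zipfile
import zlib

def build_archives(rules, plugin_for_event):
    """For each (event) workstation, collect the events defined in the
    rules, generate one configuration file per event plug-in, pack the
    files into a compressed archive and compute its CRC.  Rules are
    modeled as (event, event workstation, action, action workstation)
    tuples; all names are illustrative."""
    per_workstation = {}
    for event, event_ws, _action, _action_ws in rules:
        plugin = plugin_for_event[event]
        per_workstation.setdefault(event_ws, {}).setdefault(plugin, []).append(event)
    archives = {}
    for ws, by_plugin in per_workstation.items():
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
            for plugin, events in by_plugin.items():
                # one configuration file per event plug-in, listing the
                # events it must be enabled to detect
                archive.writestr(plugin + ".cfg", "\n".join(sorted(events)))
        data = buffer.getvalue()
        archives[ws] = (data, zlib.crc32(data))
    return archives
```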
With reference to the same workstation 110 as above for the sake of simplicity, this information is received by a controller 380. The controller 380 accesses the current configuration files (denoted with 385) of the (event and/or action) plug-ins that are installed on the workstation 110 (denoted with 390). When the received CRC differs from the one of the configuration files 385, the controller 380 downloads the (new) configuration archive from the server 105 (through the deployer 375), and then updates the configuration files 385 accordingly; preferably, the configuration archives provided by the server 105 are encrypted and secured, so as to ensure their confidentiality and integrity. The plug-ins 390 interface with the agent 340 for exchanging information with the server 105. Particularly, the agent 340 notifies the events detected on the workstation 110 to an event collector 391; preferably, the notifications of the events provided by the workstation 110 are encrypted and secured, so as to ensure their confidentiality and integrity. The event collector 391 passes the notifications of the events detected on all the workstations to an event correlator 392. The event correlator 392 accesses the rule repository 350, so as to determine the actions to be executed in response thereto (together with the corresponding workstations). For each action to be executed on a specific workstation, the event correlator 392 calls the handler 330 (by passing this information). The handler 330 accesses an action plug-in database 393, which associates each action with the corresponding action plug-in for its execution. The handler 330 then invokes the action plug-in - denoted as a whole with 394 - associated with the action to be executed (as indicated in the action plug-in database 393). Each action plug-in 394 manages the actual execution of the corresponding action on the desired workstations; for this purpose, the action plug-in 394 interfaces with the agent 340 running on each relevant workstation (as shown for the same workstation 110 as above in the figure). Moreover, the action plug-ins 394 may also include modules adapted to perform user notifications (for example, by e-mail).
Moving to FIGs.4A-4B, the logic flow of an exemplary process that can be implemented in the above-described system to schedule the execution of jobs is represented with a method 400. The method begins at the black start circle 403 in the swim-lane of the server. When a new plan is created at block 406, the process passes to block 409; in this phase, the definition of the plan (including the specification of the flow of execution of the jobs and the definition of the workstations required for their execution) is generated and then stored into the control file.
The flow of activity passes to block 412 when the monitor detects any change in the rules (stored in the corresponding repository). In response thereto, at block 415 the plan is regenerated and replaced in the control file, so as to add the definition of the workstations where the events are to be detected and the corresponding actions are to be executed.
A loop is then performed for processing the rules that have been changed; the loop begins at block 418, wherein every changed rule is identified (starting from the first one). Proceeding to block 421, the event plug-in associated with the event specified in the (current) changed rule is extracted from the event plug-in database. With reference now to block 424, this event plug-in is invoked (by passing an indication of the event to be detected); in this way, the configuration file of the event plug-in is generated (with the corresponding configuration parameters properly set so as to enable the event plug-in to detect the desired event). The workstation wherein the event indicated in the rule is to be detected is identified at block 430. Continuing to block 433, the configuration file so obtained is added to the configuration archive of this workstation. A test is then made at block 436 to determine whether any further changed rules remain to be processed. If so, the method returns to block 418 to repeat the same operations described above for the next changed rule.
Conversely, once all the changed rules have been processed, a further loop is entered for processing the new configuration archives obtained above; the loop begins at block 439, wherein the (new) CRC of every (new) configuration archive - starting from the first one - is calculated. Proceeding to block 442, the new CRC is transmitted to the corresponding workstation. In response thereto, this workstation at block 445 calculates the (old) CRC of the configuration files that are currently installed thereon; the new CRC is then compared with the old CRC. The flow of activity branches at block 448 according to the result of the comparison. If the new CRC differs from the old CRC, the workstation at block 451 requests the new configuration archive from the server. Returning to the swim-lane of the server, the required new configuration archive is transmitted to the workstation at block 454. Once the new configuration archive has been received by the workstation at block 457, its configuration files are extracted and installed onto the workstation. The method then descends into block 460 in the swim-lane of the server; the same point is also reached directly from block 448 when the new CRC is equal to the old CRC. At this point, a test is made to determine whether all the new configuration archives have been processed. If not, the method returns to block 439 to repeat the same operations described above for another new configuration archive.
On the contrary, the flow of activity descends into block 463 when the plan is started at the beginning of the production period. As soon as every job of the plan can be executed
(according to its temporal constraints and dependencies), the method passes from block 466 to block 469; in this phase, the job is submitted for execution on a selected workstation (among the available ones having the required characteristics). In response thereto, the job is executed on the (selected) workstation at block 472 - for the sake of simplicity, represented with the same one as above. Continuing to block 475, the workstation returns feedback information (relating to the result of the execution of the job) to the server. Moving to the swim-lane of the server at block 478, the feedback information is entered into the control file.
In a completely asynchronous manner, the flow of activity passes to block 481 whenever a generic (event) workstation - for the sake of simplicity, represented with the same one as above - detects one of the events indicated in the configuration files of its event plug-ins. In response thereto, the workstation notifies the event to the server at block 484. Moving now to block 485 in the swim-lane of the server, any actions to be executed in response to this event (together with the corresponding workstations) are determined according to the rules extracted from the rule repository. For this purpose, the event correlator may simply evaluate the rules (each one defining the execution of an action in response to an event); moreover, the event correlator may also evaluate relationships among the rules (for example, defining the execution of an action in response to the detection of different events). Continuing to block 487, the server submits the execution of each action on the corresponding workstation (for the sake of simplicity, represented with the same one as above); for this purpose, the handler invokes the corresponding action plug-in (as indicated in the action plug-in database). At the same time, the server may also send a corresponding notification (for example, with an e-mail to the user of the workstation 110).
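In its simplest form, the correlation performed by the event correlator may be sketched as follows (illustrative Python only, with rules modeled as (event, event workstation, action, action workstation) tuples; relationships among rules are omitted for brevity):

```python
def correlate(event, event_ws, rules):
    """Return the actions (with their target workstations) that the
    rules associate with an event notified by a given workstation.
    Rules are modeled as (event, event workstation, action, action
    workstation) tuples; all names are illustrative."""
    return [(action, action_ws)
            for ev, ev_ws, action, action_ws in rules
            if ev == event and ev_ws == event_ws]
```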
In response thereto, the action is executed on the workstation at block 490 (by means of the execution agent or the corresponding action plug-in). Continuing to block 493, as above the workstation returns feedback information (relating to the result of the execution of the action) to the server. Moving to the swim-lane of the server at block 496, the feedback information is entered into the control file as above. The flow of activity then ends at the concentric white/black stop circles 499.
Naturally, in order to satisfy local and specific requirements, a person skilled in the art may apply to the solution described above many logical and/or physical modifications and alterations. More specifically, although the present invention has been described with a certain degree of particularity with reference to preferred embodiment(s) thereof, it should be understood that various omissions, substitutions and changes in the form and details as well as other embodiments are possible. Particularly, the proposed solution may even be practiced without the specific details (such as the numerical examples) set forth in the preceding description to provide a more thorough understanding thereof; conversely, well-known features may have been omitted or simplified in order not to obscure the description with unnecessary particulars. Moreover, it is expressly intended that specific elements and/or method steps described in connection with any disclosed embodiment of the invention may be incorporated in any other embodiment as a matter of general design choice. Particularly, the proposed solution lends itself to be implemented with an equivalent method (by using similar steps, removing some steps being non-essential, or adding further optional steps); moreover, the steps may be performed in a different order, concurrently or in an interleaved way (at least in part).
Moreover, the same solution may be applied to any other workload scheduler (or equivalent application). Particularly, even though in the preceding description reference has been made to non-interactive jobs, this is not to be intended as a limitation; indeed, the same solution may be used to schedule the execution of any kind of work units (for example, interactive tasks). Likewise, the plan may be defined and/or generated in a different way - for example, based on any additional or alternative temporal constraints or dependencies (even based on dynamic relationships among the workstations); in addition, any other criteria may be used for selecting the workstations for the submission of the jobs (for example, according to statistical methods for distributing the execution of the jobs uniformly).
The proposed solution may be implemented with any other type of rules (or policies) for defining actions to be executed in response to corresponding events; likewise, the above-described events and actions are merely illustrative, and they are not to be interpreted in a limiting manner. For example, (basic) rules may be combined into (complex) rules with any logical operator (such as OR, AND, and the like), so as to define the execution of actions in response to any combination of events (even on different workstations); likewise, the rules may define the execution of (complex) actions consisting of multiple (basic) actions, even on (complex) entities each consisting of multiple (basic) workstations - i.e., by aggregating more rules based on the same event. Alternatively, the events may consist of the outcome of other rules; moreover, the actions may also be conditioned by temporal constraints and/or dependencies. Similar considerations apply if the notifications are sent to additional or different users, or if they are made by SMS, and the like. In different embodiments of the invention, the actions may consist only of jobs, of notifications, or of any other predefined type of operations; of course, regenerating the plan to include the information relating to the rules is not strictly necessary.
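By way of non-limiting illustration, the combination of basic rules with logical operators may be sketched as follows (illustrative Python only; the tuple encoding of complex rules is an assumption made for this sketch):

```python
def evaluate(rule, occurred):
    """Evaluate a (possibly complex) rule against the set of events
    detected so far.  A basic rule is simply an event name; complex
    rules combine sub-rules with a logical operator, encoded here
    (illustratively) as ("AND", r1, r2, ...) or ("OR", r1, r2, ...)."""
    if isinstance(rule, str):          # basic rule: a single event
        return rule in occurred
    operator, *subrules = rule
    combine = all if operator == "AND" else any
    return combine(evaluate(sub, occurred) for sub in subrules)
```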
Similar considerations apply if the configuration files (and the configuration archives) are replaced with equivalent structures - for example, simply consisting of commands for forcing the desired behavior of the event plug-ins; likewise, the configuration files may be deployed to the relevant workstations in any other way (for example, by exploiting a software distribution infrastructure).
In a basic implementation of the proposed solution, it is also possible to distribute the configuration files indiscriminately to all the workstations.
Moreover, the CRC may be of another type (for example, a CRC-4), it may be replaced by a simple checksum of the configuration archive, by a hash value, or more generally by any other digest value representing the configuration archive in a far shorter form. However, nothing prevents managing the selective deployment of the configuration archives directly on the server (for example, by maintaining information about the actual status of all the workstations centrally).
A general variant of the proposed solution also allows each (event) workstation to notify each event to the corresponding (action) workstations directly - without passing through the server. For example, this may happen for every event or only when the action is to be executed on the same workstation wherein the corresponding event has been detected. The possibility of forcing the deployment of the desired configuration files on request (even without any monitoring of the rules) is within the scope of the present solution.
In any case, in a simplified implementation it is also possible to regenerate (and deploy) the configuration files for all the workstations at any change of the rules.
Similar considerations apply if the program (which may be used to implement each embodiment of the invention) is structured in a different way, or if additional modules or functions are provided; likewise, the memory structures may be of other types, or may be replaced with equivalent entities (not necessarily consisting of physical storage media). In any case, the program may take any form suitable to be used by or in connection with any data processing system, such as external or resident software, firmware, or microcode (either in object code or in source code - for example, to be compiled or interpreted). Moreover, it is possible to provide the program on any computer-usable medium; the medium can be any element suitable to contain, store, communicate, propagate, or transfer the program. For example, the medium may be of the electronic, magnetic, optical, electromagnetic, infrared, or semiconductor type; examples of such medium are fixed disks (where the program can be pre-loaded), removable disks, tapes, cards, wires, fibers, wireless connections, networks, broadcast waves, and the like. In any case, the solution according to an embodiment of the present invention lends itself to be implemented with a hardware structure (for example, integrated in a chip of semiconductor material), or with a combination of software and hardware. It would be readily apparent that it is also possible to deploy the proposed solution as a service that is accessed through a network (such as the Internet).
The proposed method may be carried out on a system having a different architecture or including equivalent units (for example, based on a local network). Moreover, each computer may include similar elements (such as cache memories temporarily storing the programs or parts thereof to reduce the accesses to the mass memory during execution); in any case, it is possible to replace the computer with any code execution entity (such as a PDA, a mobile phone, and the like), or with a combination thereof (such as a multi-tier server architecture, a grid computing infrastructure, and the like).