US8959525B2  Systems and methods for affinity driven distributed scheduling of parallel computations  Google Patents
 Publication number
 US8959525B2 (application US 12/607,497)
 Authority
 US
 United States
 Prior art keywords
 place
 activities
 computer readable
 scheduling
 places
 Prior art date
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Active, expires
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for program control, e.g. control units
 G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
 G06F9/46—Multiprogramming arrangements
 G06F9/48—Program initiating; Program switching, e.g. by interrupt
 G06F9/4806—Task transfer initiation or dispatching
 G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
 G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multilevel priority queues

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F9/00—Arrangements for program control, e.g. control units
 G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
 G06F9/46—Multiprogramming arrangements
 G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
 G06F9/524—Deadlock detection or avoidance

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F2209/00—Indexing scheme relating to G06F9/00
 G06F2209/48—Indexing scheme relating to G06F9/48
 G06F2209/483—Multiproc
Abstract
Embodiments of the invention provide efficient scheduling of parallel computations for higher productivity and performance. Embodiments of the invention provide various methods effective for affinity driven and distributed scheduling of multiplace parallel computations with physical deadlock freedom.
Description
Languages such as X10, Chapel and Fortress, which are based on a partitioned global address space (PGAS) paradigm, have been designed and implemented as part of the Defense Advanced Research Projects Agency High Productivity Computing Systems (DARPA HPCS) program for higher productivity and performance on manycore and massively parallel platforms. Nonetheless, manycore and massively parallel platforms have significant drawbacks related to scheduling of parallel computations.
Embodiments of the invention provide efficient scheduling of parallel computations for higher productivity and performance. Embodiments of the invention provide various methods effective for affinity driven distributed scheduling of multiplace (“place” is a group of processors with shared memory) parallel computations with physical deadlock freedom. Embodiments of the invention provide an online affinity driven distributed scheduling process for strict place annotated multithreaded computations that assumes unconstrained space. Moreover, embodiments of the invention provide a novel affinity driven online distributed scheduling process assuming bounded space per place.
In summary, one aspect of the invention provides an apparatus comprising: one or more processors; and a computer readable storage medium having computer readable program code embodied therewith and executable by the one or more processors, the computer readable program code comprising: computer readable program code configured to provide online distributed affinity driven scheduling of multiplace computations in a deadlock free manner for one or more places, the one or more places each comprising one or more processors having shared memory.
Another aspect of the invention provides a method comprising: utilizing one or more processors to execute a program of instructions tangibly embodied in a program storage device, the program of instructions comprising: computer readable program code configured to provide online distributed affinity driven scheduling of multiplace computations in a deadlock free manner for one or more places, the one or more places each comprising one or more processors having shared memory.
A further aspect of the invention provides a computer program product comprising: a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising: computer readable program code configured to provide online distributed affinity driven scheduling of multiplace computations in a deadlock free manner for one or more places, the one or more places each comprising one or more processors having shared memory.
For a better understanding of exemplary embodiments of the invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the claimed embodiments of the invention will be pointed out in the appended claims.
It will be readily understood that the components of the embodiments of the invention, as generally described and illustrated in the Figures herein, may be arranged and designed in a wide variety of different configurations in addition to the described exemplary embodiments. Thus, the following more detailed description of the embodiments of the invention, as represented in the Figures, is not intended to limit the scope of the embodiments of the invention, as claimed, but is merely representative of exemplary embodiments of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” (or the like) means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, appearances of the phrases “in one embodiment” or “in an embodiment” or the like in various places throughout this specification are not necessarily all referring to the same embodiment.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the various embodiments of the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The inventors have recognized that with the advent of multicore and manycore architectures, scheduling of parallel programs for higher productivity and performance has become an important problem. Languages such as X10, Chapel and Fortress, which are based on the PGAS paradigm, have been designed and implemented as part of the DARPA HPCS program for higher productivity and performance on manycore and massively parallel platforms. These languages have inbuilt support for initial placement of threads (also referred to as activities) and data structures in the parallel program, and therefore locality comes implicitly with the programs. The runtime system of these languages needs to provide algorithmic online scheduling of parallel computations with medium to fine grained parallelism. For handling large parallel computations, the scheduling algorithm should be designed to work in a distributed fashion on manycore and massively parallel architectures. Further, it should ensure physical deadlock free execution under bounded space. It is assumed that the parallel computation does not have any logical deadlocks due to control, data or synchronization dependencies, so physical deadlocks can only arise due to cyclic dependency on bounded space. This is a very challenging problem since the distributed scheduling algorithm needs to follow affinity and provide efficient space and time complexity along with distributed deadlock freedom.
The description now turns to the Figures. The illustrated embodiments of the invention will be best understood by reference to the Figures. The following description is intended only by way of example and simply illustrates certain selected exemplary embodiments of the invention as claimed herein.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, apparatuses, methods and computer program products according to various embodiments of the invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The two affinity driven distributed scheduling problems addressed herein are as follows. Given: (a) An input computation DAG (FIG. 2(A)) that represents a parallel multithreaded computation with fine to medium grained parallelism. Each node in the DAG is a basic operation such as and/or/add etc. and is annotated with a place identifier which denotes where that node should be executed. The edges in the DAG represent (i) spawn of a new thread, (ii) sequential flow of execution, or (iii) synchronization dependency between two nodes; (b) A cluster of n SMPs (each SMP, also referred to as a place, has a fixed number (m) of processors and memory) as the target architecture on which to schedule the computation DAG. To Solve: For both problems, one needs to generate a schedule for the nodes of the computation DAG in an online and distributed fashion that ensures exact mapping of nodes onto places as specified in the input DAG. Specifically, for the first problem it is assumed that the input is a strict computation DAG (synchronization dependency edges in the input DAG go only between a thread and its ancestor thread) and there is unconstrained space per place. Here, one needs to generate an online schedule for the nodes in the computation DAG while minimizing the time and message complexity. For the second problem, it is assumed that the input is a terminally strict parallel computation DAG (a synchronization dependency edge represents an activity waiting for the completion of a descendant activity) and the space per place is bounded. Here, the aim is to generate an online schedule that ensures physical deadlock free execution while keeping low time and message complexity for execution.
Thus, consistent with various embodiments of the invention, herein are presented affinity driven distributed scheduling processes and proven space, time and message bounds while guaranteeing deadlock free execution. The processes assume initial placement of annotations on the given parallel computation with consideration of load balance across the places. The processes control the online expansion of the computation DAG based on available resources. They use efficient remote spawn and reject handling mechanisms across places for ensuring affinity. Randomized work stealing within a place helps load balancing. The distributed scheduling process for bounded space carefully manages space for execution in a distributed fashion using estimated computation depth based ordering of threads/activities. The distributed deadlock avoidance strategy ensures deadlock free execution of the parallel computation. These processes can be easily extended to variable number of processors per place and also to mapping of multiple logical places in the program to the same physical place, provided the physical place has sufficient resources.
Herein are proposed novel affinity driven distributed scheduling processes for both unconstrained and bounded space per place. The bounded space process is designed for terminally strict multiplace computations and ensures physical deadlock free execution using a novel distributed deadlock avoidance strategy. Presented herein is a space bound and deadlock freedom proof for this process.
It is shown herein that for the unconstrained space process, the lower bound on the expected execution time is O(max_{k} T_{1}^{k}/m + T_{∞,n}) and the upper bound is O(Σ_{k}(T_{1}^{k}/m + T_{∞}^{k})); where k is a variable that denotes places from 1 to n, m denotes the number of processors per place, T_{1}^{k} denotes the execution time for place k using a single processor, and T_{∞,n} denotes the execution time of the computation on n places with infinite processors on each place. Expected and probabilistic lower and upper bounds for message complexity are also discussed herein.
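As a rough illustration of how the two expressions above are evaluated, the following Python sketch computes them for given per-place times, ignoring the constant factors hidden by the O-notation. The function name and parameter names are hypothetical, and treating T_{∞}^{k} as a per-place infinite-processor time supplied by the caller is an assumption of this sketch, since the text does not define that symbol.

```python
def unconstrained_time_bounds(t1_per_place, t_inf_per_place, t_inf_n, m):
    """Evaluate the bound expressions (constant factors dropped).

    t1_per_place[k]    : T_1^k, single-processor time for place k
    t_inf_per_place[k] : T_inf^k (assumed per-place infinite-processor time)
    t_inf_n            : T_{inf,n}, time on n places with infinite processors
    m                  : processors per place
    """
    if m <= 0:
        raise ValueError("m must be positive")
    if len(t1_per_place) != len(t_inf_per_place):
        raise ValueError("one T_1^k and one T_inf^k value per place expected")
    # Lower bound expression: max_k T_1^k / m + T_{inf,n}
    lower = max(t1 / m for t1 in t1_per_place) + t_inf_n
    # Upper bound expression: sum_k (T_1^k / m + T_inf^k)
    upper = sum(t1 / m + t_inf
                for t1, t_inf in zip(t1_per_place, t_inf_per_place))
    return lower, upper
```

For example, with two places, T_1 values of 80 and 40, per-place T_∞ values of 5 and 3, T_{∞,n} = 6 and m = 4, the expressions evaluate to 26 and 38 time units respectively.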
Scheduling of dynamically created tasks for shared memory multiprocessors has been a well-studied problem. Previous work promoted the strategy of randomized work stealing. Here, a processor that has no work (thief) randomly steals work from another processor (victim) in the system. Other work demonstrated efficient bounds on space (O(P·S_{1})) and time (O(T_{1}/P+T_{∞})) for scheduling of fully-strict computations in an SMP platform; where P is the number of processors, T_{1} and S_{1} are the time and space for sequential execution respectively, and T_{∞} is the execution time on infinite processors. Subsequently, the importance of data locality for scheduling threads motivated work stealing with data locality, wherein the data locality was discovered on the fly and maintained as the computation progressed. This work also explored initial placement for scheduling and provided experimental results to show the usefulness of the approach; however, affinity was not always followed, the scope of the algorithm was limited to SMP environments and its time complexity was not analyzed. Other work did analyze time complexity (O(T_{1}/P+T_{∞})) for scheduling general parallel computations on an SMP platform but did not consider space or message complexity bounds. Herein, embodiments of the invention consider distributed scheduling problems across multiple places (cluster of SMPs) while ensuring affinity and also providing time and message bounds.
Other prior work considers work-stealing algorithms in a distributed-memory environment, with adaptive parallelism and fault-tolerance. Here task migration was entirely pull-based (via a randomized work stealing algorithm); hence, it ignored affinity and also did not provide any formal proof for the deadlock-freedom or resource utilization properties. Prior work also described a multiplace (distributed) deployment for parallel computations for which an initial placement based scheduling strategy is appropriate. A multiplace deployment has multiple places connected by an interconnection network where each place has multiple processors connected as in an SMP platform. This work showed that online greedy scheduling of multithreaded computations may lead to physical deadlock in the presence of bounded space and communication resources per place. Bounded resources (space or communication) can lead to cyclic dependency amongst the places, which can in turn lead to physical deadlock. Prior work also provided a scheduling strategy based on initial placement and proved space bounds for physical deadlock free execution of terminally strict computations by resorting to a degenerate mode called Doppelganger mode. The computation did not respect affinity in this mode and no time or communication bounds were provided. Also, the aspect of load balancing was not addressed. Embodiments of the invention ensure affinity while guaranteeing deadlock free distributed scheduling in a multiplace setup. Scheduling of hybrid parallel computations, where some activities in the computation have place affinity while other activities in the computation can be executed on any place, has been considered. This work has a specific focus on prioritized random work stealing across places and it leverages the detailed results on deadlock freedom for the bounded space algorithm and the time and message complexity for the unconstrained space algorithm presented herein.
A tabular comparison of features between the processes according to embodiments of the invention and previous work is presented herein.
System and Computation Model
According to embodiments of the invention, the system on which the computation DAG is scheduled is assumed to be a cluster of SMPs connected by an Active Message Network. Each SMP is a group of processors with shared memory. Each SMP is also referred to as a place herein. Active Messages (AM) is a low-level lightweight RPC (remote procedure call) mechanism that supports unordered, reliable delivery of matched request/reply messages. It is assumed that there are n places and each place has m processors (also referred to as workers herein).
The parallel computation, to be dynamically scheduled on the system, is assumed to be specified by the programmer in languages such as X10 and Chapel. To describe the distributed scheduling processes consistent with exemplary embodiments of the invention, it is assumed that the parallel computation has a DAG (directed acyclic graph) structure and consists of nodes that represent basic operations like and, or, not, add and others. There are edges between the nodes in the computation DAG (FIG. 2(A)) that represent creation of new activities (spawn edge), sequential execution flow between nodes within a thread/activity (continue edge) and synchronization dependencies (dependence edge) between the nodes. Herein the parallel computation to be scheduled is referred to as the computation DAG. At a higher level the parallel computation can also be viewed as a computation tree of activities. Each activity is a thread (as in multithreaded programs) of execution and consists of a set of nodes (basic operations). Each activity is assigned to a specific place (affinity as specified by the programmer). Hence, such a computation is called a multiplace computation and the DAG is referred to as a place-annotated computation DAG (FIG. 2(A): v1 . . . v20 denote nodes, T1 . . . T6 denote activities and P1 . . . P3 denote places). The types of parallel computations based on the nature of dependencies in the computation DAG and the notations used are described in FIG. 1(A-B).
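One possible in-memory representation of such a place-annotated computation is sketched below in Python. All class, field and function names are hypothetical and chosen only for illustration; the small tree used in the usage note is likewise made up and does not reproduce FIG. 2(A). The root activity is taken to be at depth 1.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class EdgeKind(Enum):
    SPAWN = "spawn"            # creation of a new activity
    CONTINUE = "continue"      # sequential flow within one activity
    DEPENDENCE = "dependence"  # synchronization dependency


@dataclass
class Edge:
    src: str  # node identifier, e.g. "v1"
    dst: str
    kind: EdgeKind


@dataclass
class Activity:
    name: str
    place: int                           # affinity given by the programmer
    parent: Optional["Activity"] = None  # None for the root activity
    nodes: List[str] = field(default_factory=list)

    def depth(self) -> int:
        """Depth in the computation tree; the root is at depth 1."""
        d, cur = 1, self.parent
        while cur is not None:
            d += 1
            cur = cur.parent
        return d


def is_ancestor(ancestor: Activity, activity: Activity) -> bool:
    """True if `ancestor` lies on the parent chain of `activity`."""
    cur = activity.parent
    while cur is not None:
        if cur is ancestor:
            return True
        cur = cur.parent
    return False
```

With a root `t1 = Activity("T1", place=1)`, a child `t2 = Activity("T2", place=2, parent=t1)` and a grandchild `t3 = Activity("T3", place=3, parent=t2)`, `t3.depth()` is 3 and `is_ancestor(t1, t3)` holds, which is the relation a dependence edge must respect in a strict computation.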
Distributed Scheduling in Unconstrained Space
Herein is presented a description of affinity driven distributed scheduling in unconstrained space consistent with embodiments of the invention. Consider a strict place-annotated computation DAG. The distributed scheduling process described below schedules activities with affinity only at their respective places. Within a place, work stealing is enabled to allow load-balanced execution of the computation subgraph associated with that place. The computation DAG unfolds in an online fashion in a breadth-first manner across places when the affinity driven activities are pushed onto their respective remote places. Within a place, the online unfolding of the computation DAG happens in a depth-first manner to enable efficient space and time execution. Since sufficient space is guaranteed to exist at each place, physical deadlocks due to lack of space cannot happen in this process.
Each place maintains a Fresh Activity Buffer (FAB), which is managed by a dedicated processor (different from the workers) at that place. Each worker at a place has a Ready Deque and a Stall Buffer (refer to FIG. 2(B)). The FAB at each place, as well as the Ready Deque at each worker, uses a concurrent deque implementation. An activity that has affinity for a remote place is pushed into the FAB at that place. An idle worker at a place will attempt to randomly steal work from other workers at the same place (randomized work stealing). Note that an activity which is pushed onto a place can move between workers at that place (due to work stealing) but cannot move to another place, and thus obeys affinity at all times. An exemplary distributed scheduling process is illustrated in FIG. 3.
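The per-place structures described above can be sketched as a single-threaded Python simulation. This is not the process of FIG. 3, which is not reproduced here: the class and method names are hypothetical, plain `collections.deque` objects stand in for concurrent deques, and the order in which an idle worker tries stealing and then the FAB is an assumption of this sketch.

```python
import random
from collections import deque


class Worker:
    def __init__(self):
        self.ready = deque()    # Ready Deque: the owner works at the right end
        self.stall_buffer = []  # Stall Buffer (unused in this small sketch)


class Place:
    def __init__(self, place_id, num_workers, rng=None):
        self.place_id = place_id
        self.fab = deque()      # Fresh Activity Buffer for remote spawns
        self.workers = [Worker() for _ in range(num_workers)]
        self.rng = rng or random.Random(0)

    def accept_remote_spawn(self, activity):
        """An activity with affinity for this place is pushed into its FAB."""
        self.fab.append(activity)

    def next_activity(self, worker_index):
        """Pick work for one worker without ever leaving this place."""
        me = self.workers[worker_index]
        if me.ready:
            return me.ready.pop()  # own work, newest first
        victims = [w for i, w in enumerate(self.workers)
                   if i != worker_index and w.ready]
        if victims:
            # Randomized work stealing: take the oldest item of a random victim.
            return self.rng.choice(victims).ready.popleft()
        if self.fab:
            return self.fab.popleft()
        return None
```

Because `next_activity` only ever reads deques owned by the same `Place`, an activity can migrate between workers of a place but never between places, which mirrors the affinity guarantee stated above.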
Distributed Scheduling in Bounded Space
Due to limited space on real systems, the distributed scheduling process has to limit online breadth-first expansion of the computation DAG while minimizing the impact on execution time and simultaneously providing a deadlock freedom guarantee. This process uses a distributed deadlock avoidance scheme. Due to space constraints at each place in the system, activities can be stalled due to lack of space. The process keeps track of the stack space available on the system and that required by activities for execution (heap space is not considered for simplicity). The space required by an activity u is bounded by the maximum stack space needed for its execution, that is ((D_{max}−D_{u})·S_{max}), where D_{max} is the maximum activity depth in the computation tree and D_{u} is the depth of u in the computation tree. The process follows depth based ordering of computations for execution by allowing the activities with higher depth on a path to execute to completion before the activities with lower depth on the same path. This happens in a distributed fashion. Both during work-pushing and intra-place work stealing, each place and worker checks for availability of stack space for execution of the activity. Due to depth based ordering, only a bounded number of paths in the computation tree are expanded at any point of time. This bound is based on the available space in the system. Using this distributed deadlock avoidance scheme, the system always has space to guarantee the execution of a certain number of paths, which can vary during the execution of the computation DAG.
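The stack-space bound and the availability check can be illustrated with a few lines of Python. The function names are hypothetical, S_{max} is taken to be the per-activity stack frame bound, and using a non-strict comparison in the check is an assumption of this sketch.

```python
def required_stack_space(depth, d_max, s_max):
    """Upper bound (D_max - D_u) * S_max on stack space for an activity."""
    if not 1 <= depth <= d_max:
        raise ValueError("depth must lie in [1, d_max]")
    return (d_max - depth) * s_max


def has_space_for(depth, d_max, s_max, available):
    """Check made before accepting or picking up an activity."""
    return required_stack_space(depth, d_max, s_max) <= available
```

For D_{max} = 10 and S_{max} = 64 bytes, an activity at depth 3 is bounded by (10 − 3)·64 = 448 bytes, while an activity at depth 10 needs no further reservation, so deeper activities are always easier to admit than shallower ones.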
To provide good time and message bounds, the distributed deadlock avoidance scheme is designed to have low communication cost while simultaneously exposing the maximal concurrency inherent in the place-annotated computation DAG. This scheme ensures deadlock free execution for terminally strict multiplace computations. When an activity is stalled due to lack of space at a worker, it moves into local-stalled state. When an activity is stalled as it cannot be spawned onto a remote place, it moves into remote-stalled state. When an activity is stalled due to synchronization dependencies, it moves into depend-stalled state.
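The three stalled states and the conditions that lead to them can be captured in a small Python mapping. The enum and function names below are hypothetical and serve only to make the case distinction explicit.

```python
from enum import Enum


class StallCause(Enum):
    NO_SPACE_AT_WORKER = "lack of space at a worker"
    REMOTE_SPAWN_REJECTED = "cannot be spawned onto a remote place"
    SYNC_DEPENDENCY = "waiting on a synchronization dependency"


class StallState(Enum):
    LOCAL_STALLED = "local-stalled"
    REMOTE_STALLED = "remote-stalled"
    DEPEND_STALLED = "depend-stalled"


# One stalled state per cause, as described in the text.
_STATE_FOR_CAUSE = {
    StallCause.NO_SPACE_AT_WORKER: StallState.LOCAL_STALLED,
    StallCause.REMOTE_SPAWN_REJECTED: StallState.REMOTE_STALLED,
    StallCause.SYNC_DEPENDENCY: StallState.DEPEND_STALLED,
}


def stall_state_for(cause: StallCause) -> StallState:
    return _STATE_FOR_CAUSE[cause]
```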
It is assumed that the maximum depth of the computation tree (in terms of number of activities), D_{max}, can be estimated fairly accurately prior to the execution from the parameters used in the input parallel computation. The D_{max} value is used in a distributed scheduling process to ensure physical deadlock free execution. The assumption on knowledge of D_{max} prior to execution holds true for the kernels and large applications of the Java® Grande Benchmark suite. The D_{max} for kernels including LUFact (LU factorization), Sparse (Sparse Matrix multiplication) and SOR (successive over relaxation for solving finite difference equations) can be exactly found from the dimension of the input matrix and/or the number of iterations. For kernels such as Crypt (International Data Encryption Algorithm) and Series (Fourier coefficient analysis) the D_{max} again is well defined from the input array size. The same holds for applications such as Molecular Dynamics, Monte Carlo Simulation and 3D Ray Tracer. Also, for graph kernels in the SSCA#2 benchmark, D_{max} can be known by estimating Δ_{g} (diameter) of the input graph (for example, O(polylog(n)) for R-MAT graphs, O(√n) for DIMACS graphs).
Distributed Data Structures & Process Design
The distributed data structures for a bounded space process according to embodiments of the invention are given in FIG. 4. FIG. 5(A-B) illustrates distributed data structures for bounded space scheduling and the Remote Spawn and Empty Deque cases according to an embodiment of the invention.
Let AMRejectMap(i,r), PrQ(i,r) and StallBuffer(i,r) denote the AMRejectMap, PrQ and StallBuffer, respectively, for worker W_{i}^{r} at place P_{i}. Let B_{i}^{r} denote the combined space for the PrQ(i,r) and StallBuffer(i,r). Let FAB(i) and WorkRejectMap(i) denote the FAB and WorkRejectMap, respectively, at place P_{i}. Let F_{i} denote the current space available in FAB(i). Let AM(T) denote the active message for spawning the activity T. The activities in remote-stalled state are tracked using a linked list of activity IDs, with the head and tail of the list stored in the tuple corresponding to the place in the map AMRejectMap.
Computation starts with the root (depth 1) of the computation DAG at a worker W_{0}^{s}, at the default place P_{0}. At any point of time a worker at a place, W_{i}^{r}, can either be executing an activity, T, or be idle. The detailed process is presented in FIG. 6. Some cases of the process are described here. When T needs to attempt a remote spawn of an activity U at place P_{j} (Remote Spawn case; refer to FIG. 5(B)), it first checks if there are already stalled activities in AMRejectMap(i,r). If there is already a stalled activity, then T is added to the StallBuffer(i,r) and the link from the current tail in the tuple corresponding to P_{j} in AMRejectMap(i,r) is set to T. Also, the tail of the tuple is set to T.
If there is no stalled activity in AMRejectMap(i,r) for place P_{j}, then the worker attempts a remote spawn at place P_{j}. At P_{j}, a check is performed by the dedicated processor for space availability in the FAB(j). If it has enough space, then the active message, AM(U), is stored in the remote FAB(j), the available space in FAB(j) is updated and T continues execution. If there is not enough space, then AMRejectMap(i,r) is updated accordingly and T is put in the StallBuffer(i,r).
When the worker W_{i}^{r} receives notification (Receives Notification case) of available space from place P_{j}, it gets the tuple for P_{j} from AMRejectMap(i,r) and sends the active message and the head activity to P_{j}. At P_{j}, the WorkRejectMap(j) is updated. Also, W_{i}^{r} updates the tuple for P_{j} by updating the links for the linked list in that tuple. The remote-stalled activity is enabled and put in PrQ(i,r) (Activity Enabled case).
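The Remote Spawn and Receives Notification cases can be sketched as follows in Python. This is a simplified illustration and not the process of FIG. 6: class and method names are hypothetical, a `deque` per destination place stands in for the head/tail linked list kept in AMRejectMap, the FAB reserves a flat S_{max} per accepted message, and WorkRejectMap bookkeeping at the destination is omitted.

```python
from collections import deque


class Fab:
    """FAB(j) at a destination place with its free-space counter F_j."""

    def __init__(self, free_space):
        self.free_space = free_space
        self.messages = []

    def try_accept(self, message, s_max):
        if self.free_space < s_max:
            return False           # spawn rejected for lack of space
        self.free_space -= s_max
        self.messages.append(message)
        return True


class Worker:
    def __init__(self):
        self.am_reject_map = {}    # place id -> FIFO of remote-stalled activities
        self.stall_buffer = {}     # stalled activity -> its blocked active message
        self.prq = []              # enabled activities waiting to run

    def remote_spawn(self, activity, message, place_id, fab, s_max):
        """Return True if `activity` may continue, False if it stalls."""
        stalled = self.am_reject_map.get(place_id)
        # Only try the spawn when nothing is already stalled on this place.
        if not stalled and fab.try_accept(message, s_max):
            return True
        self.am_reject_map.setdefault(place_id, deque()).append(activity)
        self.stall_buffer[activity] = message
        return False

    def on_space_available(self, place_id, fab, s_max):
        """Resend the head activity's message and enable that activity."""
        stalled = self.am_reject_map.get(place_id)
        if not stalled:
            return None
        head = stalled[0]
        if not fab.try_accept(self.stall_buffer[head], s_max):
            return None
        stalled.popleft()
        del self.stall_buffer[head]
        self.prq.append(head)      # Activity Enabled case
        return head
```

Queuing a new activity behind already stalled ones, even when the destination FAB happens to have space at that moment, keeps the stalled spawns for a given place in first-in, first-out order.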
Space Bound and Deadlock Freedom Proof
Herein are stated the lemmas and a sketch of the proof of the theorems (refer to Appendix C.2 for details). Since stack space for execution is considered in the space constraint, the depth of an activity in the computation tree is used in the lemmas/proofs below. An activity at depth d requires less than ((D_{max}−d)·S_{max}) amount of stack space for execution, since it can generate a maximum of (D_{max}−d) stalled activities along one execution path and each stack frame is bounded by S_{max} bytes. During the process, this stack space ((D_{max}−d)·S_{max}) is checked before picking the activity for execution (Empty Deque case) or placing a remote active message in the FAB (Remote Spawn case). S_{max} space is reserved in the FAB when that active message is accepted and S_{max} space is released from the FAB when that active message is picked up by an idle worker for execution. S_{max} space is taken away from B_{i}^{r} when an activity gets stalled (Activity Stalled case), while S_{max} is added to B_{i}^{r} when that activity is picked up for execution (Empty Deque case).
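The reserve and release events described above amount to simple counter updates, which the following Python sketch makes concrete. The class name and its methods are hypothetical; one instance can model the free space F_{i} of a FAB and another the space B_{i}^{r} of a worker.

```python
class SpaceBudget:
    """A counter of free stack space, in bytes."""

    def __init__(self, free):
        if free < 0:
            raise ValueError("free space cannot be negative")
        self.free = free

    def take(self, amount):
        """Reserve `amount` bytes; return False if not enough is free."""
        if amount > self.free:
            return False
        self.free -= amount
        return True

    def give(self, amount):
        """Return `amount` bytes to the budget."""
        self.free += amount
```

With S_{max} = 64, a FAB budget of 128 bytes accepts two active messages (`take(64)` twice), rejects a third, and can accept again once an idle worker picks one message up (`give(64)`). The same two operations model an activity being stalled at a worker and later picked up for execution.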
Lemma 1 A place or a worker that accepts an activity with depth d′ has space to execute activities of depth greater than or equal to d′+1.
Lemma 2 There is always space to execute activities at depth D_{max}.
Lemma 3 At any point of time (before termination of complete computation tree execution) at least one path in the computation tree is guaranteed to execute.
Proof: The depth based ordering property (valid during scheduling) is used. Let the maximum depth of an activity that a place P_{1} is executing be d_{1}. Then the place is guaranteed to execute/accept an activity of depth d_{2} such that d_{2}>d_{1}, by Lemma 1. Therefore, if this activity of depth d_{1} wants to create a child locally (Local Spawn case), it can do so without any trouble (lemma holds true). Otherwise, suppose that it wants to create a child at remote place P_{2} and that place rejects the spawn (Remote Spawn and Activity Stalled case). Now, there are two cases. In the first case, P_{2} has an active executing path, possibly not having reached depth d_{1}, that is not stalled (lemma holds true). In the second case, P_{2} is either executing an activity (at a worker at that place) of depth at least d_{1}+1 (lemma holds true) or has such an activity in stalled state. If this stalled state is depend-stalled state, then an activity of even higher depth is executing at this or another place (lemma holds true). If this stalled state is local-stalled state, then there must be another activity of higher depth executing at that worker (lemma holds true). However, if the stalled state is remote-stalled state, then the same argument is applied to the remote place on which this activity is waiting, and a monotonically increasing depth of activities can be seen in this resource dependency chain. Following this chain, one eventually either hits an executing path due to the cases discussed here or reaches a leaf in the computation tree, which can execute without dependencies (Lemma 2). Hence, it can be seen that there exists a path across places that belongs to the computation tree such that it is actively executing. Hence, at each instant of time there exists a path that is guaranteed to execute in the system. In fact, there can be multiple paths that are executing at any instant of time, and this depends on the available space in the system and the computation tree.
Theorem 1 (Assured Leaf Execution) The scheduling maintains assured leaf execution property during computation. Assured leaf execution ensures that each node in computation tree becomes a leaf and gets executed.
Proof: Proof is given herein by induction on depth of an activity in the computation tree.
Base case (depth of an activity is D_{max}):
By lemma 3, a path to a leaf is guaranteed. An activity at depth D_{max }is always a leaf and has no dependencies on other activities. Thus, an activity that occurs at D_{max }will always get executed (by lemma 2).
Induction Hypothesis: Assume that all activities at depth d and higher are assured to become leaves and get executed.
Induction Step: It needs to be shown that all activities of depth d−1 are assured to become leaves and get executed. By the induction hypothesis, the activities of depth d and higher have terminated. As in the Termination case, if there are remaining activities in the Deque, then they are at depth (d−1), become leaves and are picked up for execution. Otherwise, if the Deque becomes empty (Empty Deque case), the highest depth activities are picked for execution from both the PrQ and the FAB. Therefore, the activities at depth (d−1) start execution. Further, the dependencies in the computation tree are from descendants to ancestors (terminally-strict computation). Therefore, when activities of depth d or higher finish execution, the activities at depth (d−1) in the depend-stalled or remote-stalled state definitely become leaves and get enabled. Hence, they are put into the PrQ at the respective workers (Activity Enabled case). If the activity at depth (d−1) was in the remote-stalled state, the blocked active message is sent to the remote place (Receives Notification case) for the spawn of the child activity at depth d. By the induction hypothesis, all activities at depth d have terminated, so this has already happened earlier. Upon termination of a depth d activity, assume the Deque is not empty and there are activities in the PrQ of depth (d−1). These activities wait until the current executing path in the Deque terminates. Then, these activities, which have become leaves, get picked up for execution (since they have the highest depth and hence the highest priority in the PrQ). Hence, all activities at depth (d−1) are also guaranteed to become leaves and get executed.
Theorem 2. A terminally strict computation scheduled using process in FIG. 6 takes O(m·(D_{max}·S_{max}+n·S_{max}+S_{1})) bytes as space per place.
Proof Sketch: The PrQ, StallBuffer, AMRejectMap and deque per worker (processor) take total of O(m·(D_{max}·S_{max}+n·S_{max}+S_{1})) bytes per place. The WorkRejectMap and FAB take total O(m·n+D_{max}) and O(D_{max}·S_{max}) space per place (discussed previously herein). The scheduling strategy adopts a space conservation policy to ensure deadlock free execution in bounded space. The basic aim of this strategy is to ensure that only as much breadth of a tree is explored as can be accommodated in the available space assuming each path can go to the maximum depth of D_{max}.
It starts with the initial condition where the available space is at least D_{max}·S_{max} per worker per place. It is ensured that any activity that gets scheduled on a worker does not exceed the available space in the PrQ and StallBuffer at that worker. This holds because only the activities in the Deque can be stalled, and a check is made before execution that enough space is available for the maximum number of stalled activities. For more details, refer to Appendix C.2.
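The space expression of Theorem 2 and the initial condition above can be evaluated numerically. The following is a minimal Python sketch, assuming all constants hidden by the O-notation equal 1; the function and parameter names are hypothetical and not part of the described process, and S_{1} is simply taken as given by the theorem.

```python
def space_bound_per_place(m, n, d_max, s_max, s_1):
    """Evaluate m*(D_max*S_max + n*S_max + S_1), with hidden constants assumed to be 1."""
    return m * (d_max * s_max + n * s_max + s_1)


def meets_initial_condition(available_per_worker, d_max, s_max):
    """Initial condition from the text: at least D_max*S_max bytes available per worker."""
    return available_per_worker >= d_max * s_max


if __name__ == "__main__":
    # Illustrative numbers only: 4 workers per place, 8 places, depth 10.
    print(space_bound_per_place(m=4, n=8, d_max=10, s_max=256, s_1=1024))
    print(meets_initial_condition(4096, d_max=10, s_max=256))
```

Such a calculation gives only an order-of-magnitude estimate, since the theorem bounds the space up to a constant factor.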
Time and Message Complexity Analysis
Herein is presented an overview of the time and message complexity analysis for both the unconstrained and bounded space distributed scheduling processes. Refer to Appendix (A and B) for details. The analysis is based on the number of throws by workers during execution. Each throw represents an attempt by a worker (processor that has no work) to steal an activity from either another worker (victim) or FAB at the same place.
Lemma 2.1. Consider a strict place-annotated computation DAG with work per place, T_{1}^{k}, being executed by the unconstrained space scheduling process (FIG. 3). Then, the execution (finish) time for place k is O(T_{1}^{k}/m+Q_{r}^{k}/m+Q_{e}^{k}/m), where Q_{r}^{k} denotes the number of throws when there is at least one ready node at place k and Q_{e}^{k} denotes the number of throws when there are no ready nodes at place k. The lower bound on the execution time of the full computation is O(max_{k}(T_{1}^{k}/m+Q_{r}^{k}/m)) and the upper bound is O(Σ_{k}(T_{1}^{k}/m+Q_{r}^{k}/m)).
Proof Sketch: (Token based counting argument) Consider three buckets at each place in which tokens are placed: a work bucket, where a token is placed when a worker at the place executes a node of the computation DAG; a ready-node-throw bucket, where a token is placed when a worker attempts to steal and there is at least one ready node at the place; and a null-node-throw bucket, where a token is placed when a worker attempts to steal and there are no ready nodes at the place (this models wait time when there is no work at a place). The total finish time of a place can be computed by counting the tokens in these three buckets and by considering load balanced execution within a place using randomized work stealing. The upper and lower bounds on the execution time arise from the structure of the computation DAG and the structure of the online schedule generated (Appendix A).
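The token counting argument can be illustrated with a short Python sketch. It assumes one token corresponds to one unit of time on one of the m workers of a place and ignores the constants hidden by the O-notation; the function names and the sample token counts are hypothetical.

```python
def place_finish_time(work, ready_throws, null_throws, m):
    """Tokens in the three buckets of a place, divided over its m workers."""
    return (work + ready_throws + null_throws) / m


def full_computation_bounds(per_place, m):
    """per_place: list of (work tokens, ready-node-throw tokens) pairs, one per place.

    Returns (lower, upper) in the form of Lemma 2.1: the maximum over places
    and the sum over places of (T_1^k + Q_r^k) / m.
    """
    per_place_time = [(work + ready) / m for work, ready in per_place]
    return max(per_place_time), sum(per_place_time)


if __name__ == "__main__":
    print(place_finish_time(work=100, ready_throws=20, null_throws=8, m=4))
    print(full_computation_bounds([(100, 20), (60, 20)], m=4))
```

The lower bound corresponds to places executing fully in parallel, and the upper bound to places executing one after another.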
Next, the bound on the number of tokens in the ready-node-throw bucket is computed using potential function based analysis. A unique contribution is in proving the lower and upper bounds of time complexity and message complexity for the multi-place distributed scheduling algorithm presented in FIG. 3, which involves both intra-place work stealing and remote place affinity driven work pushing. For potential function based analysis, each ready node u is assigned a potential 3^{2w(u)−1} or 3^{2w(u)} depending upon whether it is assigned for execution or not (w(u)=T_{∞,n}−depth(u)). All non-ready nodes have 0 potential. The total potential of the system at step i is denoted by φ_{i}, and φ_{i}(D_{i}) denotes the potential of all Deques that have some ready nodes. The key idea is to show that the potential φ_{i} monotonically decreases from φ_{0}=3^{2T_{∞,n}−1} (potential of the root node) to 0 (no ready node left) during the execution and that this happens in a bounded number of steps.
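A minimal Python sketch of the potential assignment follows. It assumes the root is at depth 0, so that an assigned root has the potential 3^{2T_{∞,n}−1} quoted above; the function names and the representation of ready nodes as (depth, assigned) pairs are illustrative assumptions.

```python
def node_potential(depth, t_inf, assigned):
    """Potential of a ready node u with w(u) = T_inf - depth(u)."""
    w = t_inf - depth
    exponent = 2 * w - 1 if assigned else 2 * w
    return 3 ** exponent


def total_potential(ready_nodes, t_inf):
    """Sum of potentials over ready nodes given as (depth, assigned) pairs."""
    return sum(node_potential(depth, t_inf, assigned) for depth, assigned in ready_nodes)


if __name__ == "__main__":
    t_inf = 3
    before = total_potential([(0, True)], t_inf)
    # The assigned root executes and enables two children at depth 1:
    # one is assigned for execution, the other stays ready in a Deque.
    after = total_potential([(1, True), (1, False)], t_inf)
    print(before, after)
```

In this small example the total potential drops when the root is executed and replaced by its two children, which is the kind of decrease the analysis relies on.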
Theorem 2.1 Consider a strict place-annotated computation DAG with work per place k, denoted by T_{1}^{k}, being executed by the affinity driven multi-place distributed scheduling process (FIG. 3). The lower bound on the expected execution time is O(max_{k}(T_{1}^{k}/m)+T_{∞,n}). Moreover, for any ε>0, the lower bound on the execution time is O(max_{k}T_{1}^{k}/m+T_{∞,n}+log(1/ε)) with probability at least 1−ε. A similar probabilistic upper bound exists.
Proof Sketch: For the lower bound, the number of throws (when there is at least one ready node at a place) is analyzed by breaking the execution into phases. Each phase has Θ(P) throws, where P=mn (O(m) throws per place). It can be shown that, with constant probability, a phase causes the potential to drop by a constant factor. More precisely, between phases i and i+1, Pr{(φ_{i}−φ_{i+1})≧¼·φ_{i}}>¼ (details in Appendix B). Since the potential starts at φ_{0}=3^{2T_{∞,n}−1}, ends at zero and takes integral values, the number of successful phases is at most (2T_{∞,n}−1) log_{4/3} 3<8T_{∞,n}. Thus, the expected number of throws per place is bounded by O(T_{∞,n}·m), and the number of throws is O(T_{∞,n}·m+log(1/ε)) with probability at least 1−ε (using the Chernoff inequality). Using Lemma 2.1, the lower bound on the expected execution time is O(max_{k}(T_{1}^{k}/m)+T_{∞,n}). The detailed proof and probabilistic bounds are presented in Appendix B.
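The bound on the number of successful phases can be checked numerically. The sketch below assumes that a successful phase leaves at most 3/4 of the potential, which is the reading consistent with the base 4/3 of the logarithm in the proof sketch; the function name is hypothetical.

```python
import math


def max_successful_phases(t_inf):
    """(2*T_inf - 1) * log_{4/3}(3): phases needed to push 3^(2*T_inf - 1) below 1
    when each successful phase multiplies the potential by at most 3/4."""
    return (2 * t_inf - 1) * math.log(3, 4 / 3)


if __name__ == "__main__":
    for t_inf in (1, 10, 100, 1000):
        phases = max_successful_phases(t_inf)
        # The proof sketch uses the simpler upper estimate 8 * T_inf.
        print(t_inf, phases, phases < 8 * t_inf)
```

Because the potential takes integral values, dropping below 1 means reaching zero, which is why the number of successful phases is linear in the critical path length.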
For the upper bound, consider the execution of the subgraph of the computation at each place. The number of throws in the ready-node-throw bucket per place can be similarly bounded by O(T_{∞}^{k}·m). Further, the place that finishes its execution last can end up with a number of tokens in the null-node-throw bucket equal to the tokens in the work and ready-node-throw buckets of all other places.
Hence, the finish time for this place, which is also the execution time of the full computation DAG is O(Σ_{k}(T_{1} ^{k}/m+T_{∞} ^{k})). The probabilistic upper bound can be similarly established using Chernoff Inequality.
Theorem 2.2. Consider the execution of a strict place-annotated computation DAG with critical path-length T_{∞,n} by the Affinity Driven Distributed Scheduling Process (FIG. 3). Then, the total number of bytes communicated across places is O(I·(S_{max}+n_{d})) and the lower bound on the total number of bytes communicated within a place has the expectation O(m·T_{∞,n}·S_{max}·n_{d}), where n_{d} is the maximum number of dependence edges from the descendants to a parent and I is the number of remote spawns from one place to a remote place. Moreover, for any ε>0, the probability is at least 1−ε that the lower bound on the communication overhead per place is O(m·n·(T_{∞}+log(1/ε))·n_{d}·S_{max}). Similar message upper bounds exist.
The communication complexity for inter-place and intra-place communication can be derived by considering remote spawns during execution and throws for work stealing within places, respectively. Detailed proof is given in Appendix C.
The bounded space scheduling process does constant work for handling rejected spawns but incurs an additional log(D_{max}) factor for FAB (concurrent priority queue) operations. Hence, the lower bound on the expected time complexity of the bounded space scheduling process is O(max_{k}(T_{1}^{k}/m)·log(D_{max})+T_{∞,n}). The analysis of the upper bound on time complexity involves modeling resource driven wait time and is not addressed herein. The inter-place message complexity is the same as in Theorem 2.2, as there is a constant amount of work for handling rejected remote spawns and notification of space availability.
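The logarithmic factor comes from priority queue operations on the FAB. The following Python sketch shows a depth-ordered buffer built on a binary heap; it is single-threaded for brevity (the FAB described herein is a concurrent priority queue), and the class and method names are hypothetical.

```python
import heapq
import itertools


class FreshActivityBuffer:
    """Depth-ordered buffer: the activity with the highest depth is removed first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker for activities of equal depth

    def push(self, depth, activity):
        # heapq is a min-heap, so the depth is negated to pop the deepest first.
        heapq.heappush(self._heap, (-depth, next(self._order), activity))

    def pop_deepest(self):
        neg_depth, _, activity = heapq.heappop(self._heap)
        return -neg_depth, activity

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    fab = FreshActivityBuffer()
    for depth, name in [(2, "a"), (5, "b"), (3, "c")]:
        fab.push(depth, name)
    print(fab.pop_deepest())
```

Each push and pop on a heap holding k entries costs O(log k), so a buffer holding on the order of D_{max} activities gives the log(D_{max}) factor mentioned above.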
To contrast the various exemplary embodiments of the invention that have been described herein with prior work, the following brief discussion is presented. Prior work extended a work stealing framework for terminally strict X10 computations and established deadlock free scheduling for SMP deployments. This work proved deadlock free execution with bounded resources on uniprocessor cluster deployments while using the Doppelganger mode of execution. However, this work neither considers work stealing in this framework, nor does it provide performance bounds. The Doppelganger mode of execution can lead to arbitrarily high costs in general. In contrast, embodiments of the invention consider affinity driven scheduling over an SMP cluster deployment using an Active Message network. Further, embodiments of the invention include intra-place and inter-place work stealing and prove space and performance bounds with a deadlock free guarantee.
Other prior work considered nested-parallel computations on multiprocessor HSMSs (hardware-controlled shared memory systems) and proved upper bounds on the number of cache-misses and execution time. It also presents a locality guided work stealing algorithm that leads to costly synchronization for each thread/activity. However, activities may not get executed at the processor for which they have affinity. In contrast, embodiments of the invention consider affinity driven scheduling in a multi-place setup and provide performance bounds under bounded space while guaranteeing deadlock free execution.
Still other prior work provided performance bounds of a non-blocking work stealing algorithm in a multiprogrammed SMP environment, for general multithreaded computations under various kernel schedules using the potential function technique. This approach, however, does not consider locality guided scheduling. In contrast, embodiments of the invention consider affinity driven multi-place work stealing processes for applications running in dedicated mode (stand alone), with deadlock freedom guarantees under bounded resources, and leverage the potential function technique for performance analysis.
Still further prior work introduced a work-dealing technique that attempts to achieve “locality oriented” load distribution on small-scale SMPs. It has a low overhead mechanism for dealing out work to processors in a globally balanced way without costly compare-and-swap operations. Various embodiments of the invention assume that the programmer has provided place annotations in the program in a manner that leads to optimal performance considering load-balancing. According to embodiments of the invention, the activities with affinity for a place are guaranteed to execute on that place while guaranteeing deadlock freedom.
Still further work presented a space-efficient scheduling algorithm for shared memory machines that combines the low scheduling overheads and good locality of work stealing with the low space requirements of depth-first schedulers. For locality, this work uses the heuristic of scheduling threads that are close in the computation DAG onto the same processor. Embodiments of the invention consider a multi-place setup and assume affinities in the place-annotated computation have been specified by the programmer.
Still further work studied two-level adaptive multiprocessor scheduling in a multiprogrammed environment. This work presented a randomized work-stealing thread scheduler for fork-join multithreaded jobs that provides continual parallelism feedback to the job scheduler in the form of requests for processors and uses trim analysis to obtain performance bounds. However, this work did not consider locality guided scheduling. Various embodiments of the invention assume a dedicated mode of execution but can be extended to multiprogrammed modes also.

 Column, Scheduling Algorithm, has values: WS (Work Stealing), WD (Work Dealing), DFS (Depth First Search) and WP (Work Pushing).
 Column, Affinity Driven, has values: Y (Yes), N (No) and L (limited extent).
 Column, Nature Of Computation, has values: FS (fully-strict), G (general), NP (nested parallel), IDP (iterative data parallel) and TS (terminally strict).
 Column, MP vs SP, denotes multi-place (MP) or single place (SP) algorithm setup.
 Column, DM vs MPM, denotes dedicated mode (DM) or multiprogrammed mode (MPM) environment.
 Column, Sync. Overheads, has values L (low), M (medium) and H (high).
 Column, DG mode, denotes whether Doppelganger mode is used in multi-place setup.
 Column, IAP vs. Both, denotes whether only intra-place stealing (IAP) is supported or both (Both) inter-place and intra-place stealing are supported.
 The last Column denotes whether deadlock freedom, space bound and time bound are presented in the respective scheduling approaches.
Anyplace Activity
The runtime system needs to provide online distributed scheduling of large hybrid parallel computations on many-core and massively parallel architectures. Activities (threads) that have pre-specified placement are referred to herein as affinity annotated activities. Further, there are activities (threads) in the parallel program that can be run on any place. Such activities are referred to as anyplace activities. Parallel computations that have both affinity annotated activities and anyplace activities are referred to as hybrid parallel computations.
Herein, anyplace activities are allowed in the input hybrid computation DAG. This generalization allows more parallel applications to be expressed easily by the programmer. Also presented herein are novel distributed scheduling processes that incorporate inter-place prioritized random work stealing to provide automatic dynamic load balancing across places. It is proved that, with a suitable choice of probability distribution, the prioritized random work stealing across places is efficient. Further, it leads to low average communication cost when the distances between the places are different (e.g. 3D torus interconnect). An embodiment of the invention leverages the distributed deadlock avoidance strategy for deadlock free execution and the time and message complexity proofs in prior work for efficient scheduling of hybrid parallel computations. Some key aspects of various embodiments of the invention include the following.
First, an online multi-place distributed scheduling algorithm for strict multi-place hybrid parallel computations assuming unconstrained (sufficient) space per place is given. This process incorporates (a) intra-place work stealing, (b) remote place work pushing for affinity annotated activities and (c) prioritized random work stealing across places for anyplace activities. It is shown herein that prioritized random stealing across places is efficient. Also presented herein are the time and message complexity bounds of the scheduling algorithm.
Second, for bounded space per place, a novel distributed scheduling process for terminally strict multi-place hybrid computations with provable physical deadlock free execution is presented.
Process Design: Each place maintains one Fresh Activity Buffer (FAB), which is managed by the interface processor at that place. An activity that has affinity for a remote place is pushed into the FAB at that place. Each worker at a place has: (a) an APR Deque that contains anyplace ready activities, (b) an AFR Deque that contains affinity annotated ready activities and (c) a Stall Buffer that contains stalled activities (refer to FIG. 7(B)). The FAB at each place, as well as the AFR Deque and APR Deque at each worker, are implemented using a concurrent deque data-structure. Each place also maintains a Worker List Buffer (WLB), which is a list of workers that have anyplace activities ready to be stolen. The WLB is implemented as a concurrent linked list and is maintained by the interface processor. The WLB aids in remote stealing: remote workers that attempt to steal activities from this place get information from the WLB about workers available for stealing. The distributed scheduling algorithm is given in FIG. 8.
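The per-place and per-worker data structures can be sketched as follows. This is a simplified, single-threaded Python illustration (the structures described above are concurrent), and it assumes the common work stealing convention that the owner works at one end of a deque while thieves take from the other; all class, field and method names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Worker:
    apr_deque: deque = field(default_factory=deque)   # anyplace ready activities
    afr_deque: deque = field(default_factory=deque)   # affinity annotated ready activities
    stall_buffer: list = field(default_factory=list)  # stalled activities


@dataclass
class Place:
    workers: list
    fab: deque = field(default_factory=deque)  # Fresh Activity Buffer
    wlb: list = field(default_factory=list)    # indices of workers with stealable anyplace activities

    def push_remote_activity(self, activity):
        """An activity with affinity for this place arrives from a remote place."""
        self.fab.append(activity)

    def add_anyplace_ready(self, worker_index, activity):
        """A worker makes an anyplace activity ready and advertises itself in the WLB."""
        self.workers[worker_index].apr_deque.append(activity)
        if worker_index not in self.wlb:
            self.wlb.append(worker_index)

    def steal_anyplace(self):
        """A remote thief consults the WLB and steals from a listed worker's APR Deque."""
        while self.wlb:
            victim = self.workers[self.wlb[0]]
            if victim.apr_deque:
                activity = victim.apr_deque.popleft()  # thief end (assumption)
                if not victim.apr_deque:
                    self.wlb.pop(0)
                return activity
            self.wlb.pop(0)
        return None
```

A remote steal attempt would call `steal_anyplace()` on the chosen place; a return value of `None` models an unsuccessful throw.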
Time Complexity Analysis: The detailed time complexity analysis using a potential function on ready nodes in the system follows as in prior works. Herein a brief intuitive explanation of time and message complexity is given. Contributions unique to embodiments of the invention are (a) proof that prioritized random inter-place work stealing is efficient using a suitable probability density function, and (b) proof of the lower and upper bounds of time complexity and message complexity for the multi-place distributed scheduling algorithm presented herein, which includes (1) intra-place work stealing, (2) remote-place work stealing and (3) remote place affinity driven work pushing.
Below, a throw represents an attempt by a worker (processor that has no work) to steal an activity. It can be an intra-place throw, when the activity is stolen from another local worker (victim), or a remote-place throw, when it is stolen from a remote place. For potential function based analysis, each ready node u is assigned a potential 3^{2w(u)−1} or 3^{2w(u)} depending on whether it is assigned for execution or not (w(u)=T_{∞,n}−depth(u)). The total potential of the system at step i is denoted by φ_{i}, and φ_{i}(D_{i}) denotes the potential of all APR Deques and AFR Deques that have some ready nodes.
Prioritized Random Inter-Place Work Stealing. Herein it is proven that distance-prioritized inter-place work stealing works efficiently with a suitable choice of probability distribution across places. Consider a 2D torus interconnect across places. Let the place where a processor attempts to steal be denoted by the start place. The places around the start place can be viewed as rings. The rings increase in size as one moves to rings at increasing distance from the start place, i.e., there are more places in a ring farther away from the start place than in a ring closer to the start place (refer to FIG. 9). In a remote steal attempt from the start place, the places on the same ring are chosen with equal probability.
This probability decreases with increasing ring distance from the start place but the total probability of choosing a processor over all processors across all places should be equal to 1. In order to model this scenario, consider a generalized Balls and Weighted Bins game where P balls are thrown independently but nonuniformly at random into P bins. An upper bound is derived on the probability of the unsuccessful steal attempts using Markov's inequality.
Lemma 3.1. Prioritized Balls and Weighted Bins Game: Let there be n places arranged in a 2D torus topology. Suppose that at least P balls are thrown independently but nonuniformly at random into P bins, where, for i=1, . . . , P, bin i has weight W_{i}. The total weight is W=Σ_{1≦i≦P}W_{i}. For each bin i, define a random variable X(i) as:
X(i)=W_{i}, if some ball lands in bin i
X(i)=0, otherwise
Let l_{max} be the distance of the start place from the last ring. Define the probability distribution of choosing rings as follows. Let γ/l_{max} be the probability of choosing the last ring at distance l_{max} from the source of the steal request, where 0<γ<1. The probability of selecting the other rings is chosen appropriately so that the sum of the probabilities of choosing a processor, taken over all processors, equals 1. (For example, let γ=¾. Here, a probability of 5/(4·l_{max}) is assigned to each of the first l_{max}/2 rings and a probability of 3/(4·l_{max}) to each of the last l_{max}/2 rings.)
If X=Σ_{1≦i≦P}X(i), then for β in the range 0<β<1:
Pr{X≧β·W}>1−1/((1−β)·e^{γ/2}).
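The example ring distribution for γ=¾ can be written out and checked in a few lines of Python. The sketch assumes an even l_{max}; the function names and the use of a seeded random generator are illustrative choices, not part of the described process.

```python
import random
from fractions import Fraction


def ring_probabilities(l_max):
    """Example from the text with gamma = 3/4: probability 5/(4*l_max) for each of
    the first l_max/2 rings and 3/(4*l_max) for each of the last l_max/2 rings."""
    if l_max <= 0 or l_max % 2 != 0:
        raise ValueError("l_max must be a positive even integer in this sketch")
    half = l_max // 2
    return [Fraction(5, 4 * l_max)] * half + [Fraction(3, 4 * l_max)] * half


def choose_ring(l_max, rng):
    """Pick the ring (1..l_max) targeted by a remote steal attempt."""
    probs = ring_probabilities(l_max)
    rings = list(range(1, l_max + 1))
    return rng.choices(rings, weights=[float(p) for p in probs], k=1)[0]


if __name__ == "__main__":
    probs = ring_probabilities(8)
    print(sum(probs))                        # total probability over all rings
    print(probs[-1] == Fraction(3, 4) / 8)   # last ring has probability gamma / l_max
    print(choose_ring(8, random.Random(0)))
```

Nearer rings receive a larger share of the steal attempts, while the last ring still keeps the probability γ/l_{max} that the lemma relies on.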
Proof: A ring at distance l from the start place has 8l places. Since each place has m processors, the ring at distance l has 8l·m processors, and each of these processors has an equal probability that a ball will land in that processor (bin).
Now, for each bin i, consider the random variable W(i)−X(i). It takes the value W(i) when no ball lands in bin i and the value 0 otherwise. Thus:
E[W(i)−X(i)]=W(i)·(probability that no ball lands in bin i)
≦W(i)·[1−(minimum probability that any given ball lands in bin i)]^{P}
≦W(i)·[1−γ/(l_{max}·8l_{max}·m)]^{mn}
≦W(i)/e^{(l_{max}+1)·γ/(2·l_{max})}, since n=4l_{max}(l_{max}+1) and (1−1/x)^{x}≦1/e
≦W(i)/e^{γ/2}, for large l_{max}
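It follows that: E[W−X]≦W/e^{γ/2}
From Markov's inequality, thus: Pr{W−X>(1−β)·W}≦E[W−X]/((1−β)·W)≦1/((1−β)·e^{γ/2}), so that Pr{X≧β·W}=1−Pr{W−X>(1−β)·W} is bounded below as stated in the lemma.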
It can be seen that, due to the skewed probability with which balls choose bins, the probability of successful attempts goes down compared to the case of uniform probability. Even though a ring distance based probability variation was chosen here, an actual processor distance based probability variation can be analyzed similarly with a suitable probability distribution. By choosing β=⅕, γ=¾ one can show that after O(mn) remote place throws across the system, the potential of anyplace ready activities in φ_{i}(D_{i}) decreases by 1/16. The time and message complexity lower and upper bounds are given by the theorems below. Detailed proofs follow by extending the analysis in prior work.
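The bound of Lemma 3.1 can be evaluated for the parameter choice mentioned above. The Python sketch below computes 1−1/((1−β)·e^{γ/2}) and also checks, for one sample torus size, the step of the proof that bounds the probability that no ball lands in a given bin; the function names and the sample values of l_{max} and m are assumptions made for illustration.

```python
import math


def steal_success_lower_bound(beta, gamma):
    """Lower bound on Pr{X >= beta * W} from Lemma 3.1."""
    if not (0 < beta < 1 and 0 < gamma < 1):
        raise ValueError("beta and gamma must lie strictly between 0 and 1")
    return 1 - 1 / ((1 - beta) * math.exp(gamma / 2))


def no_ball_factor(gamma, l_max, m):
    """(1 - gamma / (8 * l_max^2 * m)) ** (m * n) with n = 4 * l_max * (l_max + 1)."""
    n = 4 * l_max * (l_max + 1)
    p_min = gamma / (l_max * 8 * l_max * m)
    return (1 - p_min) ** (m * n)


if __name__ == "__main__":
    # Parameter choice from the text: beta = 1/5, gamma = 3/4.
    print(steal_success_lower_bound(beta=0.2, gamma=0.75))
    # The proof bounds this factor from above by e^(-gamma/2).
    print(no_ball_factor(0.75, l_max=4, m=2) <= math.exp(-0.75 / 2))
```

For β=⅕ and γ=¾ the bound evaluates to roughly 0.14, which is a constant probability of a successful round of throws and is what the phase based analysis requires.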
Theorem 3.1. Consider a strict multi-place hybrid computation DAG with work for place P_{k}, denoted by T_{1}^{k}, being executed by the distributed scheduling process (discussed above). Let the critical-path length for the computation be T_{∞,n}. The lower bound on the expected execution time is O(max_{k}T_{1}^{k}/m+T_{∞,n}) and the upper bound is O(Σ_{k}(T_{1}^{k}/m+T_{∞}^{k})). Moreover, for any ε>0, the lower bound for the execution time is O(max_{k}T_{1}^{k}/m+T_{∞,n}+log(1/ε)) with probability at least 1−ε. A similar probabilistic upper bound exists.
Theorem 3.2. Consider the execution of a strict hybrid multi-place computation DAG with critical path-length T_{∞,n} by the Distributed Scheduling Algorithm (discussed herein). Then, the total number of bytes communicated across places has the expectation O(I·S_{max}·n_{d}+m·T_{∞,n}·S_{max}·n_{d}). Further, the lower bound on the number of bytes communicated within a place has the expectation O(m·T_{∞,n}·S_{max}·n_{d}), where n_{d} is the maximum number of dependence edges from the descendants to a parent and I is the number of remote spawns from one place to a remote place. Moreover, for any ε>0, the probability is at least (1−ε) that the lower bound on the intra-place communication overhead per place is O(m·(T_{∞,n}+log(1/ε))·n_{d}·S_{max}). Similar message upper bounds exist.
Distributed Scheduling of Hybrid Computation in Bounded Space: Due to limited space on real systems, the distributed scheduling algorithm has to limit online breadth first expansion of the computation DAG while minimizing the impact on execution time and simultaneously providing a deadlock freedom guarantee. Under bounded space constraints, this distributed online scheduling algorithm guarantees deadlock free execution for terminally strict multi-place hybrid computations. Due to space constraints at each place in the system, the algorithm needs to keep track of space availability at each worker and place to ensure physical deadlock freedom. It does so by ensuring that remote activity pushing, inter-place stealing and intra-place stealing happen only when there is sufficient space to execute the remaining path to the leaf in the current path. This tracking of available space, together with depth based ordering of activities for execution from the FAB, helps in ensuring distributed deadlock avoidance. An activity can be in one of the following stalled states: (a) local-stalled due to lack of space at a worker, (b) remote-stalled due to a failed spawn onto a remote place, (c) depend-stalled due to synchronization dependencies.
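The space check and the stalled states can be illustrated with a minimal Python sketch. It assumes that the root is at depth 1, that each activity on the remaining path to depth D_{max} needs at most S_{max} bytes, and that a rejected remote spawn leaves the spawning activity remote-stalled while a rejected local activity becomes local-stalled; the names and the exact admission rule are hypothetical simplifications of the process in the figures.

```python
from enum import Enum


class StalledState(Enum):
    LOCAL_STALLED = "local-stalled"    # lack of space at a worker
    REMOTE_STALLED = "remote-stalled"  # failed spawn onto a remote place
    DEPEND_STALLED = "depend-stalled"  # synchronization dependencies


def space_for_remaining_path(depth, d_max, s_max):
    """Worst-case bytes for the path from an activity at `depth` down to depth D_max."""
    return (d_max - depth + 1) * s_max


def admit(available_bytes, depth, d_max, s_max, remote_spawn):
    """Return None if the activity can be admitted, else the resulting stalled state."""
    if available_bytes >= space_for_remaining_path(depth, d_max, s_max):
        return None
    return StalledState.REMOTE_STALLED if remote_spawn else StalledState.LOCAL_STALLED


if __name__ == "__main__":
    print(admit(1000, depth=1, d_max=10, s_max=100, remote_spawn=False))
    print(admit(100, depth=9, d_max=10, s_max=100, remote_spawn=True))
```

Admitting an activity only when its whole remaining path fits is what prevents a partially executed path from blocking on space that can never be released.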
Herein it is assumed that the maximum depth of the computation tree (in terms of number of activities), D_{max}, can be estimated fairly accurately prior to the execution from the parameters used in the input parallel computation. The D_{max} value is used in the distributed scheduling algorithm to ensure physical deadlock free execution. The assumption on knowledge of D_{max} prior to execution holds true for the kernels and large applications of the Java Grande Benchmark suite.
Distributed Data-Structures & Process Design: The data structures used for the bounded space scheduling algorithm are described in FIG. 10. FIG. 11 illustrates distributed data structures for bounded space scheduling according to an embodiment of the invention.
Let AM(T) denote the active message for spawning the activity T. The activities in the remote-stalled state are tracked using a linked list of activity IDs, with the head and tail of the list available at the tuple corresponding to the place in the map AMRejectMap. For notation purposes, the suffixes (i) and (i, r) denote that a data-structure is located at place P_{i} and worker W_{i}^{r}, respectively.
Computation starts with root of the computation DAG which is at depth 1. The computation starts at a worker W_{0} ^{s}, at the default place P_{0}. At any point of time a worker at a place, W_{i} ^{r}, can either be executing an activity, T, or be idle. The detailed process is presented in FIG. 12 . The actions taken by the interface processor have been kept implicit in the description for sake of brevity.
Distributed deadlock freedom can be proved by induction as in affinity driven scheduling, and the proof is omitted for brevity. The essence lies in showing that when an activity gets rejected, a higher depth activity must be executing at that place. Using induction, one can then show that all activities eventually become leaves and get executed, starting from the maximum depth activities and going backwards to lower depth activities as space gets released by completed activities. The following theorem gives the space bound.
Theorem 3.3 A terminally strict computation scheduled using algorithm in FIG. 12 uses O(m·(D_{max}·S_{max}+n·S_{max}+S_{1})) bytes as space per place.
The inter-place message complexity is the same as in Theorem 2.2 (assuming a similar order of number of throws for inter-place work stealing), as there is a constant amount of work for handling rejected remote spawns and notification of space availability. For intra-place work stealing, the message complexity is again the same as in Theorem 2.2.
Multi-Programmed Mode
Embodiments of the invention provide a multiprogrammed mode using an adaptive work stealing framework. Here there are multiple jobs in the system (with multiple places). The framework is adaptive because the kernel scheduler changes the resources available to a job based on its utilization. If the utilization of a job is high, the kernel scheduler might allocate more of the available resources to it, and if its utilization is low, it might take resources away from that job. Given a set of resources from the kernel scheduler (resources meaning processors/memory), the user scheduler runs the bounded space affinity driven distributed scheduling algorithm. Embodiments of the invention provide feedback to the kernel scheduler on the online demand for processors per place and memory per place. There can be a minimum requirement of processors/cores and memory for each job. The kernel scheduler will guarantee that such resources are always available to that job. This is based on the minimum performance requirements expected for that job. There are two schedulers here. One is a user level scheduler that gets the resources from the kernel scheduler. At regular intervals it informs the kernel scheduler whether the resources provided have been over-utilized or under-utilized. The other is the kernel level scheduler that provides resources to multiple jobs based on their resource utilization. Here the resources include both processors/cores and memory.
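The interaction between the two schedulers can be sketched in Python as follows. The utilization thresholds, the one-processor adjustment step and all names are hypothetical; the text above only states that the user level scheduler reports over-utilization or under-utilization and that the kernel level scheduler never goes below a job's minimum allotment.

```python
def utilization_feedback(busy_steps, allotted_steps, low=0.5, high=0.9):
    """User level scheduler: classify the utilization of the last interval."""
    if allotted_steps <= 0:
        raise ValueError("allotted_steps must be positive")
    utilization = busy_steps / allotted_steps
    if utilization >= high:
        return "over-utilized"
    if utilization <= low:
        return "under-utilized"
    return "balanced"


def next_allotment(current, minimum, free, feedback):
    """Kernel level scheduler: adjust the processors given to one job at one place."""
    if feedback == "over-utilized" and free > 0:
        return current + 1
    if feedback == "under-utilized" and current > minimum:
        return current - 1
    return current


if __name__ == "__main__":
    feedback = utilization_feedback(busy_steps=95, allotted_steps=100)
    print(feedback, next_allotment(current=4, minimum=2, free=3, feedback=feedback))
```

The same feedback loop could be applied to memory per place; the sketch covers processors only.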
Embodiments of the invention may be implemented in one or more computing devices configured appropriately to execute program instructions consistent with the functionality of the embodiments of the invention as described herein. In this regard,
Referring now to FIG. 14 , there is depicted a block diagram of an illustrative embodiment of a computer system 100. The illustrative embodiment depicted in FIG. 14 may be an electronic device such as a desktop computer or workstation computer. As is apparent from the description, however, the embodiments of the invention may be implemented in any appropriately configured device, as described herein.
As shown in FIG. 14, computer system 100 includes at least one system processor 42, which is coupled to a Read-Only Memory (ROM) 40 and a system memory 46 by a processor bus 44. System processor 42, which may comprise one of the AMD line of processors produced by AMD Corporation or a processor produced by INTEL Corporation, is a general-purpose processor that executes boot code 41 stored within ROM 40 at power-on and thereafter processes data under the control of an operating system and application software stored in system memory 46. System processor 42 is coupled via processor bus 44 and host bridge 48 to Peripheral Component Interconnect (PCI) local bus 50.
PCI local bus 50 supports the attachment of a number of devices, including adapters and bridges. Among these devices is network adapter 66, which interfaces computer system 100 to LAN, and graphics adapter 68, which interfaces computer system 100 to display 69. Communication on PCI local bus 50 is governed by local PCI controller 52, which is in turn coupled to nonvolatile random access memory (NVRAM) 56 via memory bus 54. Local PCI controller 52 can be coupled to additional buses and devices via a second host bridge 60.
Computer system 100 further includes Industry Standard Architecture (ISA) bus 62, which is coupled to PCI local bus 50 by ISA bridge 64. Coupled to ISA bus 62 is an input/output (I/O) controller 70, which controls communication between computer system 100 and attached peripheral devices such as a keyboard, mouse, serial and parallel ports, et cetera. A disk controller 72 connects a disk drive with PCI local bus 50. The USB Bus and USB Controller (not shown) are part of the Local PCI controller (52).
As will be appreciated by one skilled in the art, aspects of the invention may be embodied as a system, method or computer program product. Accordingly, aspects of the invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer (device), partly on the user's computer, as a standalone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
This disclosure has been presented for purposes of illustration and description but is not intended to be exhaustive or limiting. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to explain principles and practical application, and to enable others of ordinary skill in the art to understand the disclosure for various embodiments with various modifications as are suited to the particular use contemplated.
Although illustrative embodiments of the invention have been described herein with reference to the accompanying drawings, it is to be understood that the embodiments of the invention are not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the disclosure.
A Affinity Driven Scheduling with Unconstrained Space
Lemma A.1 Consider a strict place-annotated computation DAG with work per place, T_{1}^{k}, being executed by the affinity driven distributed scheduling algorithm presented in FIG. 3. Then, the execution (finish) time for place k is O(T_{1}^{k}/m+Q_{r}^{k}/m+Q_{e}^{k}/m), where Q_{r}^{k} denotes the number of throws when there is at least one ready node at place k and Q_{e}^{k} denotes the number of throws when there are no ready nodes at place k. The lower bound on the execution time of the full computation is O(max_{k}(T_{1}^{k}/m+Q_{r}^{k}/m)) and the upper bound is O(Σ_{k}(T_{1}^{k}/m+T_{∞}^{k})).
Proof At any place k, we collect tokens in three buckets: the work bucket, the ready-node-throw bucket and the null-node-throw bucket. Tokens are collected in the work bucket when the processors at the place execute ready nodes. Thus, the total number of tokens collected in the work bucket is T_{1}^{k}. When the place has some ready nodes and a processor at that place throws or attempts to steal ready nodes from the PrQ or another processor's deque, the tokens are added to the ready-node-throw bucket. If there are no ready nodes at the place, then the throws by processors at that place are accounted for by placing tokens in the null-node-throw bucket. The tokens collected in these three buckets account for all work done by the processors at the place until the finish time for the computation at that place. Thus, the finish time at place k is O(T_{1}^{k}/m+Q_{r}^{k}/m+Q_{e}^{k}/m). The finish time of the complete computation DAG is the maximum finish time over all places. So, the execution time for the computation is max_{k} O(T_{1}^{k}/m+Q_{r}^{k}/m+Q_{e}^{k}/m). We consider two extreme scenarios for Q_{e}^{k} that define the lower and upper bounds. For the lower bound, at every step of the execution, every place has some ready node, so no tokens are placed in the null-node-throw bucket at any place. Hence, the execution time per place is O(T_{1}^{k}/m+Q_{r}^{k}/m). The execution time for the full computation becomes O(max_{k}(T_{1}^{k}/m+Q_{r}^{k}/m)). For the upper bound, there exists a place, say (w.l.o.g.) s, where the number of tokens in the null-node-throw bucket, Q_{e}^{s}, is equal to the sum of the total number of tokens in the work buckets of all other places and the tokens in the ready-node-throw buckets of all other places. Thus, the finish time for this place, T_{f}^{s}, which is also the execution time for the computation, is given by:
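The substitution described in the last sentence can be written out as follows. This is a sketch reconstructed from the definitions in the proof, not the original display: it assumes only the per-place finish time and the stated value of Q_{e}^{s}.

```latex
\begin{align*}
T_f^{s} &= O\!\left(\frac{T_1^{s}}{m} + \frac{Q_r^{s}}{m} + \frac{Q_e^{s}}{m}\right),
\qquad Q_e^{s} = \sum_{k \neq s}\left(T_1^{k} + Q_r^{k}\right) \\
        &= O\!\left(\sum_{k}\left(\frac{T_1^{k}}{m} + \frac{Q_r^{k}}{m}\right)\right).
\end{align*}
```

With the per-place bound Q_{r}^{k}=O(T_{∞}^{k}·m) used in the time complexity analysis, this expression takes the form O(Σ_{k}(T_{1}^{k}/m+T_{∞}^{k})) stated in Lemma A.1.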
B Time Complexity Analysis: Scheduling Algorithm with Unconstrained Space
We compute the bound on the number of tokens in the ready-node-throw bucket using potential function based analysis (Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA, pages 119-129, Puerto Vallarta, Mexico, 1998).
Let there be a non-negative potential associated with ready nodes (each representing one instruction) in a computation DAG. During the execution using the affinity driven distributed scheduling algorithm (FIG. 3), the weight of a node u in the enabling tree, w(u), is defined as (T_{∞,n}−depth(u)), where depth(u) is the depth of u in the enabling tree of the computation. For a ready node u, we define φ_{i}(u), the potential of u at timestep i, as:
φ_{i}(u)=3^{2w(u)−1}, if u is assigned; (B.1a)
=3^{2w(u)}, otherwise (B.1b)
All non-ready nodes have 0 potential. The potential at step i, φ_{i}, is the sum of the potentials of all ready nodes at step i. When an execution begins, the only ready node is the root node, with potential φ_{0}=3^{2T_{∞,n}−1}. At the end the potential is 0 since there are no ready nodes. Let E_{i} denote the set of processes whose deque is empty at the beginning of step i, and let D_{i} denote the set of all other processes, whose deques are non-empty. Let F_{i} denote the set of all ready nodes present in the FABs of all places. The total potential can be partitioned into three parts as follows:
φ_{i}=φ_{i}(E_{i})+φ_{i}(D_{i})+φ_{i}(F_{i}) (B.2)
where,
φ_{i}(E_{i})=Σ_{k}φ_{i}^{k}(E_{i}), φ_{i}(D_{i})=Σ_{k}φ_{i}^{k}(D_{i}), φ_{i}(F_{i})=Σ_{k}φ_{i}^{k}(F_{i}) (B.3)
where φ_{i}^{k}( ) are the respective potential components per place k. The potential at place k, φ_{i}^{k}, is equal to the sum of the three components, i.e.
φ_{i} ^{k}=φ_{i} ^{k}(E_{i})+φ_{i} ^{k}(D_{i})+φ_{i} ^{k}(F_{i}) (B.4)
Actions such as assignment of a node from a deque to the processor for execution, stealing nodes from the top of a victim's deque, and execution of a node lead to a decrease of potential (see Lemma B.2). The idle processors at a place alternate between stealing from deques and stealing from the FAB. Thus, 2P throws in a round consist of P throws to other processors' deques and P throws to the FAB. We first analyze the case when randomized work stealing takes place from another processor's deque, using the balls and bins game to compute the expected and probabilistic bounds on the number of throws. For uniform and random throws in the balls and bins game it can be shown that one is able to get a constant fraction of the reward with constant probability (see Lemma B.3). The lemma below shows that whenever P or more throws occur for getting nodes from the top of the deques of other processors at the same place, the potential decreases by a constant fraction of φ_{i}(D_{i}) with a constant probability. For the algorithm in FIG. 3, P=m (only intra-place work stealing).
Lemma B.1 Consider any round i and any later round j such that at least P throws have taken place between round i (inclusive) and round j (exclusive). Then, Pr{(φ_{i}−φ_{j})≧¼·φ_{i}(D_{i})}>¼.
There is an additional component of potential decrease, which is due to pushing of ready nodes onto remote FABs. Let the potential decrease due to this transfer be φ_{i→j}^{k}(out). The new probabilistic bound becomes:
Pr{(φ_{i}−φ_{j})≧¼·φ_{i}(D_{i})+φ_{i→j}(out)}>¼ (B.5)
The throws that occur on the FAB at a place can be divided into two cases. In the first case, let the FAB have at least P=m activities at the beginning of round i. Since all m throws will be successful, we consider the tokens collected from such throws as work tokens and assign them to the work buckets of the respective processors. In the second case, at the beginning of round i, the FAB has fewer than m activities. Therefore, some of the m throws might be unsuccessful. Hence, from the perspective of place k, the potential φ_{i}^{k}(F_{i}) gets reduced to zero. The potential added at place k, in φ_{j}^{k}(F_{j}), is due to ready nodes pushed from the deques of other places. Let this component be φ_{i→j}^{k}(in). The potential of the FAB at the beginning of round j is:
φ_{j} ^{k}(F_{j})→φ_{j} ^{k}(F_{i})=φ_{i→j} ^{k}(in). (B.6)
Furthermore, at each place the potential also drops by a constant factor of φ_{i} ^{k}(E_{i}). If a process q in the set E_{i} ^{k }does not have an assigned node, then φ_{i}(q)=0. If q has an assigned node u, then φ_{i}(q)=φ_{i}(u) and when node u gets executed in round i then the potential drops by at least 5/9.φ_{i}(u). Adding over each process q in E_{i} ^{k}, we get:
{φ_{i} ^{k}(E_{i})−φ_{j} ^{k}(E_{j})}≧ 5/9.φ_{i} ^{k}(E_{i}). (B.7)
Lemma B.2 The potential function satisfies the following properties

 1. When node u is assigned to a process at step i, then the potential decreases by at least ⅔φ_{i}(u).
 2. When node u is executed at step i, then the potential decreases by at least 5/9φ_{i}(u).
 3. For any process, q in D_{i}, the topmost node u in the deque for q maintains the property that: φ_{i}(u)≧¾φ_{i}(q)
 4. If the topmost node u of a processor q is stolen by processor p at step i, then the potential at the end of step i decreases by at least ½φ_{i}(q) due to assignment or execution of u.
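Properties 1 and 2 can be checked with exact arithmetic. The following is a minimal sketch, not the patent's implementation: the function names are hypothetical, and the execution case assumes, as in the analysis of Arora et al., that an executed node enables at most two children of weight w(u)−1, one of which becomes assigned while the other is placed in a deque.

```python
from fractions import Fraction


def potential(weight: int, assigned: bool) -> Fraction:
    """Potential of a ready node: 3^(2w-1) if assigned, 3^(2w) otherwise."""
    exponent = 2 * weight - 1 if assigned else 2 * weight
    return Fraction(3) ** exponent


def drop_on_assignment(weight: int) -> Fraction:
    """Potential lost when a ready node in a deque becomes assigned."""
    return potential(weight, assigned=False) - potential(weight, assigned=True)


def drop_on_execution(weight: int) -> Fraction:
    """Potential lost when an assigned node executes (assumed worst case).

    Assumption: the node enables two children of weight (weight - 1);
    one is assigned immediately and the other is left in a deque.
    """
    before = potential(weight, assigned=True)
    after = potential(weight - 1, assigned=True) + potential(weight - 1, assigned=False)
    return before - after


if __name__ == "__main__":
    w = 5
    print(drop_on_assignment(w) / potential(w, assigned=False))  # fraction lost on assignment
    print(drop_on_execution(w) / potential(w, assigned=True))    # fraction lost on execution
```

Under these assumptions the two ratios are exactly 2/3 and 5/9 for any weight, which matches the constants in properties 1 and 2.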
Lemma B.3 Balls and Bins Game: Suppose that at least P balls are thrown independently and uniformly at random into P bins, where for i=1,2,...,P, bin i has weight W_{i}. The total weight W=Σ_{1≦i≦P}W_{i}. For each bin i, define a random variable X_{i} as,
X_{i}=W_{i}, if some ball lands in bin i (B.8a)
=0, otherwise (B.8b)
If X=Σ_{1≦i≦P }X_{i}, then for any β in the range 0<β<1, we have Pr{X≧β.W}>1−1/((1−β)e)
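The bound in Lemma B.3 can be compared against a simulation. The sketch below is illustrative only: the function names, the trial count and the fixed seed are assumptions, and an empirical estimate is a sanity check on the lemma, not a proof of it.

```python
import math
import random


def reward(weights, balls, rng):
    """Throw `balls` balls uniformly at random into len(weights) bins.

    Returns X, the total weight of the bins that received at least one ball.
    """
    hit = {rng.randrange(len(weights)) for _ in range(balls)}
    return sum(weights[i] for i in hit)


def estimate_success(weights, beta, trials=2000, seed=0):
    """Estimate Pr{X >= beta * W} when P balls are thrown into P bins."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    total = sum(weights)
    bins = len(weights)
    wins = sum(reward(weights, bins, rng) >= beta * total for _ in range(trials))
    return wins / trials


def lemma_bound(beta):
    """Lower bound from Lemma B.3: 1 - 1/((1 - beta) * e)."""
    return 1 - 1 / ((1 - beta) * math.e)


if __name__ == "__main__":
    weights = [1] * 16
    print("estimate:", estimate_success(weights, beta=0.5))
    print("bound:   ", lemma_bound(0.5))
```

For β=½ the bound evaluates to 1−2/e, which is slightly above ¼, the constant that appears in Lemma B.1.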
Theorem B.4 Consider any place-annotated multithreaded computation with total work T_{1} and work per place denoted by T_{1}^{k}, being executed by the affinity driven multi-place distributed scheduling algorithm. Let the critical-path length for the computation be T_{∞}. The lower bound on the expected execution time is O(max_{k}T_{1}^{k}/m+T_{∞,n}) and the upper bound is O(Σ_{k}(T_{1}^{k}/m+T_{∞}^{k})). Moreover, for any ε>0, the execution time is O(max_{k}T_{1}^{k}/m+T_{∞,n}+log(1/ε)) with probability at least 1−ε.
Proof Lemma A.1 provides the lower bound on the execution time in terms of the number of throws. We shall prove that the expected number of throws per place is O(T_{∞}·m), and that the number of throws per place is O(T_{∞}·m+log(1/ε)) with probability at least 1−ε.
We analyze the number of ready-node throws by breaking the execution into phases of Θ(P=mn) throws (O(m) throws per place). We show that with constant probability, a phase causes the potential to drop by a constant factor, and since we know that the potential starts at φ_{0}=3^{2T_{∞,n}−1} and ends at zero, we can use this fact to bound the number of phases. The first phase begins at step t_{1}=1 and ends at the first step, t′_{1}, such that at least P throws occur during the interval of steps [t_{1},t′_{1}]. The second phase begins at step t_{2}=t′_{1}+1, and so on.
Combining equations (B.5), (B.6) and (B.7) over all places, the components of the potential at the places corresponding to φ_{i→j} ^{k}(out) and φ_{i→j} ^{k}(in) cancel out. Using this and Lemma B.1, we get that: Pr{(φ_{i}−φ_{j})≧¼.φ_{i}}>¼.
We say that a phase is successful if it causes the potential to drop by at least a ¼ fraction. A phase is successful with probability at least ¼. Since the potential drops from 3^{2T_{∞,n}−1} to 0 and takes integral values, the number of successful phases is at most (2T_{∞,n}−1)log_{4/3}3<8T_{∞,n}. The expected number of phases needed to obtain 8T_{∞,n} successful phases is at most 32T_{∞,n}. Since each phase contains O(mn) ready-node throws, the expected number of ready-node throws is O(T_{∞,n}·m·n), with O(T_{∞,n}·m) throws per place. The high probability bound can be derived [4] using Chernoff's inequality. We omit this for brevity.
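The arithmetic behind the phase count can be reproduced numerically. This is a small sketch with hypothetical function names; it only evaluates the expressions quoted above for a given critical-path length and assumes each phase succeeds independently with probability ¼.

```python
import math


def max_successful_phases(t_inf: int) -> float:
    """(2*T - 1) * log_{4/3}(3): phases needed to shrink 3^(2T-1) below 1
    when every successful phase leaves at most 3/4 of the potential."""
    return (2 * t_inf - 1) * math.log(3, 4 / 3)


def expected_phases(t_inf: int, success_probability: float = 0.25) -> float:
    """Expected number of phases needed to collect 8*T successful ones."""
    return 8 * t_inf / success_probability


if __name__ == "__main__":
    for t_inf in (1, 10, 100):
        print(t_inf, max_successful_phases(t_inf), 8 * t_inf, expected_phases(t_inf))
```

Because log_{4/3}3 is a little under 4, the first quantity stays below 8T_{∞,n} for every positive T_{∞,n}, and dividing 8T_{∞,n} by the success probability ¼ gives the 32T_{∞,n} figure.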
Now, using Lemma A.1, we get that the lower bound on the expected execution time for the affinity driven multi-place distributed scheduling algorithm is O(max_{k}T_{1}^{k}/m+T_{∞,n}).
For the upper bound, consider the execution of the subgraph of the computation at each place. The number of throws in the ready-node-throw bucket per place can be similarly bounded by O(T_{∞}^{k}·m). Further, the place that finishes execution last can end up with a number of tokens in the null-node-throw bucket equal to the tokens in the work and ready-node-throw buckets of the other places. Hence, the finish time for this place, which is also the execution time of the full computation DAG, is O(Σ_{k}(T_{1}^{k}/m+T_{∞}^{k})). The probabilistic upper bound can be similarly established using the Chernoff bound.
C
C.1 Message Complexity of Distributed Scheduling Algorithm in Unconstrained Space
Proof First consider inter-place messages. Let the number of affinity driven pushes to remote places be O(I), each of O(S_{max}) bytes. Further, there could be at most n_{d} dependencies from remote descendants to a parent, each of which involves communication of a constant, O(1), number of bytes. So, the total inter-place communication is O(I·(S_{max}+n_{d})). Since the randomized work stealing is within a place, the lower bound on the expected number of steal attempts per place is O(m·T_{∞}), with each steal attempt requiring O(S_{max}) bytes of communication within a place. Further, there can be communication when a child thread enables its parent and puts the parent into the child processor's deque. Since this can happen n_{d} times for each time the parent is stolen, the communication involved is at most (n_{d}·S_{max}). So, the expected total intra-place communication per place is O(m·T_{∞,n}·S_{max}·n_{d}). The probabilistic bound can be derived using Chernoff's inequality and is omitted for brevity. Similarly, expected and probabilistic upper bounds can be established for communication complexity within the places.
C.2 Deadlock Freedom and Space Bound Proof For Distributed Scheduling Algorithm using Bounded Space
Lemma C.1 A place or a worker that accepts activity with depth d′ has space to execute activities of depth greater than or equal to d′+1.
Proof At any point of time, a place or a worker accepts an activity of depth d′ only if it has space greater than (D_{max}−d′)·S_{max}. This holds true in the Remote Empty Deque and Activity Stalled cases of the algorithm (FIG. 6 ). The algorithm adopts this reservation policy which ensures that activities already executing have reserved space that they may need for stalled activities. The space required to execute an activity of depth greater or equal to (d′+1) is obviously less, and hence, the place can execute it.
Lemma C.2 There is always space to execute activities at depth D_{max}.
Proof The space required to execute an activity at depth D_{max} is at most S_{max} because it is a leaf activity. Such activities do not depend on other activities and will not spawn any child activities, so they will not generate any stalled activities. Hence, they require a maximum of S_{max} space. Therefore, leaf activities get consumed from the PrQ as soon as its worker becomes idle. The leaf activities also get pulled from the FAB and executed by a worker that has an empty deque and cannot execute activities from its PrQ due to lack of activities or space.
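A queue that serves deeper activities first, which is the behaviour the text attributes to the PrQ, can be sketched with a binary heap. The class and method names below are hypothetical, and breaking ties by insertion order is an assumption that the text does not specify.

```python
import heapq
import itertools


class DepthPriorityQueue:
    """Priority queue that returns the activity with the greatest depth first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order (assumption)

    def push(self, depth, activity):
        # heapq is a min-heap, so the depth is negated to pop the deepest first.
        heapq.heappush(self._heap, (-depth, next(self._counter), activity))

    def pop(self):
        neg_depth, _, activity = heapq.heappop(self._heap)
        return -neg_depth, activity

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    queue = DepthPriorityQueue()
    queue.push(1, "root")
    queue.push(3, "leaf")
    queue.push(2, "inner")
    while queue:
        print(queue.pop())
```

With this ordering a leaf activity at depth D_{max} is always the next one taken by an idle worker, in line with the argument of Lemma C.2.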
Theorem C.3 A terminally strict computation scheduled using algorithm in FIG. 6 uses O(m·(D_{max}·S_{max}+n·S_{max}+S_{1})) bytes as space per place.
Proof The PrQ, StallBuffer, AMRejectMap and deque per worker (processor) take a total of O(m·(D_{max}·S_{max}+n·S_{max}+S_{1})) bytes per place. The WorkRejectMap and FAB take a total of O(m·n+D_{max}) and O(D_{max}·S_{max}) space per place, respectively (section 5.1). The scheduling strategy adopts a space conservation policy to ensure deadlock free execution in bounded space. The basic aim of this strategy is to ensure that only as much breadth of a tree is explored as can be accommodated in the available space, assuming each path can go to the maximum depth of D_{max}. It starts with the initial condition where the available space is at least D_{max}·S_{max} per worker per place. No activity (with depth D_{u}) can be scheduled on a worker if it cannot reserve the space for the possible stalled activities ((D_{max}−D_{u})·S_{max}) that it can generate at that place (Remote Spawn, Empty Deque cases). A place that enables a remote activity stalled because of space does so only after ensuring that an appropriate amount of space is present for the activity that shall be created (Activity Enabled and Receives Notification cases). Similarly, when a worker steals, it will ensure that it has enough space ((D_{max}−D_{u})·S_{max}) to accommodate the stalled activities that would get created as a result of execution of the stolen activity, u (Empty Deque case). When an activity gets stalled (Activity Stalled case) it reserves S_{max} space from B_{i}^{r}, and when it is picked up for execution (Empty Deque case) it releases this space, S_{max}, from B_{i}^{r}. So, the space B_{i}^{r} suffices during execution. Similarly, for the FAB, S_{max} space is reserved when an active message is placed and S_{max} space is released when that active message is picked for execution by an idle worker (Empty Deque case). Thus, the FAB space requirement is not exceeded during execution.
The check on the FAB space for remote spawn ((D_{max}−D_{u})·S_{max}) ensures depth-based ordering of activities across places and hence helps in deadlock free execution. From the algorithm, it can be seen that every reservation and release is such that the total space requirement at a place does not exceed what was available initially. Hence, the total space per place used is O(m·(D_{max}·S_{max}+n·S_{max}+S_{1})).
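The depth-based reservation rule can be illustrated with a small bookkeeping class. This is a simplified sketch under stated assumptions, not the algorithm of FIG. 6: the class and method names are hypothetical, a single counter stands in for the separate buffers, the strict comparison follows the wording of Lemma C.1, and releasing the whole reservation in one step is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SpaceBudget:
    """Tracks the free space of one worker, in bytes (illustrative model)."""

    d_max: int      # maximum depth of the computation tree
    s_max: int      # maximum space needed by a single activity
    available: int  # space currently free at this worker

    def required(self, depth: int) -> int:
        """Space reserved for stalled activities below an activity at `depth`."""
        return (self.d_max - depth) * self.s_max

    def try_accept(self, depth: int) -> bool:
        """Accept an activity only if more than (D_max - depth) * S_max is free."""
        need = self.required(depth)
        if self.available <= need:  # Lemma C.1 asks for strictly greater space
            return False
        self.available -= need
        return True

    def release(self, depth: int) -> None:
        """Return the reservation made for an activity at `depth` (assumption)."""
        self.available += self.required(depth)


if __name__ == "__main__":
    budget = SpaceBudget(d_max=4, s_max=10, available=40)
    print(budget.try_accept(1), budget.available)  # reserves 30 bytes
    print(budget.try_accept(1), budget.available)  # rejected, too little space left
    print(budget.try_accept(4), budget.available)  # a leaf reserves nothing
```

Because the reservation shrinks as the depth grows, a worker that has accepted an activity at depth d′ can still accept deeper ones, which is the property that Lemma C.1 relies on.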
Claims (20)
1. An apparatus comprising:
one or more processors; and
a non-transitory computer readable storage medium having computer readable program code embodied therewith and executable by the one or more processors, the computer readable program code comprising:
computer readable program code configured to provide distributed scheduling of activities for workers at one or more places, the one or more places each comprising one or more processors having shared memory;
wherein to provide distributed scheduling further comprises permitting activities with higher depth on a path in a computation tree to execute to completion before activities with lower depth on the same path;
wherein each worker at a place maintains a priority queue that accords higher priority to activities with higher depth, and a record of rejected attempts to spawn activities at another place;
wherein each place maintains a fresh activity buffer containing activities spawned by remote places; and
wherein each place maintains a list of workers whose spawned activities are rejected at that place.
2. The apparatus according to claim 1, wherein to provide distributed scheduling further comprises providing distributed scheduling for terminally strict multi-place computations.
3. The apparatus according to claim 1 , wherein to provide distributed scheduling further comprises providing hierarchical scheduling, wherein the hierarchical scheduling involves scheduling within a place and across places.
4. The apparatus according to claim 1 , wherein the computer readable program code further comprises computer readable program code configured to provide scheduling for hybrid parallel computations including anyplace activities and activities.
5. The apparatus according to claim 1, wherein the computer readable program code further comprises computer readable program code configured to provide anyplace activities in a parallel computation to enable automatic load-balancing across places.
6. The apparatus according to claim 1 , wherein the computer readable program code further comprises computer readable program code configured to provide prioritized random work stealing across places where a probability of stealing activities from closer places is higher than a probability of stealing from farther places.
7. The apparatus according to claim 1, wherein the computer readable program code further comprises computer readable program code configured to provide space scheduling utilizing one or more of intra-place work stealing and remote work pushing for general computations.
8. The apparatus according to claim 1 , wherein the computer readable program code further comprises computer readable program code configured to provide bounded space scheduling for terminally strict computations.
9. The apparatus according to claim 8, wherein the bounded space scheduling further comprises depth-based priority of activities/threads.
10. The apparatus according to claim 1 , wherein the computer readable program code further comprises computer readable program code configured to provide a multiprogram mode wherein a kernel scheduler changes resources available to a job based on utilization.
11. A method comprising:
utilizing one or more processors to execute a program of instructions tangibly embodied in a program storage device, the program of instructions comprising:
computer readable program code configured to provide distributed scheduling of activities for workers at one or more places, the one or more places each comprising one or more processors having shared memory;
wherein to provide distributed scheduling further comprises permitting activities with higher depth on a path in a computation tree to execute to completion before activities with lower depth on the same path;
wherein each worker at a place maintains a priority queue that accords higher priority to activities with higher depth, and a record of rejected attempts to spawn activities at another place;
wherein each place maintains a fresh activity buffer containing activities spawned by remote places; and
wherein each place maintains a list of workers whose spawned activities are rejected at that place.
12. The method according to claim 11, wherein to provide distributed scheduling further comprises providing distributed scheduling for terminally strict multi-place computations.
13. The method according to claim 11 , wherein to provide distributed scheduling further comprises providing hierarchical scheduling, wherein the hierarchical scheduling involves scheduling within a place and across places.
14. The method according to claim 11 , wherein the program of instructions further comprises computer readable program code configured to provide scheduling for hybrid parallel computations including anyplace activities and activities.
15. The method according to claim 11, wherein the program of instructions further comprises computer readable program code configured to provide anyplace activities in a parallel computation to enable automatic load-balancing across places.
16. The method according to claim 11 , wherein the program of instructions further comprises computer readable program code configured to provide prioritized random work stealing across places where a probability of stealing activities from closer places is higher than a probability of stealing from farther places.
17. The method according to claim 11, wherein the computer readable program code further comprises computer readable program code configured to provide space scheduling utilizing one or more of intra-place work stealing and remote work pushing for general computations.
18. The method according to claim 11 , wherein the program of instructions further comprises computer readable program code configured to provide bounded space scheduling for terminally strict computations.
19. The method according to claim 11 , wherein the program of instructions further comprises computer readable program code configured to provide a multiprogram mode wherein a kernel scheduler changes resources available to a job based on utilization.
20. A computer program product comprising:
a computer readable storage medium having computer readable program code embodied therewith, the computer readable program code comprising:
computer readable program code configured to provide distributed scheduling of multi-place computations for one or more places, the one or more places each comprising one or more processors having shared memory;
wherein to provide distributed scheduling of multi-place computations further comprises providing distributed scheduling for multithreaded computations with, using a combination of intra-place work-stealing for load balancing and remote work pushing across places for preserving affinity; and
wherein a fresh activity buffer is implemented as a concurrent deque used for keeping new activities spawned from remote places.
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title
US12/607,497 US8959525B2 (en)  2009-10-28  2009-10-28  Systems and methods for affinity driven distributed scheduling of parallel computations
Applications Claiming Priority (3)
Application Number  Priority Date  Filing Date  Title
US12/607,497 US8959525B2 (en)  2009-10-28  2009-10-28  Systems and methods for affinity driven distributed scheduling of parallel computations
JP2010224913A JP5756271B2 (en)  2009-10-28  2010-10-04  Apparatus, method, and computer program for affinity-driven distributed scheduling of parallel computing (system and method for affinity-driven distributed scheduling of parallel computing)
CN2010105235306A CN102053870A (en)  2009-10-28  2010-10-28  Systems and methods for affinity driven distributed scheduling of parallel computations
Publications (2)
Publication Number  Publication Date
US20110099553A1 (en)  2011-04-28
US8959525B2 (en)  2015-02-17
Family
ID=43899487
Family Applications (1)
Application Number  Status  Priority Date  Filing Date  Title
US12/607,497 US8959525B2 (en)  Active, expires 2033-05-05  2009-10-28  2009-10-28  Systems and methods for affinity driven distributed scheduling of parallel computations
Country Status (3)
Country  Link
US (1)  US8959525B2 (en)
JP (1)  JP5756271B2 (en)
CN (1)  CN102053870A (en)
Cited By (4)
Publication number  Priority date  Publication date  Assignee  Title 

US20150347625A1 (en) *  20140529  20151203  Microsoft Corporation  Estimating influence using sketches 
US9454373B1 (en)  20151210  20160927  International Business Machines Corporation  Methods and computer systems of software level superscalar outoforder processing 
US10152349B1 (en) *  20160927  20181211  Juniper Networks, Inc.  Kernel scheduling based on precedence constraints and/or artificial intelligence techniques 
US10255122B2 (en) *  20141218  20190409  Intel Corporation  Function callback mechanism between a Central Processing Unit (CPU) and an auxiliary processor 
Families Citing this family (16)
Publication number  Priority date  Publication date  Assignee  Title 

US7840914B1 (en) *  2005-05-13  2010-11-23  Massachusetts Institute Of Technology  Distributing computations in a parallel processing environment
US20110276966A1 (en) *  2010-05-06  2011-11-10  Arm Limited  Managing task dependency within a data processing system
US8483735B2 (en) *  2010-08-26  2013-07-09  Telefonaktiebolaget L M Ericsson (Publ)  Methods and apparatus for parallel scheduling of frequency resources for communication nodes
US9262228B2 (en) *  2010-09-23  2016-02-16  Microsoft Technology Licensing, Llc  Distributed workflow in loosely coupled computing
US9158592B2 (en) *  2011-05-02  2015-10-13  Green Hills Software, Inc.  System and method for time variant scheduling of affinity groups comprising processor core and address spaces on a synchronized multicore processor
US20120304192A1 (en) *  2011-05-27  2012-11-29  International Business Machines  Lifeline-based global load balancing
US9569400B2 (en)  2012-11-21  2017-02-14  International Business Machines Corporation  RDMA-optimized high-performance distributed cache
US9378179B2 (en)  2012-11-21  2016-06-28  International Business Machines Corporation  RDMA-optimized high-performance distributed cache
US9332083B2 (en)  2012-11-21  2016-05-03  International Business Machines Corporation  High performance, distributed, shared, data grid for distributed Java virtual machine runtime artifacts
EP2759933A1 (en)  2013-01-28  2014-07-30  Fujitsu Limited  A process migration method, computer system and computer program
US9628399B2 (en) *  2013-03-14  2017-04-18  International Business Machines Corporation  Software product instance placement
CN107092573A (en) *  2013-03-15  2017-08-25  Intel Corporation  Work stealing in heterogeneous computing systems
US9600327B2 (en) *  2014-07-10  2017-03-21  Oracle International Corporation  Process scheduling and execution in distributed computing environments
US20170068675A1 (en) *  2015-09-03  2017-03-09  Deep Information Sciences, Inc.  Method and system for adapting a database kernel using machine learning
US9891955B2 (en) *  2015-12-22  2018-02-13  Nxp Usa, Inc.  Heterogenous multicore processor configuration framework
US10671436B2 (en) *  2018-05-02  2020-06-02  International Business Machines Corporation  Lazy data loading for improving memory cache hit ratio in DAG-based computational system
Citations (14)
Publication number  Priority date  Publication date  Assignee  Title 

US5872972A (en)  1996-07-05  1999-02-16  Ncr Corporation  Method for load balancing a per processor affinity scheduler wherein processes are strictly affinitized to processors and the migration of a process from an affinitized processor to another available processor is limited
US6105053A (en) *  1995-06-23  2000-08-15  Emc Corporation  Operating system for a non-uniform memory access multiprocessor system
US6292822B1 (en)  1998-05-13  2001-09-18  Microsoft Corporation  Dynamic load balancing among processors in a parallel computer
US6823351B1 (en)  2000-05-15  2004-11-23  Sun Microsystems, Inc.  Work-stealing queues for parallel garbage collection
WO2005116832A1 (en)  2004-05-31  2005-12-08  International Business Machines Corporation  Computer system, method, and program for controlling execution of job in distributed processing environment
JP2006502457A (en)  2001-10-31  2006-01-19  Hewlett-Packard Company  Method and system for offloading the execution and resources of a device having constraints on networked resources
US7103887B2 (en)  2001-06-27  2006-09-05  Sun Microsystems, Inc.  Load-balancing queues employing LIFO/FIFO work stealing
US7346753B2 (en)  2005-12-19  2008-03-18  Sun Microsystems, Inc.  Dynamic circular work-stealing deque
US7363438B1 (en)  2004-11-05  2008-04-22  Sun Microsystems, Inc.  Extendable memory work-stealing
WO2008077267A1 (en)  2006-12-22  2008-07-03  Intel Corporation  Locality optimization in multiprocessor systems
US20080178183A1 (en)  2004-04-29  2008-07-24  International Business Machines Corporation  Scheduling Threads In A Multi-Processor Computer
US20080244588A1 (en)  2007-03-28  2008-10-02  Massachusetts Institute Of Technology  Computing the processor desires of jobs in an adaptively parallel scheduling environment
US20090031318A1 (en)  2007-07-24  2009-01-29  Microsoft Corporation  Application compatibility in multicore systems
US20090031317A1 (en)  2007-07-24  2009-01-29  Microsoft Corporation  Scheduling threads in multicore systems
Family Cites Families (3)
Publication number  Priority date  Publication date  Assignee  Title 

US6243788B1 (en) *  1998-06-17  2001-06-05  International Business Machines Corporation  Cache architecture to enable accurate cache sensitivity
US7363468B2 (en) *  2004-11-18  2008-04-22  International Business Machines Corporation  Load address dependency mechanism system and method in a high frequency, low power processor system
CN101246438A (en) *  2008-03-07  2008-08-20  ZTE Corporation  Process and interrupt processing method and device for symmetrical multiprocessing system

2009-10-28  US  US12/607,497  US8959525B2  Active
2010-10-04  JP  JP2010224913A  JP5756271B2  Active
2010-10-28  CN  CN2010105235306A  CN102053870A  Pending
Non-Patent Citations (10)
Title

Acar et al., "The Data Locality of Work Stealing", SPAA '00, Proceedings of the 12th Annual ACM Symposium on Parallel Algorithms and Architectures, Jul. 9-12, 2000, pp. 1-12, Bar Harbor, Maine, USA.
Agarwal et al., "Deadlock-Free Scheduling of X10 Computations with Bounded Resources", SPAA '07, Proceedings of the 19th Annual ACM Symposium on Parallel Algorithms and Architectures, Jun. 9-11, 2007, pp. 229-240, San Diego, California, USA.
Blumofe et al., "Scheduling Multithreaded Computations by Work Stealing", Journal of the ACM, Sep. 1999, pp. 720-748, vol. 46, No. 5, ACM, New York, New York, USA.
Dinan, James et al., "A Framework for Global-View Task Parallelism", Proceedings of the International Conference on Parallel Processing, pp. 586-593, 2008, 37th International Conference on Parallel Processing, ICPP 2008.
Faxen, Karl-Filip et al., "Multicore Computing: The State of the Art", Swedish Institute of Computer Science, Dec. 3, 2008.
Gautier, Thierry et al., "KAAPI: A Thread Scheduling Runtime System for Data Flow Computations on Cluster of Multi-Processors", International Conference on Symbolic and Algebraic Computation, Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, London, Ontario, Canada, pp. 15-23, 2007.
Motohashi, Takeshi et al., "An Activity-Based Parallel Execution Mechanism Using Distributed Activity Queues", Journal of Information Processing Society of Japan, Japan, Information Processing Society of Japan, Oct. 1994, pp. 2128-2137, vol. 35, No. 10, NII-Electronic Library Service.
Saito, Hideo et al., "Locality-aware Connection Management and Rank Assignment for Wide-area MPI", Journal of Information Processing Society of Japan, Japan, Information Processing Society of Japan, May 23, 2007, pp. 44-45, vol. 48, No. SIG 18, NII-Electronic Library Service.
Saito, Hideo et al., "Locality-Aware Connection Management and Rank Assignment for Wide-Area MPI", Principles and Practice of Parallel Programming (PPoPP'07), Mar. 14-17, 2007, San Jose, CA, USA, pp. 150-151, ACM Digital Library.
Videau, Brice et al., "PaSTeL: Parallel Runtime and Algorithms for Small Datasets", Institut National de Recherche en Informatique et en Automatique, No. 6650, Sep. 2008.
Cited By (8)
Publication number  Priority date  Publication date  Assignee  Title 

US20150347625A1 (en) *  2014-05-29  2015-12-03  Microsoft Corporation  Estimating influence using sketches
US9443034B2 (en) *  2014-05-29  2016-09-13  Microsoft Technology Licensing, Llc  Estimating influence using sketches
US20160350382A1 (en) *  2014-05-29  2016-12-01  Microsoft Technology Licensing, Llc  Estimating influence using sketches
US10255122B2 (en) *  2014-12-18  2019-04-09  Intel Corporation  Function callback mechanism between a Central Processing Unit (CPU) and an auxiliary processor
US10706496B2 (en)  2014-12-18  2020-07-07  Intel Corporation  Function callback mechanism between a Central Processing Unit (CPU) and an auxiliary processor
US9454373B1 (en)  2015-12-10  2016-09-27  International Business Machines Corporation  Methods and computer systems of software level superscalar out-of-order processing
US10152349B1 (en) *  2016-09-27  2018-12-11  Juniper Networks, Inc.  Kernel scheduling based on precedence constraints and/or artificial intelligence techniques
US10748067B2 (en) *  2016-09-27  2020-08-18  Juniper Networks, Inc.  Kernel scheduling based on precedence constraints and/or artificial intelligence techniques
Also Published As
Publication number  Publication date 

JP5756271B2 (en)  2015-07-29
JP2011096247A (en)  2011-05-12
CN102053870A (en)  2011-05-11
US20110099553A1 (en)  2011-04-28
Similar Documents
Publication  Publication Date  Title 

US8959525B2 (en)  Systems and methods for affinity driven distributed scheduling of parallel computations
Drozdowski  Scheduling for parallel processing
Turilli et al.  A comprehensive perspective on pilot-job systems
Tang et al.  A stochastic scheduling algorithm for precedence constrained tasks on grid
KR20180073669A (en)  Stream-based accelerator processing of computed graphs
US20070226735A1 (en)  Virtual vector processing
WO2012028213A1 (en)  Rescheduling workload in a hybrid computing environment
Lee et al.  Orchestrating multiple data-parallel kernels on multiple devices
Muller et al.  Latency-hiding work stealing: Scheduling interacting parallel computations with work stealing
Qureshi et al.  Grid resource allocation for real-time data-intensive tasks
Nguyen et al.  Demand-based scheduling priorities for performance optimisation of stream programs on parallel platforms
Memeti et al.  A review of machine learning and meta-heuristic methods for scheduling parallel computing systems
Capannini et al.  A job scheduling framework for large computing farms
Tan et al.  A MapReduce based framework for heterogeneous processing element cluster environments
Wang et al.  A general and fast distributed system for large-scale dynamic programming applications
León  mpibind: A memory-centric affinity algorithm for hybrid applications
Mei et al.  Real-Time stream processing in Java
You et al.  Task Scheduling Algorithm in GRID Considering Heterogeneous Environment.
KR20110046296A (en)  Affinity driven distributed scheduling system and method of parallel computation
Ejarque et al.  A hierarchic task-based programming model for distributed heterogeneous computing
Narang et al.  Affinity driven distributed scheduling algorithm for parallel computations
Tarakji et al.  OS support for load scheduling on accelerator-based heterogeneous systems
Narang et al.  Performance driven distributed scheduling of parallel hybrid computations
He  Scheduling in MapReduce Clusters
Hasija et al.  DMMLQ Algorithm for Multilevel Queue Scheduling
Legal Events
Date  Code  Title  Description 

AS  Assignment 
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y  Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: AGARWAL, SHIVALI; NARANG, ANKUR; SHYAMASUNDAR, RUDRAPATNA K.; REEL/FRAME: 023801/0489  Effective date: 2009-11-06

STCF  Information on status: patent grant 
Free format text: PATENTED CASE 

MAFP  Maintenance fee payment 
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551) Year of fee payment: 4 