WO2022013887A1 - Apparatus for implementing dynamic, data-dependent parallelism for task execution based on an execution model


Info

Publication number: WO2022013887A1
Authority: WO (WIPO PCT)
Prior art keywords: hyperop, context, execution, memory, hyperops
Application number: PCT/IN2021/050678
Other languages: French (fr)
Inventors: Madhava Krishna Chembati, Narayan RANJANI
Original assignee: Morphing Machines Pvt. Ltd
Application filed by Morphing Machines Pvt. Ltd filed Critical Morphing Machines Pvt. Ltd
Publication of WO2022013887A1


Classifications

    • G06F9/5038: Allocation of resources (e.g. of the CPU) to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F30/327: Logic synthesis; behaviour synthesis, e.g. mapping logic, HDL to netlist, high-level language to RTL or netlist
    • G06F9/544: Interprogram communication using buffers, shared memory or pipes
    • G06F2115/02: System on chip [SoC] design

Definitions

  • FIG. 1 is a diagram that illustrates a host-device interface in a programming framework of a coprocessor in accordance with an exemplary embodiment of the invention.
  • FIG. 2a and FIG. 2b are diagrams that illustrate a many-core coprocessor architecture with 64 compute elements (CEs) in accordance with an exemplary embodiment of the invention.
  • FIG. 3(a), FIG. 3(b), FIG. 3(c) and FIG. 3(d) are diagrams that collectively illustrate an OpenCL framework programming flow of a coprocessor architecture in accordance with an exemplary embodiment of the invention.
  • FIG. 4 is a diagram that illustrates an apparatus for implementing dynamic, data-dependent parallelism for task execution based on an execution model in accordance with an exemplary embodiment of the invention.
  • FIG. 5(a) and FIG. 5(b) are diagrams that are representations of a hyperOp dependence graph (HDG) in accordance with an exemplary embodiment of the invention.
  • FIG. 6 is a diagram that illustrates an abstract machine for implementing a centralized runtime system in accordance with an exemplary embodiment of the invention.
  • FIG. 7 is a diagram that illustrates an abstract machine for implementing a distributed runtime system in accordance with an exemplary embodiment of the invention.
  • FIG. 8 is a diagram that illustrates execution among two hyperOps in accordance with an exemplary embodiment of the invention.
  • FIG. 9a and FIG. 9b are diagrams that illustrate inter hyperOp communication in accordance with an exemplary embodiment of the invention.
  • FIG. 10 is a diagram illustrating a hyperOp dependence graph (HDG) implementing fork-join parallelism in accordance with an exemplary embodiment of the invention.
  • FIG. 11(a) and FIG. 11(b) are diagrams that illustrate a hierarchical dataflow graph representation in accordance with an exemplary embodiment of the invention.
  • FIG. 12(a) and FIG. 12(b) are diagrams that illustrate an execution graph for a code in accordance with an exemplary embodiment of the invention.
  • FIG. 13(a) and FIG. 13(b) are diagrams that illustrate a hierarchical dataflow graph representation of a code in accordance with an exemplary embodiment of the invention.
  • FIG. 14(a) and FIG. 14(b) are diagrams that illustrate an execution graph for a code in accordance with an exemplary embodiment of the invention.
  • FIG. 15 is a diagram that illustrates an overview of compiler implementation/flow in C with hyperOps in accordance with an exemplary embodiment of the invention.
  • FIG. 1 is a diagram that illustrates a host-device interface in a programming framework of a coprocessor in accordance with an exemplary embodiment of the invention.
  • Referring to FIG. 1, there is shown an abstract representation 100 of the REDEFINE OpenCL programming framework, which includes a host-device interface 102, a REDEFINE resource manager (RRM) 104, host code 106, and kernel code 108.
  • the host-device interface 102 is an interface between the host environment and the device environment realized through the RRM 104.
  • the host side RRM includes an implementation of the host code 106 with OpenCL runtime APIs and the device driver for the REDEFINE many-core accelerator.
  • the device side RRM 104 is a hardware or software module that serves as a gateway between the host processor and the REDEFINE many-core accelerator and includes an implementation of the kernel code 108.
  • FIG. 2a and FIG. 2b are diagrams that illustrate a many-core coprocessor architecture with 64 compute elements (CEs) in accordance with an exemplary embodiment of the invention. Referring to FIG. 2a, there is shown a many-core architecture 200 of REDEFINE that includes an RRM 202, an off-chip memory controller 204, a network of compute nodes 206, access routers 208 and routers 210.
  • the many-core architecture 200 is implemented as the network of compute nodes 206.
  • the RRM 202 enables the host interface to communicate with the network of compute nodes 206 via the off-chip memory controller 204.
  • the access routers 208 and routers 210 route interactions or communications from the host interface to the network of compute nodes 206 via the RRM 202.
  • Referring to FIG. 2b, there is shown an architecture or composition 212 of a single compute node that includes four CEs 214a-214d, private L1-caches for the global memory address space (L1$) 216a-216d, a cache for the context memory address space (CM$) 218, a Distributed Shared Memory (DSM) bank 220, an orchestrator 222 and a router 224.
  • the compute node is organized as a cluster of the four CEs 214a-214d. Each CE is associated with a private L1-cache for the global memory address space (L1$) and communicates with the orchestrator 222.
  • the orchestrator 222 communicates with the cache for context memory address space (CM$) 218.
  • the DSM bank 220 hosts a region of global memory and context memory.
  • FIG. 3(a), FIG. 3(b), FIG. 3(c) and FIG. 3(d) are diagrams that collectively illustrate an OpenCL framework programming flow of a coprocessor architecture in accordance with an exemplary embodiment of the invention. Referring to FIG. 3(a), FIG. 3(b), FIG. 3(c) and FIG. 3(d), there is shown an OpenCL framework programming flow 300 of the REDEFINE coprocessor architecture.
  • Referring to FIG. 3(a), there is shown an application program that includes a host-side program and a device-side (kernel) program or code 302 in the REDEFINE OpenCL framework programming flow 300.
  • the host code uses the OpenCL runtime APIs for device initialization, kernel management, buffer management, sending inputs to the device, and receiving outputs from the device.
  • the device (kernel) code is described using C with hyperOps programming interfaces.
  • Referring to FIG. 3(b), there is shown a process for host code compilation that includes host code 304, a standard C compiler (REDEFINE compiler) 306, an OpenCL runtime library 308 and a host binary 310.
  • the standard C compiler (REDEFINE compiler) 306 compiles the host code 304 to an executable or the host binary 310 by receiving inputs from the OpenCL runtime library 308.
  • Referring to FIG. 3(c), kernel code compilation includes kernel code 312, a REDEFINE compiler 314 and kernel binary 316.
  • the REDEFINE compiler 314 compiles the kernel code 312 to an executable or the kernel binary 316.
  • Referring to FIG. 3(d), there is shown a host-device execution environment that includes the host binary 310, a host processor 318, the kernel binary 316 and a REDEFINE many-core processor 320.
  • the host binary 310 executes on the host processor 318 and the kernel binary 316 executes on the REDEFINE many-core processor 320.
  • FIG. 4 is a diagram that illustrates an apparatus for implementing dynamic, data-dependent parallelism for task execution based on an execution model in accordance with an exemplary embodiment of the invention.
  • an apparatus 400 which includes a memory 402 with two distinct address spaces, namely a global memory 404 and a context memory 406, a plurality of compute resources 408a-408n comprising a plurality of CEs (CE1-CEn), an orchestrator 410, and an execution model 412 which further includes a concurrency model 414, a hyperOp dependence graph (HDG) 416, a memory communication model 418, and a runtime system 420.
  • the apparatus 400 can be a computer system that may be realized in hardware, or a combination of hardware and software that may include, but are not limited to, the memory 402 with two distinct address spaces, the plurality of compute resources 408a-408n, each of the one or more compute resources comprising a plurality of CEs (CE1-CEn), and the orchestrator 410. These components are connected via an interconnect.
  • the apparatus 400 may also include a multi-processor data processing system that may be arranged as an on-chip network with various kinds of nodes that may include, but are not limited to, processors, accelerators, memory and input/output (I/O) devices connected via an interconnect fabric.
  • the execution model 412 is a macro dataflow execution model for implementing dynamic, data-dependent parallelism in the apparatus 400.
  • the execution model 412 is a REDEFINE execution model for implementing dynamic, data-dependent parallelism in a many-core processor technology.
  • the execution model 412 defines a plurality of high-level abstractions that determine the behavior of a plurality of hardware components of the apparatus 400 which is realized by way of the concurrency model 414, the memory communication model 418 and the runtime system 420.
  • the concurrency model 414 defines one or more primitive operations that enable the runtime construction of the HDG 416 that describes one or more application programs.
  • the HDG 416 is a hierarchical dataflow graph that comprises a plurality of hyperOps represented by nodes of the hierarchical dataflow graph and a plurality of directed edges of the hierarchical dataflow graph that connect the plurality of hyperOps.
  • a hyperOp is a multiple-input and multiple-output macro operation, and the plurality of directed edges represent explicit data transfer or execution order requirement between one or more connected hyperOps of the plurality of hyperOps,
  • the HDG 416 is further illustrated in conjunction with FIG. 5(a) and FIG. 5(b).
  • HyperOp types, their relations and data dependencies in the one or more application programs are statically defined in the HDG 416.
  • the one or more application programs are organized as hyperOp computations and hyperOp static metadata.
  • the hyperOp static metadata specifies hyperOp composition and its annotations.
  • the memory communication model 418 defines one or more primitive operations that enable configuring the two distinct address spaces of the memory 402 as the global memory 404 for shared memory communication and the context memory 406 for synchronization.
  • the global memory 404 includes a global memory address space which is shared across the plurality of hyperOps. Accesses to the global memory 404 are data-race free, since they are ordered and synchronized across the plurality of hyperOps through writes to the context memory 406.
  • the context memory 406 comprises a context memory address space which stores one or more context frames. Each hyperOp instance of a hyperOp is associated with a context frame. A hyperOp can perform only writes to the context memory 406.
  • a context frame holds operands of an associated hyperOp instance.
  • a context frame comprises slots that hold operands of a hyperOp instance. These slots are write-once, immutable data structures, and the operand values associated with a hyperOp in a context frame are undefined after being read by the orchestrator 410 during launch of the hyperOp. Only the orchestrator 410 reads the context memory 406.
  • the orchestrator 410 implements the functionality of the runtime system 420.
  • the runtime system 420 is an implicit part of the concurrency model 414.
  • the runtime system 420 defines one or more functionalities that are implemented in the orchestrator 410 to manage execution of one or more hyperOp instances on one or more compute resources of the plurality of compute resources 408a-408n based on the associated hyperOp computations and hyperOp static metadata defined in the HDG 416.
  • Each compute resource includes the plurality of CEs (CE1-CEn) and each hyperOp instance of the one or more hyperOp instances is executed on a CE.
  • the orchestrator 410 may comprise two storage structures, a free-list and a ready-list.
  • the free-list keeps track of unallocated context frames and the ready-list contains addresses of context frames with hyperOps that are ready to execute.
  • the orchestrator 410 reads the context memory 406 and schedules unordered hyperOp instances in parallel and directly loads operands of a hyperOp instance from an associated context frame into a register file of a CE.
  • the plurality of hyperOps include one or more producer hyperOps and one or more consumer hyperOps.
  • a producer hyperOp stores data in the global memory 404 and communicates the address of the data or synchronizes to a corresponding consumer hyperOp through a write to a context frame and the corresponding consumer hyperOp accesses the data through the address received as an operand.
  • the operands are undefined when context frames are created and are defined only once by writes of a producer hyperOp.
  • the orchestrator 410 enforces an execution order between the producer hyperOp and the corresponding consumer hyperOp through writes to a context frame.
  • the orchestrator 410 is responsible for managing the context memory 406 and scheduling of hyperOps. With a large number of CEs and light-weight tasks, a centralized orchestrator with centralized runtime support will limit the overall parallelism speedup.
  • the execution model 412 includes a centralized runtime system that defines one or more functionalities that are implemented in the orchestrator 410 to manage the entire context memory address space and schedule execution of hyperOps on the plurality of CEs (CE1-CEn).
  • An implementation of the centralized runtime system is further illustrated in conjunction with FIG. 6.
  • the execution model 412 includes a decentralized runtime system that defines one or more functionalities to configure the orchestrator 410 as a plurality of orchestrator instances.
  • the context memory address space and the plurality of CEs (CE1-CEn) are partitioned and grouped into a plurality of clusters.
  • Each cluster of the plurality of clusters forms a compute resource.
  • each orchestrator instance of the plurality of orchestrator instances is associated with a compute resource, and schedules execution of hyperOps on the compute resource.
  • An implementation of the decentralized runtime system is further illustrated in conjunction with FIG. 7. Details related to the execution model 412 and its hardware implementation are further illustrated as follows.
  • the context memory 406 is the storage for context frames.
  • Each hyperOp instance is associated with a context frame; supporting dynamic parallelism with a large number of fine-grained tasks involves frequent allocation and de-allocation of frames.
  • the context memory 406 is partitioned into frames of uniform size and the frame management is implemented in hardware.
  • each frame can hold up to 16 operands. Restricting a hyperOp to a maximum of 16 operands does not limit the computational dataset size of the hyperOp to 16 operands.
  • a frame also holds metadata of the hyperOp instance called instance metadata.
  • HyperOpId is the reference (pointer) to the static metadata of the hyperOp that the context frame holds.
  • WaitCount specifies the number of operands yet to be received. A hyperOp is ready to execute when its WaitCount becomes 0.
  • a context frame is associated with only one instance of a hyperOp at any instant.
  • a producer hyperOp must know the frame addresses of its consumer hyperOps in order to deliver the data to the consumer hyperOps.
  • Launching a ready hyperOp onto a CE involves reading the operands from the associated frame and loading them into the CE's register file.
  • a hyperOp can perform writes to any context frame but reads only its associated context frame, through the register file of the CE.
  • the operand slots in a frame are write-once (single-assignment) data structures, and their data is destroyed (deleted) after the read access.
  • the operand slots in a frame have only one writer and one reader. This property simplifies the hardware implementation required for ensuring memory consistency for the context memory 406.
  • a hyperOp is ready for execution as soon as all its operands are available, and when its execution order or synchronization dependencies are satisfied.
  • the execution model 412 includes primitives for adding new nodes (hyperOp instances) and edges (dependencies) to the application execution graph, the HDG 416.
  • the execution model 412 supports dynamic (data-dependent) parallelism and follows non-preemptive scheduling of hyperOps, therefore, cyclic dependencies are forbidden among hyperOp instances.
  • the global memory 404 is the storage for data and code.
  • the code segment includes the hyperOp's instruction sequence and static metadata. HyperOps can perform reads and writes to the global memory 404. Although the context frame size limits hyperOp operands to 16, the dataset size of a hyperOp computation is not restricted to 16 operands, and hyperOps can exchange data through the global memory 404.
  • a producer hyperOp can store data in the global memory 404 and communicate the address of the data to its consumer hyperOp through a context frame. The consumer hyperOp then accesses the data through the address received as an operand. Such accesses to the global memory 404 require synchronization between reading and writing hyperOps which is enforced through context frame writes.
  • the orchestrator 410 manages the context memory 406 and scheduling of hyperOps.
  • the orchestrator 410 includes two storage structures called a free-list and ready-list.
  • the free-list keeps track of unallocated frames and the ready-list contains addresses of the frames with hyperOps that are ready to execute.
  • the functionalities of the orchestrator 410 may include, but are not limited to: allocation and de-allocation of context frames; keeping track of the status of active hyperOps (all updates to the context memory 406 happen through the orchestrator 410); monitoring the status of CEs as either idle (i.e., ready to accept a hyperOp for execution) or busy; and prioritizing ready hyperOps and launching them onto idle CEs.
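  • A minimal sketch of one such scheduling step is given below. The Frame and CE types and all helper functions are hypothetical, introduced only to illustrate the free-list/ready-list behavior described above; this is not the patent's hardware implementation.

    /* Hypothetical sketch of one orchestrator scheduling step. */
    typedef struct Frame Frame;             /* a context frame (assumed type)   */
    typedef struct CE CE;                   /* a compute element (assumed type) */

    extern Frame *ready_list_pop(void);     /* frame whose hyperOp is ready     */
    extern CE    *find_idle_ce(void);       /* NULL when every CE is busy       */
    extern void   launch(CE *ce, Frame *f); /* load PC/RF and start the CE      */

    void orchestrator_step(void) {
        CE *ce = find_idle_ce();
        if (ce == NULL)
            return;                  /* all CEs busy; ready hyperOps wait       */
        Frame *f = ready_list_pop();
        if (f == NULL)
            return;                  /* nothing ready to execute yet            */
        launch(ce, f);               /* operands load straight into the CE's    */
    }                                /* register file from the context frame    */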
  • the abstract machine model assumes that all communications and memory operations are performed instantaneously. In the presence of network delays and memory hierarchy with caches, the orchestrator 410 also manages cache coherence and ensures memory consistency.
  • the execution model 412 assumes a CE as an instruction set processor, and computations of hyperOps are specified as a sequence of instructions.
  • Launching a hyperOp onto a CE involves loading the program-counter (PC) with codePointer of the hyperOp computation and loading register-file (RF) with operands of the hyperOp.
  • PC program-counter
  • RF register-file
  • the CE invalidates the contents of PC and RF, thereby leaving itself in a “clean” state.
  • execution of a hyperOp has side-effects only in terms of writes to the global memory 404 (with data produced by the hyperOp), and writes to the context memory 406 (data that serve as operands to other hyperOp(s), and events that enforce execution order among hyperOps).
  • Load, Store, FAlloc, FBind, FDelete, WriteCM, Sync, CreateInst, and End form the basic instruction set of a CE in the execution model 412.
  • FAlloc allocates n contiguous frames, i.e., an array of n context frames cf[0], ..., cf[n-1], and the destination register r is loaded with the address of cf[0]. n must be less than or equal to 16.
  • the allocated frames are in inactive state, i.e., not associated with any hyperOp instance.
  • Binding (associating) a frame with a hyperOp instance changes the state of the frame from inactive to active.
  • WriteCM writes value v at the context memory address ca, i.e., [ca] := v.
  • the WaitCount associated with the frame that contains ca is decremented by 1. If the WaitCount becomes 0, the frame is added to the ready-list.
  • On End, the CE self-invalidates its PC and RF, and notifies the orchestrator 410 that it is idle.
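  • A sketch of the WriteCM effect on instance metadata follows; the Frame structure and the helper functions are assumptions used for illustration, not the patent's hardware datapath.

    #include <stdint.h>

    typedef uint32_t _CMAddr;                 /* 32-bit context memory address */
    typedef struct Frame { int waitCount; } Frame;

    extern Frame    *frame_of(_CMAddr ca);    /* frame containing address ca   */
    extern uint32_t *slot_of(_CMAddr ca);     /* operand slot at address ca    */
    extern void      ready_list_push(Frame *f);

    void write_cm(_CMAddr ca, uint32_t v) {
        *slot_of(ca) = v;                /* write-once operand slot            */
        Frame *f = frame_of(ca);
        if (--f->waitCount == 0)         /* last outstanding operand or sync?  */
            ready_list_push(f);          /* the hyperOp instance is now ready  */
    }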
  • FIG. 5(a) and FIG. 5(b) are diagrams that are representations of a hyperOp dependence graph (HDG) in accordance with an exemplary embodiment of the invention.
  • Referring to FIG. 5(a) and FIG. 5(b), the application is described as a hierarchical dataflow graph in which vertices represent hyperOps H0, H1, H2, and H3 and edges 502 represent explicit data transfer or execution order requirements between connected hyperOps, where H1 is a recursive hyperOp.
  • the execution model 412 falls under the Dataflow/Control-Flow class.
  • A detailed description of hyperOps and each component is provided in accordance with various embodiments of the invention.
  • HyperOp representation and operation semantics are described in detail as follows.
  • HyperOp is the unit of scheduling.
  • the execution model 412 enforces data- driven scheduling at the level of hyperOps. Within a hyperOp, the execution follows the conventional control flow (program-counter driven) scheduling of instructions.
  • Each hyperOp is represented by its computation and its static metadata.
  • the hyperOp's computation is encoded as a sequence of instructions.
  • the constituents of a hyperOp's static metadata are described as follows:
  • codePointer is the reference (code-pointer) to the instruction sequence that represents the hyperOp computation.
  • arity specifies the number of operands of the hyperOp.
  • Table 1 describes a few annotations.
  • a hyperOp includes special instructions to communicate, synchronize, and spawn other hyperOps. The operation semantics of these special instructions are further described in conjunction with CEs.
  • each instance of a hyperOp is associated with a context frame. Each frame holds operands and instance metadata of its corresponding hyperOp instance. Each hyperOp can have at most 16 operands. Outputs produced during the execution of the hyperOp may serve as inputs (operands) to other hyperOps.
  • a producer hyperOp directly writes operands to the consumer hyperOp’s frame. The consumer hyperOp starts executing as soon as all its operands are available, and its synchronization dependencies are satisfied. Execution of a consumer hyperOp may overlap with the execution of its producer hyperOps.
  • FIG. 6 is a diagram that illustrates an abstract machine for implementing a centralized runtime system in accordance with an exemplary embodiment of the invention.
  • an abstract machine 600 implementing a centralized runtime system, which includes the orchestrator 410, the global memory 404, the context memory 406, and the plurality of CEs (CE1-CEn).
  • the orchestrator 410 schedules ready hyperOps onto compute resources (CR) that include the plurality of CEs (CE1-CEn). Each hyperOp instance is executed on a CE.
  • FIG. 7 is a diagram that illustrates an abstract machine for implementing a distributed runtime system in accordance with an exemplary embodiment of the invention.
  • an abstract machine 700 implementing a distributed runtime, which includes the global memory 404, the context memory 406 partitioned as CM0, CM1, ..., CMn, a plurality of orchestrator instances Orch0, Orch1, ..., Orchn, and the plurality of CEs (CE1-CEn).
  • the context memory 406 and the plurality of CEs are partitioned and grouped into clusters of compute resources (CR0, CR1, ..., CRn) 702a-702n.
  • the context memory 406 is partitioned into multiple banks CM0, ..., CMn.
  • the plurality of orchestrator instances Orch0, Orch1, ..., Orchn scale with the number of CEs.
  • Each orchestrator instance manages CEs and the context memory bank within a CR.
  • the ready-list and free-list of the orchestrator instances hold frames that belong to the context memory bank of the same CR.
  • each orchestrator instance can allocate and de-allocate frames that belong to the context memory bank of the same CR.
  • the effect is that an FAlloc or CreateInst instruction executed in a cluster will allocate a frame that belongs to the same CR, and each orchestrator instance schedules hyperOps onto CEs within the same CR.
  • the abstraction of a single address space for the context memory 406 is preserved, which allows a hyperOp executing in one CR to write to context frames of another CR.
  • the bit field representation of a context memory address is as follows:
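  • The original bit-field figure is not reproduced in this extraction. The sketch below is a plausible decomposition, consistent with the partitioned per-CR banks and the 16 operand slots per frame described in this section; the exact field widths are assumptions.

    /* Assumed decomposition of a 32-bit context memory address: high bits
     * select the CR (bank), middle bits select the frame within the bank,
     * and the low 4 bits select one of the 16 operand slots. */
    #define CM_CR_ID(ca)    ((ca) >> 24)            /* CR (bank) index  */
    #define CM_FRAME_ID(ca) (((ca) >> 4) & 0xFFFFF) /* frame within bank */
    #define CM_SLOT_ID(ca)  ((ca) & 0xF)            /* operand slot 0-15 */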
  • the allocated frames are in inactive state, i.e., not associated with any hyperOp instance.
  • FBind and FDelete instructions are used to bind a hyperOp instance to a remote frame and to remove a remote frame respectively. With the number of CRs known, the work can be distributed across CRs using RFAlloc and FBind instructions.
  • the execution model 412 prescribes a weak ordering memory model that guarantees sequential consistency for data-race-free programs.
  • a weak ordering model relies on explicit synchronization to avoid data races.
  • a data race occurs when at least two unordered memory operations are accessing the same memory location, and at least one of the operations is a write.
  • WriteCM and Sync instructions are used for synchronizing accesses to the global memory 404.
  • the memory communication model 418 guarantees that all accesses to the global memory 404 that come before a synchronization operation (WriteCM or Sync) in the producer hyperOp's program order are observed by the (synchronizing) consumer hyperOp.
  • In conventional models, synchronization enforces an order among shared variable accesses, whereas in the execution model 412, synchronization enforces an execution order among hyperOps, which in turn imposes an order among shared variable accesses. This does not mean that any two ordered hyperOps enforce an order among their shared variable accesses, as ordered hyperOps can still overlap their execution. Therefore, care must be taken to enforce an order among shared variable accesses, such that the producer hyperOp finishes the shared variable access and then enables the consumer hyperOp. The situation is better illustrated with the example shown in FIG. 8.
  • FIG. 8 is a diagram that illustrates execution among two hyperOps in accordance with an exemplary embodiment of the invention.
  • Referring to FIG. 8, there is shown a producer hyperOp Hp, a consumer hyperOp Hc, and instructions s1, s2, s3, s4 and s5.
  • An execution order among two hyperOps does not guarantee an order among their shared variable accesses.
  • Hp enables Hc by decrementing its synchronization-wait-count using the sync instruction in s2. This creates an execution order between Hp and Hc, such that Hp < Hc.
  • the memory communication model 418 guarantees that s1 < s2, and thus s1 < s4.
  • s3 and s5 are unordered and may create a data race.
  • the execution model 412 assumes that programs are properly synchronized, or data-race-free. With no data races, a program can be reasoned about by executing hyperOps in the order of their producer-consumer relationship and executing the instructions of each hyperOp in program order.
  • C with hyperOps can be used as an intermediate representation for compilers, and also as a low-level language for efficient programmers.
  • the REDEFINE instructions, including WriteCM, Sync, FAlloc, FBind, FDelete, CreateInst, and RFAlloc, are expressed as function calls.
  • a standard C compiler tool chain is extended to lower these function calls to machine instructions. The instructions and their corresponding function interfaces are illustrated in the table below. There is no equivalent function call for the End instruction; it is added implicitly by the compiler.
  • Table 2 provides the execution model 412’s instructions and programming interfaces.
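  • The table itself is not reproduced in this extraction. The declarations below are a hedged reconstruction of the programming interfaces: _writeCM, _fDelete, _rFAlloc and _opAddr appear verbatim later in this document, while the remaining names and all argument lists are assumptions inferred from the instruction semantics described above; the _CMAddr, _Op32, _SMD and _CrId types are described later in this section.

    _CMAddr _fAlloc(unsigned n);                  /* FAlloc: n contiguous frames      */
    _CMAddr _rFAlloc(_CrId cr, unsigned n);       /* RFAlloc: frames on a remote CR   */
    void    _fBind(_CMAddr frId, const _SMD *smd);/* FBind: bind instance to a frame  */
    void    _fDelete(_CMAddr frId);               /* FDelete: release a frame         */
    void    _writeCM(_CMAddr ca, _Op32 v);        /* WriteCM: deliver one operand     */
    void    _sync(_CMAddr ca);                    /* Sync: decrement a wait-count     */
    _CMAddr _createInst(const _SMD *smd);         /* CreateInst: allocate and bind    */
    _CMAddr _opAddr(_CMAddr frId, unsigned opId); /* address of operand opId in frame */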
  • the _CMAddr type is 32 bits in size and holds a context memory address.
  • _SMD is a structure that holds hyperOp static metadata.
  • each hyperOp's computation is expressed as a function.
  • the data transfer between hyperOps is realized through writes to the context memory 406 or through accesses to the global memory 404 using pointers that are explicitly communicated.
  • hyperOp functions are of void return type. Function prototype for a hyperOp with two operands is shown below:
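  • The prototype itself is not reproduced in this extraction; the sketch below is a hedged reconstruction whose attribute spelling and parameter names follow the description of op0, op1 and selfId in the next bullets.

    /* Assumed prototype of a two-operand hyperOp function. */
    __attribute__((hyperOp))
    void myHyperOp(_Op32 op0, _Op32 op1, _CMAddr selfId);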
  • the custom function attribute hyperOp is added to specify that it is a hyperOp function.
  • op0 and op1 are operands of the hyperOp and selfId is the address of the context frame associated with the hyperOp instance. The argument selfId is not counted as an operand to the hyperOp.
  • Function prototype of a hyperOp with zero operands is shown below:
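  • A hedged reconstruction, on the same assumptions as above:

    /* Assumed prototype of a zero-operand hyperOp function:
     * only the context-frame address selfId is passed. */
    __attribute__((hyperOp))
    void myHyperOp0(_CMAddr selfId);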
  • _Op32 can hold any data type that fits in 32 bits.
  • _Op32 is defined as a union data type as shown below:
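  • The union body is not reproduced in this extraction; the member names below are assumptions, chosen to cover the 32-bit value kinds used elsewhere in this document (integers, pointers into global memory, and context memory addresses).

    typedef union {
        int32_t  i;       /* signed 32-bit integer      */
        uint32_t u;       /* unsigned 32-bit integer    */
        float    f;       /* single-precision float     */
        void    *ptr;     /* pointer into global memory */
        _CMAddr  cmAddr;  /* context memory address     */
    } _Op32;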
  • each hyperOp code (hyperOp function) is associated with static metadata, defined as type struct _SMD as shown below:
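  • The field names ann, arity and fptr are confirmed by the next two bullets; the field types in this sketch are assumptions.

    typedef struct _SMD {
        uint16_t ann;      /* hyperOp annotations (refer to Table 1) */
        uint16_t arity;    /* operand count                          */
        void   (*fptr)();  /* pointer to the hyperOp function        */
    } _SMD;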
  • in struct _SMD, the ann field holds the hyperOp annotations (refer to Table 1).
  • the arity field holds the hyperOp’s operand count and fptr holds the hyperOp’s function pointer.
  • the variables that hold static metadata are constants, declared with const qualifier.
  • the annotation ANN_END denotes the End-hyperOp (referring to Table 1).
  • FIG. 9a and FIG. 9b are diagrams that illustrate inter hyperOp communication in accordance with an exemplary embodiment of the invention.
  • FIG. 9a is a diagram that illustrates data transfer through the context memory 406. Referring to FIG. 9a, there is shown a producer hyperOp 902, a consumer hyperOp 904 and a code snippet 906 for data transfer through the context memory 406.
  • FIG. 9b is a diagram that illustrates data transfer through the global memory 404.
  • Referring to FIG. 9b, there is shown the global memory 404, the producer hyperOp 902, the consumer hyperOp 904 and process steps 908, 910 and 912.
  • the producer hyperOp 902 updates array in the global memory 404.
  • the code shown in the code snippet 906 enables data transfer between the producer hyperOp 902 and the consumer hyperOp 904 through the context memory 406.
  • the consumer hyperOp 904 loads the array from the global memory 404.
  • the producer hyperOp (lines 4-9) transfers a scalar value to consumer hyperOp (lines 11-14). Before transferring the scalar value, the producer hyperOp needs to know the context-frame address of its consumer hyperOp.
  • the global variable consumerAddr (line 2) holds the context-frame address of consumer hyperOp instance.
  • the consumer hyperOp takes only one operand named v$ (line 11) and this operand gets the 0th operand slot in its context frame.
  • the producer hyperOp assigns the 0th operand of the consumer hyperOp with a scalar value x, using the _writeCM instruction.
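  • A minimal sketch in the spirit of the snippet described above (the line numbers cited for Listing 1 will not match this sketch, and the computation is illustrative):

    _CMAddr consumerAddr;                 /* context-frame address of consumer */

    __attribute__((hyperOp))
    void producer(_Op32 x, _CMAddr selfId) {
        _Op32 v;
        v.i = x.i;                             /* the scalar value to transfer */
        _writeCM(_opAddr(consumerAddr, 0), v); /* write operand slot 0 directly*/
    }

    __attribute__((hyperOp))
    void consumer(_Op32 v, _CMAddr selfId) {
        /* v arrives as operand 0, written by the producer above */
    }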
  • the method of communicating an array of values between two hyperOps is provided in Listing 2.
  • the producer hyperOp (lines 5-13) transfers an array of values to consumer hyperOp (lines 15-22) through global variable data (line 2).
  • the variable consumerAddr (line 3) holds the context-frame address of the consumer hyperOp instance.
  • the producer hyperOp first writes the data that needs to be communicated to the consumer to the global memory 404 (lines 7-9) and then communicates the address of the data to the consumer hyperOp (line 11).
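  • A minimal sketch of this pattern (hedged; it does not reproduce Listing 2's exact lines):

    int data[32];                     /* shared buffer in global memory    */
    _CMAddr consumerAddr;             /* context-frame address of consumer */

    __attribute__((hyperOp))
    void arrayProducer(_Op32 n, _CMAddr selfId) {
        for (int i = 0; i < 32; i++)
            data[i] = i;                       /* write payload to global memory */
        _Op32 p;
        p.ptr = data;
        _writeCM(_opAddr(consumerAddr, 0), p); /* publish the address; this write */
    }                                          /* also orders the stores above    */

    __attribute__((hyperOp))
    void arrayConsumer(_Op32 p, _CMAddr selfId) {
        int *a = (int *)p.ptr;       /* safe to read: the context-frame write */
        int sum = 0;                 /* synchronized the global memory stores */
        for (int i = 0; i < 32; i++)
            sum += a[i];
    }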
  • Listing 3 shows code for realizing fork-join model of parallelism.
  • FIG. 10 is a diagram illustrating a hyperOp dependence graph (HDG) implementing fork-join parallelism in accordance with an exemplary embodiment of the invention.
  • Referring to FIG. 10, there is shown an HDG 1000 described in the code of Listing 3 implementing fork-join parallelism, which includes a master hyperOp 1002, a plurality of worker hyperOp instances 1004a-1004n represented as worker0, worker1, ..., workern, and a join hyperOp 1006.
  • the code in Listing 3 contains three types of hyperOps representing the functions: the plurality of worker hyperOp instances 1004a-1004n (lines 1-8), the join hyperOp 1006 (lines 10-19), and the master hyperOp 1002 (lines 22-42).
  • the worker, join, and master functions are the hyperOp functions, and smdWorker, smdJoin, and smdMaster are their static metadata, respectively.
  • the function _opAddr(frId, opId) (in lines 29 and 39) returns the context memory address of the operand in the frame frId at index opId.
  • the parallel computation is distributed among the plurality of worker hyperOp instances 1004a- 1004n.
  • the join function in the fork-join model performs the sequential computation that follows the parallel computation.
  • the join hyperOp 1006 waits for the plurality of worker hyperOp instances 1004a-1004n to finish execution. This is realized by annotating the join hyperOp 1006 as ANN_JOIN (line 19) and using the sync instruction (line 6).
  • Each worker hyperOp instance of the plurality of worker hyperOp instances 1004a-1004n finishes execution by synchronizing with the join hyperOp 1006 at line 6.
  • Although the join function does not receive any operands as arguments (line 10), its arity is set to '1' (line 19).
  • the synchronization wait-count for the join hyperOp 1006 (referring to Table 1) is one of the operands for the hyperOp but is not available as its function argument.
  • the master hyperOp 1002 spawns or builds a sub-graph that implements the fork-join pattern of parallelism. New nodes are added to the sub-graph at lines 27 and 36. Edges connecting each worker hyperOp or node to the join hyperOp 1006 or node are created at line 39. Here, creating an edge means specifying a consumer hyperOp's operand address to its producer hyperOp. Operand-15 of the join hyperOp 1006 holds the synchronization wait-count (line 31).
  • the address of the join hyperOp 1006's operand-15 is forwarded to the plurality of worker hyperOp instances 1004a-1004n (line 39).
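  • A compact, hedged sketch of this fork-join pattern follows. It mirrors the structure described above (ANN_JOIN annotation, sync at the workers, wait-count in operand 15), but the helper usage is illustrative and the line numbering of Listing 3 does not apply.

    #define NWORKERS 4
    extern const _SMD smdWorker, smdJoin;   /* static metadata (assumed decls) */

    __attribute__((hyperOp))
    void worker(_Op32 joinSync, _CMAddr selfId) {
        /* ... parallel work ... */
        _sync(joinSync.cmAddr);             /* signal completion to join       */
    }

    __attribute__((hyperOp))
    void join(_CMAddr selfId) {
        /* sequential computation that follows the parallel phase */
    }

    __attribute__((hyperOp))
    void master(_CMAddr selfId) {
        _CMAddr j = _createInst(&smdJoin);  /* spawn the join node             */
        _Op32 cnt; cnt.i = NWORKERS;
        _writeCM(_opAddr(j, 15), cnt);      /* operand 15: sync wait-count     */
        for (int i = 0; i < NWORKERS; i++) {
            _CMAddr w = _createInst(&smdWorker);  /* spawn a worker node       */
            _Op32 e; e.cmAddr = _opAddr(j, 15);
            _writeCM(_opAddr(w, 0), e);     /* edge: worker -> join            */
        }
    }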
  • Listing 4 shows the C code for computing a Fibonacci number using recursion.
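  • The listing itself is not reproduced in this extraction; the sequential recursion it describes is the standard one:

    /* Recursive Fibonacci, in the spirit of Listing 4. */
    int fib(int n) {
        if (n < 2)
            return n;
        return fib(n - 1) + fib(n - 2);
    }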
  • Listing 5 shows the C with hyperOps version of the same Fibonacci kernel.
  • From here on, the C with hyperOps version is referred to as parallel-fib.
  • This parallel-fib code is used as a running example to illustrate certain details of the C with hyperOps programming abstraction.
  • the number of lines in the parallel-fib code is much greater than in the sequential code in Listing 4. This is because, apart from the actual computation, parallel-fib includes instructions (statements) to construct the HDG for the application.
  • the parallelFib function with the qualifier _kernel is the entry function to the kernel. All input and output buffers are allocated by the host code (as per OpenCL semantics) and provided as arguments to this function. In this case, N and fibN are input and output, respectively.
  • the functions named fib, sum, and end, annotated as _hyperOp_ (hyperOp functions), describe the respective hyperOps' computations.
  • the static metadata (SMD) variables associated with these hyperOps are smdFib, smdSum, and smdEnd. Each hyperOp type represents a static computation.
  • the kernel may involve execution of multiple instances of these hyperOp types, except the end-hyperOp (refer to Table 1).
  • An end-hyperOp is one with the SMD annotation ANN_END.
  • end is the end-hyperOp.
  • smdEnd's ann field is assigned ANN_END at line 11.
  • the start-hyperOp (refer to Table 1) is not part of the kernel code.
  • the runtime invokes the start-hyperOp, and the start-hyperOp calls the kernel’s entry function, in this case, parallelFib function. From a kernel programmer’s perspective, parallelFib function can be assumed as a hyperOp with no SMD associated with it.
  • FIG. 11(a) and FIG. 11(b) are diagrams that illustrate a hierarchical dataflow graph representation in accordance with an exemplary embodiment of the invention.
  • Referring to FIG. 11(a) and FIG. 11(b), there is shown a hierarchical dataflow graph representation 1100 of the code in Listing 5, which includes hierarchical hyperOps or nodes, 'start' and 'fib', and leaf hyperOps or nodes, 'sum' and 'end'.
  • the sum node output edge gets bound to the output edge of its parent fib node.
  • Start hyperOp is part of the startup code (C runtime) and calls the kernel entry function, parallelFib.
  • Parallel-fib has dynamic task-parallelism. Thus, its execution graph is input data dependent.
  • FIG. 12(a) and FIG. 12(b) are diagrams that illustrate an execution graph for a code in accordance with an exemplary embodiment of the invention.
  • Referring to FIG. 12(a), there is shown a control dependence (parent-child) graph that depicts the parent-child relationship or control dependence in the kernel. Only hierarchical nodes have children.
  • Referring to FIG. 12(b), there is shown a data dependence graph that depicts the data dependence or dataflow in the kernel.
  • the vertices represent hyperOp instances, and the directed edges represent dependencies between the connected hyperOp instances. Since parallel-fib has dynamic parallelism, the kernel's control and data dependence DAG is input dependent. The parallel-fib execution graph 1200 shown is for computing the 3rd Fibonacci number.
  • the execution graph 1200 shows all the hyperOp types and their relations in parallel-fib code as a hierarchical dataflow graph in two different representations.
  • the execution graph 1200 comprises nodes, and each node represents a hyperOp instance of the type specified inside the node.
  • Node 'fib(i)' represents an instance of the fib hyperOp with input i.
  • a programmer statically defines the hyperOp types and their relations or control and data dependencies. Dynamic instances of hyperOps may get generated at runtime, and the dependencies between the dynamic instances are the same as the statically defined relations.
  • a hierarchical node or hyperOp contains a child graph, and the child graph itself can have hierarchical nodes and leaf nodes.
  • a hierarchical node may include instructions to create new edges connecting the child nodes and instructions to bind its output edge(s) with one or more child node’s output edge(s).
  • creating edges or binding edges means specifying a consumer hyperOp’s context memory address to the producer hyperOp.
  • the parallelFib function creates a child graph with two nodes: lines 56-57 spawn one instance of fib and one instance of the end hyperOp, and at line 61 an edge is created connecting the output of fib to one of the inputs of end.
  • the fib hyperOp can conditionally create a child graph with three nodes: lines 32-34 spawn two fib and one sum hyperOp instances, lines 40-41 create edges connecting the outputs of the fibs to inputs of sum, and line 44 binds the output edge of the child sum with the output edge of the parent fib.
  • Leaf nodes end and sum do not create any new nodes or edges.
  • In the entire code in Listing 5, only lines 17, 29, 48, 49, 64, and 67 perform data transfers. The remaining part of the code, with REDEFINE-specific instructions, describes the kernel HDG.
  • The following describes the process of executing a kernel on multiple CRs.
  • computational resources are organized in a two-level hierarchy as CEs and CRs.
  • each orchestrator instance schedules hyperOps onto CEs within the same CR.
  • the execution model 412 provides primitives, RFAlloc and FBind, to delegate a hyperOp from one CR to another CR. Using these primitives, a programmer can express the mapping between hyperOps and the CRs. With no dynamic load balancing support, work (load) distribution is part of the kernel code in the execution model 412.
  • the computational resources or number of CRs for a kernel are statically known or specified at compile time by the user.
  • the computational resources are specified in terms of rectangular regions of CRs, called fabric.
  • the hardware implementation follows a 2D arrangement of CRs.
  • a kernel's resource requirements are specified as dimensions of a rectangular fabric.
  • NUMCR is another useful macro that holds the number of CRs allocated for the kernel, defined as:
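  • The definition is not reproduced in this extraction; assuming the fabric dimensions arrive as -D compiler macros (the macro names here are illustrative, not the patent's), NUMCR would be their product:

    #define NUMCR (FABRIC_ROWS * FABRIC_COLS)   /* CRs allocated to the kernel */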
  • Table 2 shows the _rFAlloc API; one of the arguments for _rFAlloc is the CR id of type _CrId, defined as follows:
  • each CR is uniquely identified by its 2D-index of type _CrId.
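  • A hedged sketch of the type, consistent with the 2D index described above; the field names and widths are assumptions:

    typedef struct {
        uint8_t x;   /* column index within the fabric */
        uint8_t y;   /* row index within the fabric    */
    } _CrId;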
  • the kernel’s start hyperOp is always executed on CR(0,0). Thus a kernel execution starts at CR(0,0) and then gets distributed to the rest of the fabric.
  • Listing 6 shows the C code for adding two arrays of length N.
  • the computation is realized as multiple smaller vector additions of length 32, defined as v32Add function. It is assumed that N is a multiple of 32 and v32Add is the basic unit of work.
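  • A sequential sketch in the spirit of Listing 6 (the element type is an assumption):

    /* Basic unit of work: add two 32-element slices. */
    void v32Add(const int *a, const int *b, int *c) {
        for (int i = 0; i < 32; i++)
            c[i] = a[i] + b[i];
    }

    /* Full addition, assuming N is a multiple of 32. */
    void vecAdd(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i += 32)
            v32Add(a + i, b + i, c + i);
    }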
  • Listing 7 shows a C with hyperOps version of adding two arrays using multiple CRs.
  • With the v32Add hyperOp as the basic unit of work, it implements the work distribution on a fabric of size 2 x 2.
  • Listing 7 shows the vector addition in C with hyperOps, with fabric size 2x2.
  • Fabric size macros are defined in the compiler option as -D _
  • the VecAdd function executes on CR(0,0) and performs the work distribution across the fabric (lines 56-66).
  • _rFAlloc (line 61) and _fBind (line 62) spawn a new hyperOp for each CR. Note that the frames allocated using _rFAlloc are explicitly deleted (garbage collected) using _fDelete (line 38).
  • FIG. 13(a) and FIG. 13(b) are diagrams that illustrate a hierarchical dataflow graph representation of a code in accordance with an exemplary embodiment of the invention.
  • Referring to FIG. 13(a) and FIG. 13(b), there is shown a hierarchical dataflow graph representation 1300 of the code in Listing 7, depicting the crWrk hyperOp (hierarchical node) that encompasses the work allocated for each CR.
  • the kernel implements a two-level hierarchical fork-join parallelism.
  • the parallelism pattern matches the underlying hierarchical organization of CEs.
  • Each node in the execution graph 1400 represents a hyperOp instance of type specified inside the node.
  • the kernel implements fork-join parallelism at two-levels, within a CR and across CRs. Start and end hyperOps are executed on CR(0,0). Each CR executes two v32Add hyperOps.
  • FIG. 15 is a diagram that illustrates an overview of compiler implementation/flow in C with hyperOps in accordance with an exemplary embodiment of the invention.
  • the compiler implementation/flow 1500 includes a modified Clang 1502, a Low Level Virtual Machine (LLVM) Intermediate Representation (IR) optimizer 1504, a modified RISC-V (Reduced Instruction Set Computer-V) backend 1506 and modified RISC-V GNU Binary Utilities (Binutils) 1508.
  • C with hyperOps 1510 is added to the modified Clang 1502.
  • Clang is a compiler front end for the C programming language.
  • LLVM bitcode with REDEFINE intrinsics 1512 are added to the LLVM IR optimizer 1504.
  • XR instructions are added to the modified RISC-V backend 1506 as intrinsics.
  • the modified RISC-V GNU Binutils 1508 is updated to recognize the XR instructions.
  • RISC-V is an open ISA and supports customization.
  • the opcode map for custom-0 of the RISC-V ISA is used for implementing specific instructions of the execution model 412.
  • This custom extension of RISC-V ISA is called XR.
  • Each CE is an in-order single issue 5-stage pipelined RV32IMFXR ISA core.
  • Table 3 shows the XR instruction encodings and illustrates XR, a RISC-V ISA custom extension to support the execution model 412.
  • XR maps to the RISC-V ISA's custom-0 opcode encoding.
  • the present invention is advantageous in that it provides a macro dataflow execution model for parallel execution of macro operations (hyperOps).
  • the execution model is realized on a chip as both hardware and software functionality, and primitives/interfaces are provided for communication between software (HDG) and hardware (Context Memory, Global Memory, Orchestrator, and Compute Elements).
  • Programs are represented as a hierarchical dataflow graph which unfolds dynamically at runtime.
  • the orchestrator functionality may also be realized as distributed/decentralized and enables scheduling and execution of hyperOps on compute resources.
  • Such an execution model provides high performance by efficiently exploiting task parallelism at a finer granularity than conventional multithreading.
  • the present invention may be realized in hardware, or a combination of hardware and software.
  • the present invention may be realized in a centralized fashion, in at least one computer system, or in a distributed fashion, where different elements may be spread across several interconnected computer systems.
  • a computer system or other apparatus/devices adapted to carry out the methods described herein may be suited.
  • a combination of hardware and software may be a general-purpose computer system with a computer program that, when loaded and executed on the computer system, may control the computer system such that it carries out the methods described herein.
  • the present invention may be realized in hardware that comprises a portion of an integrated circuit that also performs other functions.
  • the present invention may also be realized as firmware which forms part of the media rendering device.
  • the present invention may also be embedded in a computer program product, which includes all the features that enable the implementation of the methods described herein, and which when loaded and/or executed on a computer system may be configured to carry out these methods.
  • Computer program, in the present context, means any expression, in any language, code or notation, of a set of instructions intended to cause a system with information processing capability to perform a particular function either directly, or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.

Abstract

Provided is an apparatus for implementing dynamic, data-dependent parallelism for task execution based on a macro dataflow execution model. The execution model defines a plurality of high-level abstractions that determine the behavior of a plurality of hardware components of the apparatus. The execution model defines primitive operations that enable the runtime construction of a hyperOp dependence graph (HDG) for applications. The execution model defines primitive operations that enable configuring two distinct address spaces of a memory as a global memory for shared memory communication and a context memory for synchronization. A runtime system of the execution model defines functionalities that are implemented in an orchestrator to manage execution of unordered hyperOp instances in parallel and directly load operands of a hyperOp instance from an associated context frame into a register file of a compute element (CE).

Description

APPARATUS FOR IMPLEMENTING DYNAMIC, DATA-DEPENDENT
PARALLELISM FOR TASK EXECUTION BASED ON AN EXECUTION
MODEL
FIELD OF THE INVENTION
[0001] The invention generally relates to parallel task execution in a computing environment. Specifically, the invention relates to implementing dynamic, data-dependent parallelism for task execution in a computing environment such as a processor or a many-core coprocessor environment based on a macro dataflow execution model which provides high performance by efficiently exploiting task parallelism at a finer granularity.
BACKGROUND OF THE INVENTION
[0002] Execution models of traditional computing environments or processor technologies sequentially control and orchestrate instructions without any parallelism. With the development of System on a Chip (SoC) architecture, massively parallel and heterogeneous many-core processors were implemented. For such processor technology, dataflow or control flow execution models are used to control and orchestrate instructions and operations in parallel.
[0003] REDEFINE many-core is a co-processor used for accelerating compute intensive parts of an application. It provides high performance by efficiently exploiting task parallelism at a finer granularity than conventional multithreading.
[0004] REDEFINE many-core processor is programmed using a heterogeneous programming environment. A heterogeneous programming environment explicitly separates a program execution into host side execution and device side execution. For example, in a heterogeneous programming framework like OpenCL, the host controls the device side execution through OpenCL runtime Application Programming Interfaces (APIs) and the code that runs on the device is called the kernel.
[0005] To work in conjunction with the above described framework, there is a need for an improved dataflow execution model for providing high performance by efficiently exploiting task parallelism at a finer granularity than conventional multithreading.
SUMMARY OF THE INVENTION
[0006] An apparatus is disclosed for implementing dynamic, data-dependent parallelism based on a macro dataflow execution model as shown in and/or described in connection with, at least one of the figures, as set forth more completely in the claims.
The apparatus can be a computer system that may be realized in hardware, or a combination of hardware and software that may include, but are not limited to, a memory with two distinct address spaces, a plurality of compute resources, each of the one or more compute resources comprising a plurality of compute elements (CEs), and an orchestrator. The execution model defines a plurality of high-level abstractions that determine the behavior of a plurality of hardware components of the apparatus. The execution model includes a concurrency model that defines one or more primitive operations that enable the runtime construction of a hyperOp dependence graph (HDG) that describes one or more application programs. The HDG is a hierarchical dataflow graph which comprises a plurality of hyperOps represented by nodes of the hierarchical dataflow graph and a plurality of directed edges of the hierarchical dataflow graph connect the plurality of hyperOps.
[0008] A hyperOp is a multiple-input and multiple-output macro operation, and the plurality of directed edges represent explicit data transfer or execution order requirement between one or more connected hyperOps of the plurality of hyperOps. The hyperOp types, their relations and data dependencies in the one or more application programs are statically defined. The one or more application programs are organized as hyperOp computations and hyperOp static metadata. The hyperOp static metadata specifies hyperOp composition and its annotations.
[0009] The execution model further includes a memory communication model that defines one or more primitive operations that enable configuring the two distinct address spaces of the memory as a global memory and a context memory. The global memory includes a global memory address space which is shared across the plurality of hyperOps. HyperOps can perform reads and writes to the global memory. Global memory accesses across hyperOps are ordered/synchronized through context frame writes, and all global memory accesses are assumed to be free from data races.
[0010] The context memory includes a context memory address space which stores one or more context frames. Each hyperOp instance of a hyperOp is associated with a context frame. A context frame holds operands of an associated hyperOp instance. A hyperOp can perform writes to any context frame. A hyperOp can read only its frame, that is, a hyperOp can only read its own operands.
[0011] The execution model further includes a runtime system that defines one or more functionalities that are implemented in the orchestrator to manage execution of one or more hyperOp instances on one or more compute resources based on the associated hyperOp computations and hyperOp static metadata defined in the HDG. Each compute resource of the one or more compute resources includes the plurality of CEs, and each hyperOp instance of the one or more hyperOp instances is executed on a CE. The orchestrator schedules unordered hyperOp instances in parallel and directly loads operands of a hyperOp instance from an associated context frame into a register file of a CE.
[0012] All communications among hyperOps are one-sided communications, that is, only a producer hyperOp initiates and completes a communication. Thus, with sufficient parallelism, all communications can overlap with computations. The amount of parallelism that can be exposed by an application is limited only by the size of the context memory.
[0013] These and other features and advantages of the present invention may be appreciated from a review of the following detailed description of the present invention, along with the accompanying figures in which like reference numerals refer to like parts throughout.
BRIEF DESCRIPTION OF THE DRAWINGS
[0014] FIG. 1 is a diagram that illustrates a host-device interface in a programming framework of a coprocessor in accordance with an exemplary embodiment of the invention.
[0015] FIG. 2a and FIG. 2b are diagrams that illustrate a many-core coprocessor architecture with 64 compute elements (CEs) in accordance with an exemplary embodiment of the invention.
[0016] FIG. 3(a), FIG. 3(b), FIG. 3(c) and FIG. 3(d) are diagrams that collectively illustrate an OpenCL framework programming flow of a coprocessor architecture in accordance with an exemplary embodiment of the invention.
[0017] FIG. 4 is a diagram that illustrates an apparatus for implementing dynamic, data-dependent parallelism for task execution based on an execution model in accordance with an exemplary embodiment of the invention.
[0018] FIG. 5(a) and FIG. 5(b) are diagrams that are representations of a hyperOp dependence graph (HDG) in accordance with an exemplary embodiment of the invention.

[0019] FIG. 6 is a diagram that illustrates an abstract machine for implementing a centralized runtime system in accordance with an exemplary embodiment of the invention.
[0020] FIG. 7 is a diagram that illustrates an abstract machine for implementing a distributed runtime system in accordance with an exemplary embodiment of the invention.
[0021] FIG. 8 is a diagram that illustrates execution among two hyperOps in accordance with an exemplary embodiment of the invention.
[0022] FIG. 9a and FIG. 9b are diagrams that illustrate inter hyperOp communication in accordance with an exemplary embodiment of the invention.
[0023] FIG. 10 is a diagram illustrating a hyperOp dependence graph (HDG) implementing fork-join parallelism in accordance with an exemplary embodiment of the invention.
[0024] FIG. 11(a) and FIG. 11(b) are diagrams that illustrate a hierarchical dataflow graph representation in accordance with an exemplary embodiment of the invention.
[0025] FIG. 12(a) and FIG. 12(b) are diagrams that illustrate an execution graph for a code in accordance with an exemplary embodiment of the invention.
[0026] FIG. 13(a) and FIG. 13(b) are diagrams that illustrate a hierarchical dataflow graph representation of a code in accordance with an exemplary embodiment of the invention.
[0027] FIG. 14(a) and FIG. 14(b) are diagrams that illustrate an execution graph for a code in accordance with an exemplary embodiment of the invention.

[0028] FIG. 15 is a diagram that illustrates an overview of compiler implementation/flow in C with hyperOps in accordance with an exemplary embodiment of the invention.
[0029] Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help to improve understanding of embodiments of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
[0030] The following described implementations may be found in the disclosed apparatus for implementing dynamic, data-dependent parallelism for task execution based on a macro dataflow execution model.
[0031] FIG. 1 is a diagram that illustrates a host-device interface in a programming framework of a coprocessor in accordance with an exemplary embodiment of the invention. Referring to FIG. 1, there is shown an abstract representation 100 of REDEFINE OpenCL programming framework, which includes a host-device interface 102, REDEFINE resource manager (RRM) 104, host code 106, and kernel code 108.
[0032] The host-device interface 102 is an interface between the host environment and the device environment realized through the RRM 104. The host side RRM includes an implementation of the host code 106 with OpenCL runtime APIs and the device driver for the REDEFINE many-core accelerator. The device side RRM 104 is a hardware or software module that serves as a gateway between the host processor and the REDEFINE many-core accelerator and includes an implementation of the kernel code 108.

[0033] FIG. 2a and FIG. 2b are diagrams that illustrate a many-core coprocessor architecture with 64 compute elements (CEs) in accordance with an exemplary embodiment of the invention. Referring to FIG. 2a, there is shown a many-core architecture 200 of REDEFINE that includes an RRM 202, an off-chip memory controller 204, a network of compute nodes 206, access routers 208 and routers 210.
[0034] The many-core architecture 200 is implemented as the network of compute nodes 206. The RRM 202 enables the host interface to communicate with the network of compute nodes 206 via the off-chip memory controller 204.
[0035] The access routers 208 and routers 210 route interactions or communications from the host interface to the network of compute nodes 206 via the RRM 202.
[0036] Referring to FIG. 2b, there is shown an architecture or composition 212 of a single compute node that includes four CEs 214a-214d, private L1-caches for global memory address space (L1$) 216a-216d, a cache for context memory address space (CM$) 218, a Distributed Shared Memory (DSM) bank 220, an orchestrator 222 and a router 224.
[0037] The compute node is organized as a cluster of the four CEs 214a-214d. Each CE is associated with a private L1-cache for global memory address space (L1$) and communicates with the orchestrator 222.
[0038] The orchestrator 222 communicates with the cache for context memory address space (CM$) 218.
[0039] The DSM bank 220 hosts a region of global memory and context memory.
[0040] Communications or access requests from the host interface are directed to the DSM bank 220 and to the orchestrator 222 via the router 224.

[0041] FIG. 3(a), FIG. 3(b), FIG. 3(c) and FIG. 3(d) are diagrams that collectively illustrate an OpenCL framework programming flow of a coprocessor architecture in accordance with an exemplary embodiment of the invention. Referring to FIG. 3(a), FIG. 3(b), FIG. 3(c) and FIG. 3(d), there is shown an OpenCL framework programming flow 300 of the REDEFINE coprocessor architecture.
[0042] Referring to FIG. 3(a), there is shown an application program that includes a host-side program and a device-side (kernel) program or code 302 in the REDEFINE OpenCL framework programming flow 300. As per the OpenCL programming model, the host code uses the OpenCL runtime APIs for device initialization, kernel management, buffer management, sending inputs to the device, and receiving outputs from the device. The device (kernel) code is described using C with hyperOps programming interfaces.
[0043] Referring to FIG. 3(b), there is shown a process for host code compilation that includes host code 304, a standard C compiler (REDEFINE compiler) 306, an OpenCL runtime library 308 and a host binary 310.
[0044] The standard C compiler (REDEFINE compiler) 306 compiles the host code 304 to an executable or the host binary 310 by receiving inputs from the OpenCL runtime library 308.
[0045] Referring to FIG. 3(c), there is shown a process for kernel code compilation that includes kernel code 312, a REDEFINE compiler 314 and kernel binary 316.
[0046] The REDEFINE compiler 314 compiles the kernel code 312 to an executable or the kernel binary 316.
[0047] Referring to FIG. 3(d), there is shown a host-device execution environment that includes the host binary 310, a host processor 318, the kernel binary 316 and a REDEFINE many-core processor 320.

[0048] The host binary 310 executes on the host processor 318 and the kernel binary 316 executes on the REDEFINE many-core processor 320.
[0049] FIG. 4 is a diagram that illustrates an apparatus for implementing dynamic, data-dependent parallelism for task execution based on an execution model in accordance with an exemplary embodiment of the invention. Referring to FIG. 4, there is shown an apparatus 400 which includes a memory 402 with two distinct address spaces, namely a global memory 404 and a context memory 406, a plurality of compute resources 408a-408n comprising a plurality of CEs (CE1-CEn), an orchestrator 410, an execution model 412 which further includes a concurrency model 414, a hyperOp dependence graph (HDG) 416, a memory communication model 418, and a runtime system 420.
[0050] The apparatus 400 can be a computer system that may be realized in hardware, or a combination of hardware and software that may include, but are not limited to, the memory 402 with two distinct address spaces, the plurality of compute resources 408a-408n, each of the one or more compute resources comprising a plurality of CEs (CE1-CEn), and the orchestrator 410. These components are connected via an interconnect. The apparatus 400 may also include a multi-processor data processing system that may be arranged as an on-chip network with various kinds of nodes that may include, but are not limited to, processors, accelerators, memory and input/output (I/O) devices connected via an interconnect fabric.
[0051] The execution model 412 is a macro dataflow execution model for implementing dynamic, data-dependent parallelism in the apparatus 400. In an embodiment, the execution model 412 is a REDEFINE execution model for implementing dynamic, data-dependent parallelism in a many-core processor technology. The execution model 412 defines a plurality of high-level abstractions that determine the behavior of a plurality of hardware components of the apparatus 400 which is realized by way of the concurrency model 414, the memory communication model 418 and the runtime system 420.
[0052] The concurrency model 414 defines one or more primitive operations that enable the runtime construction of the HDG 416 that describes one or more application programs. The HDG 416 is a hierarchical dataflow graph that comprises a plurality of hyperOps represented by nodes of the hierarchical dataflow graph and a plurality of directed edges of the hierarchical dataflow graph that connect the plurality of hyperOps.
[0053] A hyperOp is a multiple-input and multiple-output macro operation, and the plurality of directed edges represent explicit data transfer or execution order requirements between one or more connected hyperOps of the plurality of hyperOps. The HDG 416 is further illustrated in conjunction with FIG. 5(a) and FIG. 5(b).
[0054] HyperOp types, their relations and data dependencies in the one or more application programs are statically defined in the HDG 416. The one or more application programs are organized as hyperOp computations and hyperOp static metadata. The hyperOp static metadata specifies hyperOp composition and its annotations.
[0055] The memory communication model 418 defines one or more primitive operations that enable configuring the two distinct address spaces of the memory 402 as the global memory 404 for shared memory communication and the context memory 406 for synchronization.
[0056] The global memory 404 includes a global memory address space which is shared across the plurality of hyperOps. Accesses to the global memory 404 are data-race free, since they are ordered and synchronized across the plurality of hyperOps through writes to the context memory 406.

[0057] The context memory 406 comprises a context memory address space which stores one or more context frames. Each hyperOp instance of a hyperOp is associated with a context frame. A hyperOp can perform only writes to the context memory 406.
[0058] A context frame holds operands of an associated hyperOp instance. A context frame comprises slots that hold operands of a hyperOp instance. These slots are write-once, immutable data structures, and the operand values associated with a hyperOp in a context frame are undefined after they are read by the orchestrator 410 during launch of the hyperOp. Only the orchestrator 410 reads the context memory 406.
[0059] The orchestrator 410 implements the functionality of the runtime system 420. In an embodiment, the runtime system 420 is an implicit part of the concurrency model 414. The runtime system 420 defines one or more functionalities that are implemented in the orchestrator 410 to manage execution of one or more hyperOp instances on one or more compute resources of the plurality of compute resources 408a-408n based on the associated hyperOp computations and hyperOp static metadata defined in the HDG 416.
[0060] Each compute resource includes the plurality of CEs (CE1-CEn) and each hyperOp instance of the one or more hyperOp instances is executed on a CE.
[0061] The orchestrator 410 may comprise two storage structures, a free-list and a ready-list. The free-list keeps track of unallocated context frames and the ready-list contains addresses of context frames with hyperOps that are ready to execute.
[0062] The orchestrator 410 reads the context memory 406 and schedules unordered hyperOp instances in parallel and directly loads operands of a hyperOp instance from an associated context frame into a register file of a CE.
[0063] In accordance with an embodiment, the plurality of hyperOps include one or more producer hyperOps and one or more consumer hyperOps. A producer hyperOp stores data in the global memory 404 and communicates the address of the data or synchronizes to a corresponding consumer hyperOp through a write to a context frame, and the corresponding consumer hyperOp accesses the data through the address received as an operand. The operands are undefined when context frames are created and are defined only once by writes of a producer hyperOp.
[0064] The orchestrator 410 enforces an execution order between the producer hyperOp and the corresponding consumer hyperOp through writes to a context frame.
[0065] The orchestrator 410 is responsible for managing the context memory 406 and scheduling of hyperOps. With a large number of CEs and light-weight tasks, a centralized orchestrator with centralized runtime support will limit the overall parallelism speedup.
[0066] In accordance with an embodiment, the execution model 412 includes a centralized runtime system that defines one or more functionalities that are implemented in the orchestrator 410 to manage the entire context memory address space and schedule execution of hyperOps on the plurality of CEs (CE1-CEn). An implementation of the centralized runtime system is further illustrated in conjunction with FIG. 6.
[0067] In accordance with another embodiment, the execution model 412 includes a decentralized runtime system that defines one or more functionalities to configure the orchestrator 410 as a plurality of orchestrator instances. The context memory address space and the plurality of CEs (CE1-CEn) are partitioned and grouped into a plurality of clusters. Each cluster of the plurality of clusters forms a compute resource, and each orchestrator instance of the plurality of orchestrator instances is associated with a compute resource, and schedules execution of hyperOps on the compute resource. An implementation of the decentralized runtime system is further illustrated in conjunction with FIG. 7.

[0068] Details related to the execution model 412 and its hardware implementation are further illustrated as follows.
[0069] The context memory 406 is the storage for context frames. Each hyperOp instance is associated with a context frame, and supporting dynamic parallelism with a large number of fine-grained tasks involves frequent allocation and de-allocation of frames. In order to perform allocation and de-allocation of the context frames with low runtime overhead, the context memory 406 is partitioned into frames of uniform size and the frame management is implemented in hardware. In this implementation, each frame can hold up to 16 operands. Restricting the maximum operands of a hyperOp to 16 does not limit the computational dataset size of hyperOps. Apart from operands, a frame also holds metadata of the hyperOp instance called instance metadata.
[Table shown as an image in the original document: context frame layout with 16 operand slots and the instance metadata fields HyperOpId and WaitCount.]
• HyperOpId is the reference (pointer) to the static metadata of the hyperOp that the context frame holds.
• WaitCount specifies the number of operands yet to be received. A hyperOp is ready to execute when its WaitCount becomes 0.
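The description above implies a frame layout along the following lines. This is only a hedged sketch; the field widths, names, and ordering are assumptions, not taken from the patent:

    #include <stdint.h>

    /* Hedged sketch of a context frame: 16 write-once operand slots
     * plus the instance metadata described above. */
    typedef struct {
        uint32_t operands[16]; /* write-once, single-assignment operand slots */
        uint32_t hyperOpId;    /* reference to the hyperOp's static metadata */
        uint32_t waitCount;    /* operands yet to arrive; ready when it reaches 0 */
    } ContextFrame;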
[0070] A context frame is associated with only one instance of a hyperOp at any instant. A producer hyperOp must know the frame addresses of its consumer hyperOps in order to deliver the data to the consumer hyperOps. Launching a ready hyperOp onto a CE involves reading the operands from the associated frame and loading them into the CE's register file. HyperOps can perform writes to any context frame but read only their own associated context frames, through the register file of the CE. The operand slots in a frame are write-once or single-assignment data structures, and their data is destroyed (deleted) after the read access. Thus, the operand slots in a frame have only one writer and one reader. This property simplifies the hardware implementation required for ensuring memory consistency for the context memory 406.
[0071] A hyperOp is ready for execution as soon as all its operands are available and its execution order or synchronization dependencies are satisfied. Apart from the arithmetic, control, and memory load and store instructions, the execution model 412 includes primitives for adding new nodes (hyperOp instances) and edges (dependencies) to the application execution graph, the HDG 416. Thus, the execution model 412 supports dynamic (data-dependent) parallelism and follows non-preemptive scheduling of hyperOps; therefore, cyclic dependencies are forbidden among hyperOp instances.
[0072] The global memory 404 is the storage for data and code. The code segment includes each hyperOp's instruction sequence and static metadata. HyperOps can perform reads and writes to the global memory 404. Although the context frame size limits hyperOp operands to 16, the dataset size of a hyperOp computation is not restricted, and hyperOps can exchange data through the global memory 404. A producer hyperOp can store data in the global memory 404 and communicate the address of the data to its consumer hyperOp through a context frame. The consumer hyperOp then accesses the data through the address received as an operand. Such accesses to the global memory 404 require synchronization between reading and writing hyperOps, which is enforced through context frame writes.
[0073] The orchestrator 410 manages the context memory 406 and scheduling of hyperOps. The orchestrator 410 includes two storage structures called the free-list and the ready-list. The free-list keeps track of unallocated frames and the ready-list contains addresses of the frames with hyperOps that are ready to execute. The functionalities of the orchestrator 410 may include, but are not limited to, allocation and de-allocation of context frames, keeping track of the status of active hyperOps (all updates to the context memory 406 happen through the orchestrator 410), monitoring the status of CEs as either idle (i.e., ready to accept a hyperOp for execution) or busy, and prioritizing ready hyperOps and launching them onto idle CEs. The abstract machine model assumes that all communications and memory operations are performed instantaneously. In the presence of network delays and a memory hierarchy with caches, the orchestrator 410 also manages cache coherence and ensures memory consistency.
[0074] The execution model 412 assumes a CE as an instruction set processor, and computations of hyperOps are specified as a sequence of instructions. Launching a hyperOp onto a CE involves loading the program-counter (PC) with the codePointer of the hyperOp computation and loading the register-file (RF) with operands of the hyperOp. After executing the hyperOp, the CE invalidates the contents of PC and RF, thereby leaving itself in a "clean" state. Thus, execution of a hyperOp has side-effects only in terms of writes to the global memory 404 (with data produced by the hyperOp), and writes to the context memory 406 (data that serve as operands to other hyperOp(s), and events that enforce execution order among hyperOps).
[0075] In addition to arithmetic and control instructions, Load, Store, FAlloc, FBind, FDelete, WriteCM, Sync, CreateInst, and End form the basic instruction set of a CE in the execution model 412. The operation semantics of these instructions are specified below:

r := Load(a)
Load register r with the value from global memory address a.

Store(a, v)
Store value v at global memory address a.

r := FAlloc(n)
Allocates n contiguous frames, i.e., an array of n context frames - cf[0], ..., cf[n-1] - and r is loaded with the address of cf[0]. n must be less than or equal to 16. The allocated frames are in the inactive state, i.e., not associated with any hyperOp instance.

FBind(cf, hid)
Creates an instance of the hid hyperOp and binds it with frame cf, where cf is the base address of the frame. Each frame is uniquely identified by its address in the context memory address space. Binding (associating) a frame with a hyperOp instance changes the state of the frame from inactive to active.

FDelete(cf)
Adds frame cf to the free-list. Only frames in the inactive state, i.e., frames that are not associated with any hyperOp instance, can be deleted. Frames that are allocated by FAlloc can be deallocated using FDelete.

WriteCM(ca, v)
[ca] := v, writes value v at the context memory address ca. The WaitCount associated with the frame that contains ca is decremented by 1. If the WaitCount becomes 0, the frame is added to the ready-list.

Sync(ca, v)
[ca] := [ca]+v, this instruction updates the operand at address ca by adding value v to it. The sum ([ca]+v) represents the updated synchronization wait-count of the frame (associated hyperOp instance) that contains ca. Sync operates only on the hyperOp instances that are annotated as join, and only the 15th operand is used to hold the synchronization wait-count. Thus, ca is assumed to refer to the 15th operand of the frame. If this is the first Sync operating on the frame, then operand-15 is initialized with the value v. For a join-hyperOp to be ready, its synchronization wait-count and WaitCount must both become 0. The synchronization wait-count can never be negative. It is forbidden to update a synchronization wait-count that is already 0, and the respective Sync instructions are considered illegal.

r := CreateInst(hid)
Allocates a frame and binds it with an instance of the hid hyperOp. The frame address is written to register r. This instruction is equivalent to the instruction sequence r := FAlloc(1); FBind(r, hid). The frame allocated by a CreateInst instruction gets deallocated (deleted) immediately after the execution of its associated hyperOp instance, with no need for explicit deallocation using FDelete.

End
Last instruction of the executing hyperOp. The CE self-invalidates its PC and RF, and notifies the orchestrator 410 that it is idle.
[0076] Enumeration and description of arithmetic and control instructions are avoided as they only modify a CE's internal state and are not critical for understanding operations of the execution model 412.
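As an illustration only (this sequence does not appear in the patent; the register name r1, the address a, and the hyperOp identifier hidC are hypothetical), a producer hyperOp might compose these primitives to stage data in global memory and hand its address to a newly spawned single-operand consumer:

    r1 := CreateInst(hidC)   // allocate a context frame, bind a consumer instance
    Store(a, v)              // stage the data at global memory address a
    WriteCM(r1, a)           // write a into the consumer's operand slot 0; its
                             // WaitCount reaches 0 and the instance becomes ready
    End                      // the CE invalidates its PC and RF and reports idle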
[0077] FIG. 5(a) and FIG. 5(b) are diagrams that are representations of a hyperOp dependence graph (HDG) in accordance with an exemplary embodiment of the invention. Referring to FIG. 5(a) and FIG. 5(b), there is shown two different representations of the HDG 416 for an application depicting hyperOps labeled as H0, H1, H2, and H3, their types and their relations represented by edges 502.
[0078] The application is described as a hierarchical dataflow graph in which vertices represent hyperOps H0, H1, H2, and H3 and edges 502 represent explicit data transfer or execution order requirements between connected hyperOps, where H1 is a recursive hyperOp.
[0079] According to the hybrid dataflow/von-Neumann execution model classification, the execution model 412 falls under Dataflow/Control Flow class. A detailed description of hyperOps and each component is provided in accordance with various embodiments of the invention.
[0080] HyperOp representation and operation semantics are described in detail as follows.

[0081] A hyperOp is the unit of scheduling. The execution model 412 enforces data-driven scheduling at the level of hyperOps. Within a hyperOp, the execution follows the conventional control flow (program-counter driven) scheduling of instructions. Each hyperOp is represented by its computation and its static metadata. The hyperOp's computation is encoded as a sequence of instructions. The constituents of a hyperOp's static metadata are described as follows:
[Table shown as an image in the original document: fields of a hyperOp's static metadata (codePointer, arity, annotations).]
• codePointer is the reference (code-pointer) to the instruction sequence that represents the hyperOp computation.
• arity specifies the number of operands of the hyperOp.
• annotations give attributes to the hyperOp.
[0082] Table 1 describes a few annotations.
[Table 1 is shown as an image in the original document: hyperOp annotations and their descriptions.]

Table 1

[0083] Apart from the usual arithmetic operations, control operations, and memory load and store operations, a hyperOp includes special instructions to communicate, synchronize, and spawn other hyperOps. The operation semantics of these special instructions are further described in conjunction with CEs. At runtime, each instance of a hyperOp is associated with a context frame. Each frame holds operands and instance metadata of its corresponding hyperOp instance. Each hyperOp can have at most 16 operands. Outputs produced during the execution of the hyperOp may serve as inputs (operands) to other hyperOps. A producer hyperOp directly writes operands to the consumer hyperOp's frame. The consumer hyperOp starts executing as soon as all its operands are available, and its synchronization dependencies are satisfied. Execution of a consumer hyperOp may overlap with the execution of its producer hyperOps.
[0084] FIG. 6 is a diagram that illustrates an abstract machine for implementing a centralized runtime system in accordance with an exemplary embodiment of the invention. Referring to FIG. 6, there is shown an abstract machine 600 implementing a centralized runtime system, which includes the orchestrator 410, the global memory 404, the context memory 406, and the plurality of CEs (CE1-CEn).
[0085] The orchestrator 410 schedules ready hyperOps onto compute resources (CRs) that include the plurality of CEs (CE1-CEn). Each hyperOp instance is executed on a CE.
[0086] FIG. 7 is a diagram that illustrates an abstract machine for implementing a distributed runtime system in accordance with an exemplary embodiment of the invention. Referring to FIG. 7, there is shown an abstract machine 700 implementing a distributed runtime, which includes the global memory 404, the context memory 406 partitioned as CM0, CM1, ..., CMn, a plurality of orchestrator instances Orch0, Orch1, ..., Orchn, and the plurality of CEs (CE1-CEn).

[0087] To distribute the orchestrator functionality, the context memory 406 and the plurality of CEs (CE1-CEn) are partitioned and grouped into clusters of compute resources (CR0, CR1, ..., CRn) 702a-702n. The context memory 406 is partitioned into multiple banks CM0, ..., CMn. The plurality of orchestrator instances Orch0, Orch1, ..., Orchn scale with the number of CEs.
[0088] Each orchestrator instance manages CEs and the context memory bank within a CR. The ready-list and free-list of the orchestrator instances hold frames that belong to the context memory bank of the same CR. Thus, each orchestrator instance can allocate and de-allocate frames that belong to the context memory bank of the same CR. The effect is that an FAlloc or CreateInst instruction executed in a cluster will allocate a frame that belongs to the same CR, and each orchestrator instance schedules hyperOps onto CEs within the same CR. The abstraction of a single address space for the context memory 406 is preserved, and allows a hyperOp executing in one CR to write to context frames of another CR. The bit field representation of a context memory address is as follows:
[Diagram shown as an image in the original document: bit-field representation of a context memory address.]
[0089] A limitation with this distributed runtime system is that the hyperOps created (using CreateInst, or FAlloc and FBind) in one CR cannot be executed on another CR. This is an impediment to efficient work distribution. To address this issue, the instruction set is extended with a remote frame allocate instruction called RFAlloc, with operational semantics described as follows:

r := RFAlloc(n, crid)
Allocates an array of n context frames, cf[0], ..., cf[n-1], in the CR indexed with crid, and register r is loaded with the address of cf[0]. n must be less than 17. The allocated frames are in the inactive state, i.e., not associated with any hyperOp instance.

[0090] FBind and FDelete instructions are used to bind a hyperOp instance to a remote frame and to remove a remote frame, respectively. With the number of CRs known, the work can be distributed across CRs using RFAlloc and FBind instructions.
[0091] Further, the execution model 412 prescribes a weak ordering memory model that guarantees sequential consistency for data-race-free programs. A weak ordering model relies on explicit synchronization to avoid data races. A data race occurs when at least two unordered memory operations are accessing the same memory location, and at least one of the operations is a write.
[0092] In the execution model 412, WriteCM and Sync instructions are used for synchronizing accesses to the global memory 404. The memory communication model 418 guarantees that all accesses to the global memory 404 that come before a synchronization operation (WriteCM or Sync) in the producer hyperOp's program order are observed by the (synchronizing) consumer hyperOp. In conventional shared-memory programming, synchronization enforces an order among shared variable accesses, whereas in the execution model 412, synchronization enforces an execution order among hyperOps, which in turn imposes order among shared variable accesses. This does not mean that any two ordered hyperOps enforce an order among their shared variable accesses, as ordered hyperOps can still overlap their execution. Therefore, care must be taken to enforce an order among shared variable accesses, such that the producer hyperOp finishes the shared variable access and then enables the consumer hyperOp. The situation is better illustrated with an example shown in FIG. 8.
[0093] FIG. 8 is a diagram that illustrates execution among two hyperOps in accordance with an exemplary embodiment of the invention. Referring to FIG. 8, there is shown a producer hyperOp Hp, a consumer hyperOp Hc and a plurality of statements s1, s2, s3, s4 and s5.

[0094] An execution order among two hyperOps does not guarantee an order among their shared variable accesses. Hp enables Hc by decrementing its synchronization wait-count using the sync instruction in s2. This creates an execution order between Hp and Hc, such that Hp < Hc. In Hp, the memory communication model 418 guarantees that s1 < s2, and thus s1 < s4. However, s3 and s5 are unordered and may create a data race.
[0095] The sync instruction in s2 enforces a read-after-write dependency on variable 'a' in s1 and s4, and thus guarantees that 'c' holds the value written in s1 after the execution of statement s4. However, accesses to variable 'b' in s3 and s5 are not synchronized and involve a data race. Such unsynchronized shared variable accesses cause undefined behavior and should be avoided at all costs.
[0096] The execution model 412 assumes that the programs are properly synchronized or data-race-free. With no data races, a program can be reasoned about by executing hyperOps in the order of their producer-consumer relationships and executing the instructions of each hyperOp in program order.
[0097] In accordance with an embodiment, a programming abstraction in C called C with hyperOps that matches the execution model 412 is illustrated.
[0098] C with hyperOps can be used as an intermediate representation for compilers, and also as a low-level language for efficient programmers. In this programming abstraction, the REDEFINE instructions, including WriteCM, Sync, FAlloc, FBind, FDelete, CreateInst, and RFAlloc, are expressed as function calls. A standard C compiler tool chain is extended to lower these function calls to machine instructions. The instructions and their corresponding function interfaces are illustrated in the table below. Also, there is no equivalent function call for the End instruction; it is added implicitly by the compiler.
[0099] Table 2 provides the execution model 412’s instructions and programming interfaces.
[Table 2 is shown as an image in the original document: the execution model's instructions and their corresponding C function interfaces.]
Table 2
The __CMAddr type is of 32-bit size and holds a context memory address. __SMD is a structure that holds hyperOp static metadata.
[00100] Further, each hyperOp's computation is expressed as a function. The data transfer between hyperOps is realized through writes to the context memory 406 or through accesses to the global memory 404 using pointers that are explicitly communicated. Thus, hyperOp functions are of void return type. The function prototype for a hyperOp with two operands is shown below:
[Code shown as an image in the original document: function prototype of a hyperOp with two operands.]
[00101] The custom function attribute __hyperOp__ is added to specify that it is a hyperOp function. op0 and op1 are operands of the hyperOp, and selfId is the address of the context frame associated with the hyperOp instance. The argument selfId is not counted as an operand of the hyperOp. The function prototype of a hyperOp with zero operands is shown below:
[Code shown as an image in the original document: function prototype of a hyperOp with zero operands.]
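Since both prototypes appear only as images, the following is a hedged sketch of their likely shape, using the attribute and parameter names given in the text (the exact spellings and the parameter types are assumptions):

    /* Hedged sketch: hyperOp function prototypes as described above. */
    __hyperOp__ void twoOperandHyperOp(__Op32 op0, __Op32 op1, __CMAddr selfId);
    __hyperOp__ void zeroOperandHyperOp(__CMAddr selfId);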
[00102] As the current implementation is a 32-bit machine, __Op32 can hold any data type that fits in 32 bits. __Op32 is defined as a union data type as shown below:
[Code shown as an image in the original document: definition of the __Op32 union.]
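A hedged sketch of what the union plausibly contains; only the 32-bit constraint is stated in the text, and the member names here are assumptions:

    #include <stdint.h>

    /* Hedged sketch: any 32-bit value can be carried as a hyperOp operand. */
    typedef union {
        int32_t  i32;
        uint32_t u32;
        float    f32;
    } __Op32;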
[00103] As mentioned earlier, each hyperOp code (hyperOp function) is associated with static metadata, defined as type struct __SMD as shown below:
[Code shown as an image in the original document: definition of struct __SMD.]
[00104] In struct __SMD, the ann field holds the hyperOp annotations (refer to Table 1). The arity field holds the hyperOp's operand count and fptr holds the hyperOp's function pointer. The variables that hold static metadata (struct __SMD type) are constants, declared with the const qualifier.
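A hedged reconstruction from the field descriptions above; the names ann, arity, and fptr are given in the text, while field order and types are assumptions:

    /* Hedged sketch of the static metadata record. */
    typedef struct __SMD {
        uint32_t ann;           /* hyperOp annotations (Table 1) */
        uint32_t arity;         /* number of operands */
        void   (*fptr)(void);   /* pointer to the hyperOp function */
    } __SMD;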
[00105] The code below shows the declaration of a __SMD variable smdEnd. Its hyperOp function is end, it requires two operands, and it is annotated as ANN_END, i.e., as an End-hyperOp (referring to Table 1).
[Code shown as an image in the original document: declaration of the __SMD variable smdEnd.]
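A hedged sketch of that declaration, consistent with the struct sketched above (the initializer syntax and the cast are assumptions):

    /* Hedged sketch: static metadata for the end hyperOp. */
    const __SMD smdEnd = { ANN_END, 2, (void (*)(void))end };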
[00106] FIG. 9a and FIG. 9b are diagrams that illustrate inter hyperOp communication in accordance with an exemplary embodiment of the invention.
[00107] FIG. 9a is a diagram that illustrates data transfer through the context memory 406. Referring to FIG. 9a, there is shown a producer hyperOp 902, a consumer hyperOp 904 and a code snippet 906 for data transfer through the context memory 406.
[00108] FIG. 9b is a diagram that illustrates data transfer through the global memory 404. Referring to FIG. 9b, there is shown the global memory 404, the producer hyperOp 902, the consumer hyperOp 904 and process steps 908, 910 and 912.

[00109] At 908, the producer hyperOp 902 updates an array in the global memory 404. At 910, the code shown in the code snippet 906 enables data transfer between the producer hyperOp 902 and the consumer hyperOp 904 through the context memory 406. At 912, the consumer hyperOp 904 loads the array from the global memory 404.
[00110] The method for communicating a scalar value between two hyperOps is detailed in Listing 1.
[Listing 1 is shown as an image in the original document: scalar value transfer between a producer and a consumer hyperOp.]
Listing 1
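Because Listing 1 survives only as an image, the following is a hedged sketch of the pattern it implements. The line numbers cited in the next paragraph refer to the original listing, not to this sketch; the API spellings follow those used elsewhere in this description:

    /* Hedged sketch: the producer writes a scalar into operand slot 0 of
     * its consumer's context frame. */
    __CMAddr consumerAddr;   /* context-frame address of the consumer instance */

    __hyperOp__ void producer(__CMAddr selfId) {
        int x = 10;                                /* illustrative scalar */
        __writeCM(__opAddr(consumerAddr, 0), x);   /* operand 0 of the consumer */
    }

    __hyperOp__ void consumer(__Op32 v, __CMAddr selfId) {
        /* v carries the scalar written by the producer */
    }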
[00111] In Listing 1, the producer hyperOp (lines 4-9) transfers a scalar value to the consumer hyperOp (lines 11-14). Before transferring the scalar value, the producer hyperOp needs to know the context-frame address of its consumer hyperOp. The global variable consumerAddr (line 2) holds the context-frame address of the consumer hyperOp instance. The consumer hyperOp takes only one operand (line 11), and this operand gets the 0th operand slot in its context frame. At line 7, the producer hyperOp assigns the 0th operand of the consumer hyperOp with a scalar value x, using the __writeCM instruction. The method of communicating an array of values between two hyperOps is provided in Listing 2.
[Listing 2 is shown as an image in the original document: array transfer between two hyperOps through the global memory.]
Listing 2
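As with Listing 1, only an image of Listing 2 survives; the following is a hedged sketch of its pattern. The identifiers data and consumerAddr are named in the next paragraph; everything else is an assumption:

    #include <stdint.h>

    /* Hedged sketch: bulk data goes through the global memory; the address
     * and the synchronization go through the context memory. */
    int data[32];            /* shared buffer in the global memory */
    __CMAddr consumerAddr;   /* context-frame address of the consumer instance */

    __hyperOp__ void producer(__CMAddr selfId) {
        for (int i = 0; i < 32; i++)
            data[i] = i;                       /* global-memory writes */
        __writeCM(__opAddr(consumerAddr, 0),   /* address + synchronization */
                  (uint32_t)(uintptr_t)data);
    }

    __hyperOp__ void consumer(__Op32 aAddr, __CMAddr selfId) {
        int *a = (int *)(uintptr_t)aAddr.u32;  /* address received as operand 0 */
        int sum = 0;
        for (int i = 0; i < 32; i++)
            sum += a[i];                       /* ordered after producer's writes */
    }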
[00112] In Listing 2, the producer hyperOp (lines 5-13) transfers an array of values to the consumer hyperOp (lines 15-22) through the global variable data (line 2). The variable consumerAddr (line 3) holds the context-frame address of the consumer hyperOp instance. The producer hyperOp first writes the data that needs to be communicated to the consumer to the global memory 404 (lines 7-9) and then communicates the address of the data to the consumer hyperOp (line 11). Note that the __writeCM instruction (line 11) ensures that the producer hyperOp's write operations (line 8) happen before the consumer hyperOp's read operations (line 19) to the shared variable data.
[00113] In an example, Listing 3 shows code for realizing the fork-join model of parallelism.
[Listing 3 is shown as an image in the original document: fork-join parallelism expressed in C with hyperOps.]
Listing 3
[00114] FIG. 10 is a diagram illustrating a hyperOp dependence graph (HDG) implementing fork-join parallelism in accordance with an exemplary embodiment of the invention. Referring to FIG. 10, there is shown an HDG 1000 described in the code of Listing 3 implementing fork-join parallelism, which includes a master hyperOp 1002, a plurality of worker hyperOp instances 1004a-1004n represented as worker0, worker1, ..., workern, and a join hyperOp 1006.
[00115] The code in Listing 3 contains three types of hyperOps representing the functions: the plurality of worker hyperOp instances 1004a-1004n (lines 1-8), the join hyperOp 1006 (lines 10-19), and the master hyperOp 1002 (lines 22-42). worker, join, and master are the hyperOp functions, and smdWorker, smdJoin, and smdMaster are their static metadata, respectively. The function __opAddr(frId, opId) (in lines 29 and 39) returns the context memory address of the operand in the frame frId at index opId.
[00116] The parallel computation is distributed among the plurality of worker hyperOp instances 1004a-1004n. The join function in the fork-join model performs the sequential computation that follows the parallel computation. The join hyperOp 1006 waits for the plurality of worker hyperOp instances 1004a-1004n to finish execution. This is realized by annotating the join hyperOp 1006 as ANN_JOIN (line-19) and using the sync instruction (line-6). Each worker hyperOp instance of the plurality of worker hyperOp instances 1004a-1004n finishes execution by synchronizing with the join hyperOp 1006 at line-6. Even though the join function does not receive any operands as arguments (line-10), its arity is set to '1' (line-19). The synchronization wait-count for the join hyperOp 1006 (referring to Table 1) is one of the operands of the hyperOp but is not available as a function argument.
[00117] Further, the master hyperOp 1002 spawns or builds a sub-graph that implements the fork-join pattern of parallelism. New nodes are added to the sub-graph at lines 27 and 36. Edges connecting each worker hyperOp or node to the join hyperOp 1006 or node are created at line-39. Here, creating an edge means specifying a consumer hyperOp's operand address to its producer hyperOp. Operand-15 of the join hyperOp 1006 holds the synchronization wait-count (line-31). Since the plurality of worker hyperOp instances 1004a-1004n need to decrement this synchronization wait-count to enable the join hyperOp 1006, the address of the join hyperOp 1006's operand-15 is forwarded to the plurality of worker hyperOp instances 1004a-1004n (line-39). A condensed sketch of this structure follows.
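The sketch below is a hedged condensation of the pattern just described, not a reproduction of Listing 3 (its line numbers do not correspond). The helpers __createInst and __sync follow the instruction-to-function naming pattern of Table 2 and are assumptions, as is treating __CMAddr as a 32-bit scalar type:

    /* Hedged sketch: master forks N workers and one join; each worker signals
     * completion by decrementing the join's synchronization wait-count. */
    #define N 4   /* illustrative worker count */

    __hyperOp__ void worker(__Op32 joinSync, __CMAddr selfId) {
        /* ... parallel work ... */
        __sync((__CMAddr)joinSync.u32, -1);   /* decrement join's wait-count */
    }

    __hyperOp__ void join(__CMAddr selfId) {
        /* sequential computation that follows the parallel phase */
    }

    __hyperOp__ void master(__CMAddr selfId) {
        __CMAddr j = __createInst(&smdJoin);  /* join node */
        __sync(__opAddr(j, 15), N);           /* first Sync initializes the count */
        for (int i = 0; i < N; i++) {
            __CMAddr w = __createInst(&smdWorker);        /* worker node */
            __writeCM(__opAddr(w, 0), __opAddr(j, 15));   /* edge: worker -> join */
        }
    }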
[00118] Listing 4 shows the C code for computing a Fibonacci number using recursion.
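Listing 4 itself survives only as an image; its content is the familiar sequential recursion, sketched here (the function and parameter names are assumptions):

    /* Hedged sketch of sequential Fibonacci by recursion. */
    int fib(int n) {
        if (n < 2)
            return n;
        return fib(n - 1) + fib(n - 2);
    }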
[Listing 4 is shown as an image in the original document.]
Listing 4

[00119] Listing 5 shows the C with hyperOps version of the same Fibonacci kernel.
[Listing 5 is shown as an image in the original document: the parallel-fib kernel in C with hyperOps.]
Listing 5
[00120] From here on, the C with hyperOps version is referred to as parallel-fib. This parallel-fib code is used as a running example to illustrate certain details of the C with hyperOps programming abstraction. The number of lines in the parallel-fib code is much greater than in the sequential code of Listing 4. This is because, apart from the actual computation, parallel-fib includes instructions (statements) to construct the HDG for the application.
[00121] The parallelFib function with the qualifier __kernel, at line-54, is the entry function to the kernel. All input and output buffers are allocated by the host code (as per OpenCL semantics) and provided as arguments to this function. In this case, N and fibN are the input and output, respectively. The functions named fib, sum, and end annotated as __hyperOp__ (hyperOp functions) describe the respective hyperOps' computations. The static metadata (SMD) variables associated with these hyperOps are smdFib, smdSum, and smdEnd. Each hyperOp type represents a static computation. At run time, the kernel may involve execution of multiple instances of these hyperOp types, except the end-hyperOp (refer to Table 1). An end-hyperOp is the one with SMD annotation ANN_END. In parallel-fib, end is the end-hyperOp. Note that smdEnd's ann field is assigned with ANN_END, at line-11. The start-hyperOp (refer to Table 1) is not part of the kernel code. The runtime invokes the start-hyperOp, and the start-hyperOp calls the kernel's entry function, in this case, the parallelFib function. From a kernel programmer's perspective, the parallelFib function can be assumed to be a hyperOp with no SMD associated with it.
[00122] FIG. 11(a) and FIG. 11(b) are diagrams that illustrate a hierarchical dataflow graph representation in accordance with an exemplary embodiment of the invention. Referring to FIG. 11(a) and FIG. 11(b), there is shown a hierarchical dataflow graph representation 1100 of the code in Listing 5, which includes hierarchical hyperOps or nodes, 'start' and 'fib', and leaf hyperOps or nodes, 'sum' and 'end'.
[00123] The sum node's output edge gets bound to the output edge of its parent fib node. The start hyperOp is part of the startup code (C runtime) and calls the kernel entry function, parallelFib. Parallel-fib has dynamic task-parallelism. Thus, its execution graph is input data dependent.
[00124] FIG. 12(a) and FIG. 12(b) are diagrams that illustrate an execution graph for a code in accordance with an exemplary embodiment of the invention. Referring to FIG. 12(a) and FIG. 12(b), there is shown a parallel-fib's execution graph 1200 as a Directed Acyclic Graph (DAG) for the code in Listing 5 with input N=3, which includes hierarchical nodes or hyperOps, 'start' and 'fib', and leaf nodes or hyperOps, 'sum' and 'end'.
[00125] Referring to FIG. 12(a), there is shown a control dependence (parent-child) graph that depicts parent-child relationship or control dependence in the kernel. Only hierarchical nodes have children.
[00126] Referring to FIG. 12(b) there is shown a data dependence graph that depicts data-dependence or dataflow in the kernel.
[00127] In the execution graph 1200, the vertices represent hyperOp instances, and the directed edges represent dependencies between the connecting hyperOp instances. Since parallel-fib has dynamic parallelism, the kernel’s control and data dependence DAG is input dependent. The parallel-fib’s execution graph 1200 is used for computing the 3rd Fibonacci number.
[00128] The execution graph 1200 shows all the hyperOp types and their relations in parallel-fib code as a hierarchical dataflow graph in two different representations.
[00129] The execution graph 1200 comprises nodes, and each node represents a hyperOp instance of the type specified inside the node. Node 'fib(i)' represents an instance of the fib hyperOp with input i.
[00130] A programmer statically defines the hyperOp types and their relations or control and data dependencies. Dynamic instances of hyperOps may get generated at runtime, and the dependencies between the dynamic instances are the same as the statically defined relations. A hierarchical node or hyperOp contains a child graph, and the child graph itself can have hierarchical nodes and leaf nodes.

[00131] Apart from instructions to spawn child graph nodes, a hierarchical node may include instructions to create new edges connecting the child nodes and instructions to bind its output edge(s) with one or more child nodes' output edge(s). Here, creating edges or binding edges means specifying a consumer hyperOp's context memory address to the producer hyperOp. In parallel-fib, the parallelFib function creates a child graph with two nodes - lines 56-57 spawn one instance of fib and one instance of end hyperOp, and at line-61 an edge is created connecting the output of fib to one of the inputs of end. Similarly, the fib hyperOp can conditionally create a child graph with three nodes - lines 32-34 spawn two fib and one sum hyperOp instances, lines 40-41 create edges connecting the outputs of the fibs to the inputs of sum, and line-44 binds the output edge of the child sum with the output edge of the parent fib. Leaf nodes, end and sum, do not create any new nodes or edges. In the entire code in Listing 5, only lines 17, 29, 48, 49, 64, and 67 perform data transfers. The remaining part of the code, with REDEFINE specific instructions, describes the kernel HDG.
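A hedged, condensed sketch of the fib hyperOp's graph construction described above; it does not reproduce Listing 5, its line numbering does not correspond, and __createInst and the member accesses on __Op32 and __CMAddr are assumptions:

    /* Hedged sketch: fib(n) either forwards the base case to its output edge
     * or spawns fib(n-1), fib(n-2) and a sum node, wiring their edges. */
    __hyperOp__ void fib(__Op32 n, __Op32 out, __CMAddr selfId) {
        if (n.i32 < 2) {
            __writeCM((__CMAddr)out.u32, n.i32);    /* base case to output edge */
            return;
        }
        __CMAddr s  = __createInst(&smdSum);        /* child sum node */
        __CMAddr f1 = __createInst(&smdFib);        /* child fib(n-1) */
        __CMAddr f2 = __createInst(&smdFib);        /* child fib(n-2) */
        __writeCM(__opAddr(f1, 0), n.i32 - 1);      /* operand: n-1 */
        __writeCM(__opAddr(f1, 1), __opAddr(s, 0)); /* edge: fib -> sum */
        __writeCM(__opAddr(f2, 0), n.i32 - 2);      /* operand: n-2 */
        __writeCM(__opAddr(f2, 1), __opAddr(s, 1)); /* edge: fib -> sum */
        __writeCM(__opAddr(s, 2), out.u32);         /* bind child output to parent's */
    }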
[00132] Further, the process executes a kernel on multiple CRs. In the execution model 412, computational resources are organized in a two-level hierarchy as CEs and CRs. As mentioned, each orchestrator instance schedules hyperOps onto CEs within the same CR. The execution model 412 provides primitives, RFAlloc and FBind, to delegate a hyperOp from one CR to another CR. Using these primitives, a programmer can express the mapping between hyperOps and the CRs. With no dynamic load balancing support, work (load) distribution is part of the kernel code in the execution model 412.
[00133] The computational resources or number of CRs for a kernel are statically known or specified at compile time by the user. The computational resources are specified in terms of rectangular regions of CRs, called a fabric. The hardware implementation follows a 2D arrangement of CRs. Thus, a kernel's resource requirements are specified as the dimensions of a rectangular fabric. The programming abstraction provides two preprocessing macros named __NUMCOL__ and __NUMROW__ through which the programmer can define kernel resource requirements. For a fabric size of 2 columns and 3 rows, the user must define these macros as the compiler options -D __NUMCOL__=2 -D __NUMROW__=3. If not defined, the default value '1' is assigned to these macros. __NUMCR__ is another useful macro that holds the number of CRs allocated for the kernel, defined as:
[Code shown as an image in the original document: definition of the __NUMCR__ macro.]
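Given that __NUMCR__ holds the number of CRs in the fabric, its definition is plausibly the product of the two dimension macros; a hedged sketch:

    /* Hedged sketch of the macro shown as an image above. */
    #define __NUMCR__ (__NUMCOL__ * __NUMROW__)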
[00134] Table 2 shows the __rFAlloc API; one of the arguments for __rFAlloc is the CR Id of type __CrId, defined as follows:
[Code shown as an image in the original document: definition of the __CrId type.]
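A hedged sketch of a plausible __CrId definition, from the description of CRs being identified by a 2D index; the field names and widths are assumptions:

    #include <stdint.h>

    /* Hedged sketch: 2D index of a compute resource within the fabric. */
    typedef struct {
        uint8_t x;   /* column index */
        uint8_t y;   /* row index */
    } __CrId;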
[00135] In a kernel's fabric, each CR is uniquely identified with its 2D-index of type __CrId. The kernel's start hyperOp is always executed on CR(0,0). Thus, a kernel execution starts at CR(0,0) and then gets distributed to the rest of the fabric.
[00136] In an example, Listing 6 shows the C code for adding two arrays of length N.
[Listing 6 is shown as an image in the original document: sequential C code for adding two arrays of length N.]
Listing 6
[00137] The computation is realized as multiple smaller vector additions of length 32, defined as the v32Add function. It is assumed that N is a multiple of 32, and v32Add is the basic unit of work. A sketch of this structure follows.
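Since Listing 6 appears only as an image, here is a hedged sketch of the structure it describes; only the name v32Add and the 32-element unit are given in the text, and the signatures are assumptions:

    /* Hedged sketch: vector addition built from 32-element units. */
    void v32Add(const int *a, const int *b, int *c) {
        for (int i = 0; i < 32; i++)
            c[i] = a[i] + b[i];
    }

    void vecAdd(const int *a, const int *b, int *c, int n) {
        for (int i = 0; i < n; i += 32)   /* n assumed a multiple of 32 */
            v32Add(a + i, b + i, c + i);
    }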
[00138] Listing 7 shows a C with hyperOps version of adding two arrays using multiple CRs. With the v32Add hyperOp as the basic unit of work, it implements the work distribution on a fabric of size 2 x 2.
[Listing 7 is shown as an image in the original document: vector addition in C with hyperOps distributed across multiple CRs.]
Listing 7
[00139] Listing 7 shows the vector addition in C with hyperOps, with fabric-size 2x2. The fabric size macros are defined in the compiler options as -D __NUMCOL__=2 -D __NUMROW__=2. In Listing 7, the VecAdd function executes on CR(0,0) and performs the work distribution across the fabric (lines 56-66). __rFAlloc (line-61) and __fBind (line-62) spawn a new hyperOp for each CR. Note that the frames allocated using __rFAlloc are explicitly deleted (garbage collected) using __fDelete (line-38). A hedged fragment of this distribution pattern is sketched below.
[00140] FIG. 13(a) and FIG. 13(b) are diagrams that illustrate a hierarchical dataflow graph representation of a code in accordance with an exemplary embodiment of the invention. Referring to FIG. 13(a) and FIG. 13(b), there is shown a hierarchical dataflow graph representation 1300 of the code in Listing 7 depicting the crWrk hyperOp (hierarchical node) that encompasses the work allocated for each CR.
[00141] The code in Listing 7 assumes that the input vector length N is a multiple of (__NUMCR__ x 32).
[00142] The HDG of the code in Listing 7 is illustrated in FIG. 13(a) and FIG. 13(b). The kernel implements a two-level hierarchical fork-join parallelism. The parallelism pattern matches the underlying hierarchical organization of CEs.
[00143] FIG. 14(a) and FIG. 14(b) are diagrams that illustrate an execution graph for a code in accordance with an exemplary embodiment of the invention. Referring to FIG. 14(a) and FIG. 14(b), there is shown an execution graph 1400 for the code in Listing 7 with input N = 256.

[00144] Referring to FIG. 14(a), there is shown a control dependence (parent-child) graph. Referring to FIG. 14(b), there is shown a data dependence graph.
[00145] Each node in the execution graph 1400 represents a hyperOp instance of the type specified inside the node. The kernel implements fork-join parallelism at two levels, within a CR and across CRs. The start and end hyperOps are executed on CR(0,0). Each CR executes two v32Add hyperOps.
[00146] FIG. 15 is a diagram that illustrates an overview of compiler implementation/flow in C with hyperOps in accordance with an exemplary embodiment of the invention. Referring to FIG. 15, there is shown the compiler implementation/flow 1500 that includes a modified Clang 1502, a Low Level Virtual Machine (LLVM) Intermediate Representation (IR) optimizer 1504, a modified RISC-V (Reduced Instruction Set Computer - V) backend 1506 and a modified RISC-V GNU Binary Utilities (Binutils) 1508.
[00147] C with hyperOps 1510 is added to the modified Clang 1502. Clang is a compiler front end for the C programming language. From the modified Clang, LLVM bitcode with REDEFINE intrinsics 1512 is passed to the LLVM IR optimizer 1504. XR instructions are added to the modified RISC-V backend 1506 as intrinsics. The modified RISC-V GNU Binutils 1508 is updated to recognize the XR instructions.
[00148] In the REDEFINE compiler, the RISC-V instruction-set architecture (ISA) is used for implementing the REDEFINE many-core processor. RISC-V is an open ISA and supports customization. The opcode map for custom-0 of the RISC-V ISA is used for implementing specific instructions of the execution model 412. This custom extension of the RISC-V ISA is called XR. Each CE is an in-order, single-issue, 5-stage pipelined RV32IMFXR ISA core.

[00149] Table 3 shows the XR instruction encoding and illustrates XR - a RISC-V ISA custom extension to support the execution model 412. XR maps to the RISC-V ISA's custom-0 opcode encoding.
[Table 3 is shown as an image in the original document: encoding of the XR custom instructions.]
Table 3
[00150] The present invention is advantageous in that it provides a macro dataflow execution model for parallel execution of macro operations (hyperOps). The execution model is realized on a chip as both hardware and software functionality, and primitives/interfaces are provided for communication between software (HDG) and hardware (Context Memory, Global Memory, Orchestrator, and Compute Elements).
[00151] Programs are represented as a hierarchical dataflow graph which unfolds dynamically at runtime. Furthermore, the orchestrator functionality may also be realized as distributed/decentralized and enables scheduling and execution of hyperOps on compute resources. Such an execution model provides high performance by efficiently exploiting task parallelism at a finer granularity than conventional multithreading.
[00152] Those skilled in the art will realize that the above recognized advantages and other advantages described herein are merely exemplary and are not meant to be a complete rendering of all of the advantages of the various embodiments of the present invention.
[00153] The present invention may be realized in hardware, or a combination of hardware and software. The present invention may be realized in a centralized fashion, in at least one computer system, or in a distributed fashion, where different elements may be spread across several interconnected computer systems. A computer system or other apparatus/devices adapted to carry out the methods described herein may be suited. A combination of hardware and software may be a general-purpose computer system with a computer program that, when loaded and executed on the computer system, may control the computer system such that it carries out the methods described herein. The present invention may be realized in hardware that comprises a portion of an integrated circuit that also performs other functions. The present invention may also be realized as a firmware which forms part of the media rendering device.
[00154] The present invention may also be embedded in a computer program product, which includes all the features that enable the implementation of the methods described herein, and which, when loaded and/or executed on a computer system, is able to carry out these methods. Computer program, in the present context, means any expression, in any language, code or notation, of a set of instructions intended to cause a system with information processing capability to perform a particular function either directly, or after either or both of the following: a) conversion to another language, code or notation; b) reproduction in a different material form.

[00155] While the present invention is described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made, and equivalents may be substituted without departure from the scope of the present invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present invention without departure from its scope. Therefore, it is intended that the present invention is not limited to the particular embodiment disclosed, but that the present invention will include all embodiments that fall within the scope of the appended claims.

Claims

1. An apparatus (400) for implementing dynamic, data-dependent parallelism for task execution based on an execution model (412), wherein the execution model (412) defines a plurality of high-level abstractions that determine the behavior of a plurality of hardware components of the apparatus (400), the apparatus (400) comprising:
a memory (402) with two distinct address spaces;
a plurality of compute resources (408a-408n), each of the one or more compute resources comprising a plurality of compute elements (CEs) (CE1-CEn); and
an orchestrator (410);
wherein the execution model (412) comprises:
a concurrency model (414) that defines one or more primitive operations that enable the runtime construction of a hyperOp dependence graph (HDG) (416) that describes one or more application programs, wherein the HDG (416) is a hierarchical dataflow graph that comprises a plurality of hyperOps represented by nodes of the hierarchical dataflow graph and a plurality of directed edges of the hierarchical dataflow graph connect the plurality of hyperOps, wherein a hyperOp is a multiple-input and multiple-output macro operation, and the plurality of directed edges represent explicit data transfer or execution order requirements between one or more connected hyperOps of the plurality of hyperOps, wherein hyperOp types, their relations and data dependencies in the one or more application programs are statically defined, wherein the one or more application programs are organized as hyperOp computations and hyperOp static metadata, wherein the hyperOp static metadata specifies hyperOp composition and its annotations;
a memory communication model (418) that defines one or more primitive operations that enable configuring the two distinct address spaces of the memory (402) as a global memory (404) and a context memory (406), wherein the global memory (404) comprises a global memory address space which is shared across the plurality of hyperOps, and wherein the context memory (406) comprises a context memory address space which stores one or more context frames, wherein each hyperOp instance of a hyperOp is associated with a context frame, wherein a context frame holds operands of an associated hyperOp instance; and
a runtime system (420) that defines one or more functionalities that are implemented in the orchestrator (410) to manage execution of one or more hyperOp instances on one or more compute resources of the plurality of compute resources (408a-408n) based on the associated hyperOp computations and hyperOp static metadata defined in the HDG (416), wherein each compute resource of the one or more compute resources comprises the plurality of CEs (CE1-CEn) and each hyperOp instance of the one or more hyperOp instances is executed on a CE, wherein the orchestrator (410) schedules unordered hyperOp instances in parallel and directly loads operands of a hyperOp instance from an associated context frame into a register file of a CE.
2. The apparatus (400) as claimed in claim 1, wherein accesses to the global memory (404) are data-race free, wherein accesses to the global memory (404) are ordered and synchronized across the plurality of hyperOps through writes to the context memory (406).
3. The apparatus (400) as claimed in claim 1, wherein a hyperOp can perform only writes to the context memory (406).
4. The apparatus (400) as claimed in claim 1, wherein a context frame comprises slots that hold operands of a hyperOp instance, wherein the slots are write-once, immutable data structures and the operand values associated with a hyperOp in a context frame are undefined after being read by the orchestrator (410) during launch of the hyperOp.
5. The apparatus (400) as claimed in claim 1, wherein only the orchestrator (410) reads the context memory (406).
6. The apparatus (400) as claimed in claim 5, wherein the orchestrator (410) comprises two storage structures, a free-list and a ready-list, wherein the free-list keeps track of unallocated context frames and the ready-list contains addresses of context frames with hyperOps that are ready to execute.
7. The apparatus (400) as claimed in claim 1, wherein the plurality of hyperOps comprise at least one producer hyperOp and at least one consumer hyperOp, wherein a producer hyperOp stores data in the global memory (404) and communicates the address of the data or synchronizes to a corresponding consumer hyperOp through a write to a context frame and the corresponding consumer hyperOp accesses the data through the address received as an operand, wherein operands are undefined when context frames are created and are defined only once by writes of a producer hyperOp.
8. The apparatus (400) as claimed in claim 7, wherein the orchestrator (410) enforces an execution order between the producer hyperOp and the corresponding consumer hyperOp through writes to a context frame.
9. The apparatus (400) as claimed in claim 1, wherein the execution model (412) comprises a centralized runtime system that defines one or more functionalities that are implemented in the orchestrator (410) to manage the entire context memory address space, wherein the orchestrator (410) schedules execution of hyperOps on the plurality of CEs (CE1-CEn).
10. The apparatus (400) as claimed in claim 1, wherein the execution model (412) comprises a decentralized runtime system that defines one or more functionalities to configure the orchestrator (410) as a plurality of orchestrator instances (Orch0, Orch1..Orchn), wherein the context memory address space and the plurality of CEs (CE1-CEn) are partitioned and grouped into a plurality of clusters, each cluster of the plurality of clusters forms a compute resource, wherein each orchestrator instance of the plurality of orchestrator instances (Orch0, Orch1..Orchn) is associated with a compute resource and schedules execution of hyperOps on the compute resource.
PCT/IN2021/050678 2020-07-16 2021-07-13 Apparatus for implementing dynamic, data-dependent parallelism for task execution based on an execution model WO2022013887A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN202041025339 2020-07-16
IN202041025339 2020-07-16

Publications (1)

Publication Number Publication Date
WO2022013887A1 true WO2022013887A1 (en) 2022-01-20

Family

ID=79555269

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2021/050678 WO2022013887A1 (en) 2020-07-16 2021-07-13 Apparatus for implementing dynamic, data-dependent parallelism for task execution based on an execution model

Country Status (1)

Country Link
WO (1) WO2022013887A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7647472B2 (en) * 1998-05-08 2010-01-12 Freescale Semiconductor, Inc. High speed and high throughput digital communications processor with efficient cooperation between programmable processing components
EP2041655B1 (en) * 2006-12-01 2014-03-19 Murex S.A.S. Parallelization and instrumentation in a producer graph oriented programming framework

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116645263A (en) * 2023-07-25 2023-08-25 深流微智能科技(深圳)有限公司 Graphic processing unit
CN116645263B (en) * 2023-07-25 2023-12-05 深流微智能科技(深圳)有限公司 Graphic processing unit

Legal Events

Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 21842104
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 21842104
    Country of ref document: EP
    Kind code of ref document: A1