WO2014170036A1 - Method and apparatus for exploiting data locality in dynamic task scheduling - Google Patents

Info

Publication number
WO2014170036A1
Authority
WO
WIPO (PCT)
Prior art keywords
function
data
parallel
data structures
source code
Application number
PCT/EP2014/051193
Other languages
French (fr)
Inventor
Sebastian MATTHEIS
Tobias SCHÜLE
Original Assignee
Siemens Aktiengesellschaft
Application filed by Siemens Aktiengesellschaft
Priority to EP14703041.5A (EP2943877B1)
Publication of WO2014170036A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451 Code distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/42 Syntactic analysis
    • G06F8/423 Preprocessors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/43 Checking; Contextual analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation
    • G06F8/4441 Reducing the execution time required by the program code
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/445 Exploiting fine grain parallelism, i.e. parallelism at instruction level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451 Code distribution
    • G06F8/452 Loops
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5033 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering data affinity

Definitions

  • the lambda function refers to identifiers declared outside the lambda function. A set of these variables is commonly called a closure. Closures are defined between the square brackets of the lambda function in the declaration of the lambda expression. The mechanism allows these variables to be captured by value or by reference. The capture list indicates which variables or objects declared outside the lambda function are visible inside the lambda function.
  • parameter list in round brackets specifies parameters and the third part of the lambda function indicates the function body of the lambda function.
  • the data structures can comprise several dimensions.
  • a data structure can be for instance a data array or a data matrix.
  • the data structures can be user-defined data
  • the task calling the function body is assigned to and executed on a processor core of the parallel computing system which is associated to the memory unit of the parallel computing system which stores the data of the data structures specified in the capture list of the parallel lambda function.
  • the memory unit is determined on the basis of the derived data location information.
  • the capture list of the parallel lambda function forming part of the source code indicates external data structures which are used in the function body of the parallel lambda function.
  • the processing of the source code is performed by a compiler unit.
  • the compiler unit generates code which derives from the capture list and the parameter list of the parallel lambda function, whose function body is called by the task, automatically the data location
  • the derived data location information indicates the storage location of the data of the specified data structures. It is possible that the parallel lambda function of the source code is activated or invoked by a library function, wherein the library function can be read from a library of the parallel computing system.
  • a library function is the so-called spawn function.
  • a localize operation "localize" is inserted automatically. This localize operation determines or
  • in a possible implementation, the compiler unit automatically inserts this localize operation into the argument list of a library function when processing the source code, for instance into the argument list of a spawn function.
  • the inserted localize operation determines as the storage location the memory unit of the parallel
  • the localize operation acquires information by reading the content of a locality vector.
  • the locality vector can be an array that records locality
  • Each entry in the locality vector can represent a certain data block.
  • the content of the array points to a processor indicating in a possible
  • when processing the source code, the compiler unit further automatically inserts an update operation "update" into the parallel lambda function which forms part of the source code.
  • This update operation updates the stored data location information with respect to the storage location of the stored data of the specified data structures.
  • the compiler unit inserts automatically the update operation in the function body of the parallel lambda function.
  • the update operation stores the number or identifier of the processor core which has been the last processor core which had access to the data of the specified data structures in a management list or management table which can be used by the localize operation.
  • the multiprocessor comprises two levels of cache memories which can be placed on a chip.
  • a first level cache L1C can be private, whereas a second level cache L2C and a last level cache LLC can be shared among multiple processor cores.
  • the cache memories can be used transparently, which means that a program can access data as if it resided in the main memory only.
  • the multi-processor comprises four
  • Fig. 3 illustrates a scheduling in a parallel computing system.
  • the scheduling of dynamic multi-tasking computations requires scheduling steps including processor mapping and execution ordering.
  • a scheduler implementation requires a mechanism for resource allocation.
  • a runtime environment is provided to allocate resources such as processors and data structures and to provide a task interface.
  • a parallel task runtime environment can be
  • a task runtime environment TRE can be provided to manage the necessary resources, i.e. processor allocation, thread management and memory allocation for the execution of a multi-tasking application. For that purpose, the task runtime environment TRE can create as many worker threads as
  • processors can be used and pins each worker thread to exactly one processor.
  • Each worker thread can perform an execution loop that continuously fetches and executes tasks in each iteration.
  • the task runtime environment TRE can provide a task-based interface, for instance with a spawn and a sync operation.
  • a dynamic multi-tasking application can be provided on top of the task runtime environment TRE having a task-based
  • the spawn and sync operations allow dynamic task creation and synchronization.
  • an underlying queuing system QS as illustrated in Fig. 3 can provide an interface to the task runtime
  • the queuing system QS can implement in a possible embodiment a scheduling mechanism and data structures to store and schedule tasks.
  • the queuing system QS can obtain runtime information from the task runtime environment to acquire scheduling information, e.g. the number of worker threads.
  • a conventional quicksort algorithm which recursively sorts an array can be expressed as follows: void quicksort(int array[], int left, int right) {
  • the library function spawn takes the lambda function as an argument and generates a new task. After having generated the task, the library function executes the generated task in parallel with the current task. The function sync waits until all generated child tasks have finished. By means of the scheduler, the tasks are distributed at runtime to the different processor cores.
  • the conventional source code of a quicksort algorithm is modified by the use of a parallel lambda function instead of a conventional lambda function. Accordingly, the quicksort algorithm as shown above is implemented using parallel lambda functions as illustrated below: void quicksort(int array[], int left, int right) { if (left < right) {
  • the source code calls the library function spawn twice, and both calls take a parallel lambda function as argument, where the capture list is specified within square brackets.
  • An expression of the form x[i:j] indicates that the body of the lambda function only accesses the array x within the interval ranging from element i to element j.
  • the compiler unit automatically inserts a localize operation and an update operation.
  • the localize operation indicates the storage location of those data which is stored in the data structures specified in the capture list of the parallel lambda function. As can be seen, the localize operation is inserted automatically into the
  • the inserted localize operation determines the storage location of the memory unit which is associated to the processor core which has been the last processor core which had access to the data of the specified data structures.
  • the compiler unit inserts automatically during the processing of the source code an update operation into the parallel lambda function of the source code.
  • the update operation updates the data location information with respect to the storage location of the data of the specified data structures.
  • the update operation is inserted automatically into the function body of the parallel lambda function.
  • the update operation stores the number of the processor core which has been the last processor core having access to the data of the data structures in a management list or management table which can be used by the localize operation.
  • the mechanism shown above can be used in an analogous way for data structures having several dimensions, such as data matrices. It is also possible to use user-defined objects in the capture list of the parallel lambda function instead of data arrays.
  • the functions "localize" and "update" are implemented as methods of a class. This means, for instance for one-dimensional data structures, that the class has to implement the following interface: interface Localizable {
  • the result of the above call can be for instance: spawn(x.localize(i, j), [x, ...]() {
  • the parallel lambda functions can be used for instance in recursive algorithms as illustrated above.
  • the parallel lambda function can be used for other algorithms as well, for instance for parallel loops or pipeline processing.
  • data location information during execution of parallel programs or tasks is derived and utilized automatically.
  • the parallel lambda functions used in the source code allow the compiler unit to generate code that automatically extracts information about data location and memory accesses.
  • the capture lists can be extended to specify relevant regions within regular data structures such as arrays by means of intervals. Further, it is possible to use predefined interfaces for user-defined data structures. With the method according to the present invention, it is possible to automatically insert function and method calls to obtain the required data location information or to update them.
  • the complexity of the required source code is reduced significantly so that the system is less prone to failures. Moreover, it is simpler to read and maintain the used source code because of the reduced complexity of the source code. Further, the method according to the present invention can be used for any parallel computation of different types such as loops, fork-join, divide-and-conquer etc.
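
The quicksort example described in the list above can be approximated in standard C++ as follows. This is only a sketch under stated assumptions: standard C++ has no parallel lambda functions, so the `localize` and `update` calls that the compiler unit would insert are written out by hand, the interval from the capture list is passed as explicit arguments, and `spawn`, `LocalizableArray`, `kCores` and `kCutoff` are hypothetical names. `spawn` is emulated with `std::async`, which cannot pin a task to a core, so the core number is only recorded, not enforced.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

constexpr int kCores = 4;    // assumed number of processor cores
constexpr int kCutoff = 8;   // ranges this small are sorted sequentially

// Hypothetical array wrapper offering the two operations from the text.
struct LocalizableArray {
    std::vector<int> data;
    std::vector<int> last_core;  // per element: last core with access, -1 = unknown

    explicit LocalizableArray(std::vector<int> v)
        : data(std::move(v)), last_core(data.size(), -1) {}

    // "localize": core that last accessed the start of [i, j], or -1.
    int localize(int i, int j) const {
        return i <= j ? last_core[static_cast<std::size_t>(i)] : -1;
    }
    // "update": record `core` as the last core with access to [i, j].
    void update(int i, int j, int core) {
        for (int k = i; k <= j; ++k) last_core[static_cast<std::size_t>(k)] = core;
    }
};

std::atomic<int> next_core{0};

// Stand-in for the library function spawn: picks the hinted core if one is
// known, otherwise the next core in round-robin order, and starts the task.
template <class Body>
std::future<void> spawn(int hint, Body body) {
    int core = hint >= 0 ? hint : next_core.fetch_add(1) % kCores;
    return std::async(std::launch::async, std::move(body), core);
}

// Lomuto partition around the last element of [left, right].
int partition(std::vector<int>& a, int left, int right) {
    int pivot = a[right];
    int i = left;
    for (int k = left; k < right; ++k)
        if (a[k] < pivot) std::swap(a[i++], a[k]);
    std::swap(a[i], a[right]);
    return i;
}

void quicksort(LocalizableArray& x, int left, int right) {
    if (right - left < kCutoff) {  // also covers empty and one-element ranges
        std::sort(x.data.begin() + left, x.data.begin() + right + 1);
        return;
    }
    int p = partition(x.data, left, right);
    // localize appears in the argument list of spawn, update in the body.
    auto lower = spawn(x.localize(left, p - 1), [&x, left, p](int core) {
        x.update(left, p - 1, core);
        quicksort(x, left, p - 1);
    });
    auto upper = spawn(x.localize(p + 1, right), [&x, p, right](int core) {
        x.update(p + 1, right, core);
        quicksort(x, p + 1, right);
    });
    lower.get();  // plays the role of sync: wait for both child tasks
    upper.get();
}
```

The two child tasks work on disjoint intervals, so neither the data nor the per-element locality entries are written concurrently. A real task runtime would use the hint to place the task in the queue of the indicated core instead of starting a new thread per task.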

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

A method for scheduling tasks to processor cores of a parallel computing system comprising the steps of processing a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body of said parallel lambda function and used to derive data location information; executing the task calling said function body on the processor core which is associated to a memory unit of the parallel computing system where the data of the data structures specified by said capture list is stored, wherein the memory unit is selected or localized on the basis of the derived data location information.

Description

Method and apparatus for exploiting data locality in dynamic task scheduling
TECHNICAL BACKGROUND
The invention relates to a method and apparatus for scheduling of tasks in a parallel computing system having several processors, each comprising at least one processor core.
Scheduling is the act of time-sharing resources between multiple resource requesters. In a computer system, tasks are scheduled to utilize processing time on available computing resources. Scheduling can be driven by various decision-making constraints. For example, tasks have to be scheduled to meet certain deadlines or have to use processing resources efficiently to increase the throughput. The emergence of parallel computing systems introduces additional challenges in scheduling. In a uniprocessor system, scheduling comprises the sequencing of tasks to utilize a single processor, whereas in a multiprocessor system, tasks have to be distributed to multiple processors to speed up the execution of the program. However, in a parallel computer system, the processors can have non-uniform memory access times. As a consequence, the execution time of a task can depend on the utilized processor and its memory access time to the data used by the task. For instance, the memory access times can be higher if the execution of a task is mapped to a processor which is located remote from the used data than if it is mapped to a processor that is near the used data.
By means of a scheduler, the tasks are distributed to different processor cores of processors at runtime. To achieve high performance, tasks may be executed on processor cores which already have the data used by the task in their respective cache memory. Otherwise, the used data first has to be loaded, which takes additional time. This is particularly relevant in a multiprocessor system with distributed memory, e.g. a non-uniform memory access (NUMA) system. In such a system, the data has to be loaded under certain circumstances via a communication network from a remote memory, which can lead to a significant reduction of performance of the respective system.
A conventional way to avoid such performance losses is to use heuristics in the scheduler, which for instance make sure that child tasks are executed on the same processor cores as the respective parent tasks, as described for instance in Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou, "An Efficient Multithreaded Runtime System", Symposium on Principles and Practice of Parallel Programming (PPOPP), ACM, 1995. However, these kinds of heuristics fail, for instance, when the same data is accessed several times by sequential loops. The reason is that these heuristics do not have any information about the location of the data in the cache memories or in the main memory of the computer system. To overcome this problem, some libraries offer mechanisms which consider the data location for simple loops and use this information for scheduling. An example of such a concept is "affinity partitioners" used in Threading Building Blocks, which is a library from Intel for parallel programming in C++, as described under http://threadingbuildingblocks.org. However, these kinds of mechanisms cannot be used, for instance, for recursive calculations or algorithms.
Another conventional approach is to use data location information explicitly in the source code of the application to influence the scheduling. For this, the software developer has to indicate where specific data is read or changed. A significant disadvantage of this conventional approach is that the developer has to encode the necessary operations within the source code. This increases the complexity of the source code and makes it more difficult to maintain the developed software code.
Accordingly, there is a need for a method and apparatus for scheduling of tasks of a parallel computing system with several processor cores to increase the performance or throughput of the computing system without increasing the complexity of the source code.
SUMMARY OF THE INVENTION
The invention provides according to a first aspect a method for scheduling tasks to processor cores of a parallel computing system comprising the steps of:
processing a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body and used to derive data location information; and
executing the task calling said function body on the processor core which is associated to a memory unit of the parallel computing system where the data of the data structures specified by said capture list is stored, wherein the memory unit is selected or localized on the basis of the derived data location information.
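
The second step, executing the task on the core associated with the memory unit that holds its data, can be illustrated with one task queue per core. The following is a minimal single-threaded sketch, not the claimed implementation: the class `CoreQueues`, its method names and the round-robin fallback for tasks without location information are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical scheduler front end with one task queue per processor core.
class CoreQueues {
public:
    explicit CoreQueues(std::size_t cores) : queues_(cores) {
        assert(cores > 0);  // at least one core is required
    }

    // Enqueue `task` for `preferred_core`, the core derived from the data
    // location information. A negative or out-of-range value means "unknown";
    // such tasks are spread round-robin. Returns the chosen core.
    std::size_t submit(int preferred_core, std::function<void()> task) {
        std::size_t core;
        if (preferred_core >= 0 &&
            static_cast<std::size_t>(preferred_core) < queues_.size()) {
            core = static_cast<std::size_t>(preferred_core);
        } else {
            core = next_;
            next_ = (next_ + 1) % queues_.size();
        }
        queues_[core].push_back(std::move(task));
        return core;
    }

    // Execute everything queued for one core and return the task count.
    // In a real runtime a worker thread pinned to that core would do this.
    std::size_t run_core(std::size_t core) {
        std::size_t executed = 0;
        while (!queues_[core].empty()) {
            std::function<void()> task = std::move(queues_[core].front());
            queues_[core].pop_front();
            task();
            ++executed;
        }
        return executed;
    }

    std::size_t pending(std::size_t core) const { return queues_[core].size(); }

private:
    std::vector<std::deque<std::function<void()>>> queues_;
    std::size_t next_ = 0;  // next core for tasks without a preference
};
```

Routing by preferred core keeps a task next to the cache or memory unit that already holds its data, while the fallback keeps tasks without location information evenly distributed.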
In a possible embodiment of the method according to the present invention, the capture list of the parallel lambda function indicates external data structures which are used in the function body of the parallel lambda function.
In a further possible embodiment of the method according to the present invention, the parallel lambda function comprises besides the capture list and the function body a parameter list.
In a further possible embodiment of the method according to the present invention, the processing of the source code is performed by a compiler unit which generates automatically code to derive data location information on the basis of the capture list and the parameter list of the parallel lambda function whose function body is called by said task.
In a further possible embodiment of the method according to the present invention, the data location information derived by the code generated by said compiler unit indicates a storage location of the data stored in the specified data structures.
In a further possible embodiment of the method according to the present invention, the parallel lambda function is used by a library function to create a task, wherein the library function is read from a function library of said computing system.
In a further possible embodiment of the method according to the present invention, during processing of the source code by the compiler unit a localize operation is inserted automatically, wherein the localize operation localizes data which is stored in the data structures which are specified by the capture list of the parallel lambda function.
In a further possible embodiment of the method according to the present invention, during processing of the source code, the localize operation is inserted into the argument list of the library function that creates a task executing the function body.
In a further possible embodiment of the method according to the present invention, the localize operation localizes as the storage location the memory unit of the parallel computing system which is associated to the last processor core which had access to the data of the specified data structures.
In a further possible embodiment of the method according to the present invention, during processing of the source code by the compiler unit an update operation is automatically inserted which updates the stored data location information with respect to the storage location of the stored data of the specified data structures.
In a further possible embodiment of the method according to the present invention, during processing of the source code the update operation is inserted into the function body of the parallel lambda function.
In a further possible embodiment of the method according to the present invention, the update operation stores the number of the last processor core which had access to the data of the specified data structures in a management list or management table used by the localize operation.
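
The interplay between the update operation and the localize operation can be sketched with a small management table. The code below is an illustration under assumptions, not the patented implementation: the class name `LocalityTable`, the fixed block size and the majority vote over blocks are all hypothetical choices. The text itself only states that the number of the last accessing core is stored and later read back.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical management table for a one-dimensional data structure.
// The elements are grouped into fixed-size blocks; each entry stores the
// number of the core that last accessed the block, or -1 if unknown.
class LocalityTable {
public:
    // block_size must be greater than zero.
    LocalityTable(std::size_t elements, std::size_t block_size)
        : block_size_(block_size),
          last_core_((elements + block_size - 1) / block_size, -1) {}

    // "update": record that `core` accessed the elements first..last.
    void update(std::size_t first, std::size_t last, int core) {
        assert(first <= last && last / block_size_ < last_core_.size());
        for (std::size_t b = first / block_size_; b <= last / block_size_; ++b)
            last_core_[b] = core;
    }

    // "localize": return the core recorded for most blocks of first..last,
    // or -1 if no block has an owner yet. Ties go to the lowest core number.
    int localize(std::size_t first, std::size_t last) const {
        assert(first <= last && last / block_size_ < last_core_.size());
        std::map<int, std::size_t> votes;
        for (std::size_t b = first / block_size_; b <= last / block_size_; ++b)
            if (last_core_[b] >= 0) ++votes[last_core_[b]];
        int best = -1;
        std::size_t best_votes = 0;
        for (const auto& entry : votes) {
            if (entry.second > best_votes) {
                best = entry.first;
                best_votes = entry.second;
            }
        }
        return best;
    }

private:
    std::size_t block_size_;
    std::vector<int> last_core_;
};
```

Tracking blocks rather than single elements keeps the table small, at the cost of precision when an interval boundary falls inside a block.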
According to a second aspect of the present invention, an apparatus for scheduling of tasks to processor cores of a parallel computing system is provided comprising a compiler unit which processes a source code comprising at least one parallel lambda function having a function body called by a task and which accesses data structures specified by a capture list of said parallel lambda function to derive data location information, wherein the calling task is executed on the processor core which is associated to a memory unit of the parallel computing system which stores the data of the data structures specified in said capture list of said parallel lambda function, wherein the memory unit is selected or localized on the basis of the derived data location information.
In a possible embodiment of the apparatus according to the present invention, the memory unit is a cache memory of a processor of said parallel computing system which comprises several processor cores.
According to a further aspect of the present invention, a computing system is provided which comprises a scheduling apparatus according to the second aspect of the present invention and which comprises several processors each having at least one processor core and distributed memory units each being associated to a corresponding processor.
BRIEF DESCRIPTION OF FIGURES
In the following, possible embodiments and implementations of the method and apparatus for scheduling tasks according to the present invention are described in more detail with reference to the enclosed figures:
Fig. 1 shows a flow chart of a possible embodiment of a method for scheduling of tasks in a parallel computing system;
Fig. 2 shows a diagram of a multi-core processor within a parallel computing system for illustrating the operation of a possible embodiment of the method and apparatus according to the present invention;
Fig. 3 shows a diagram of a scheduling mechanism for illustrating the operation of possible embodiments of the method and apparatus according to the present invention.
DETAILED DESCRIPTION OF POSSIBLE EMBODIMENTS
As can be seen in Fig. 1, in a possible implementation of the method for scheduling tasks to processor cores of a parallel computing system there are two main steps. In a first step S1, a source code is loaded and processed, for instance by a compiler unit of the computer system. The loaded source code comprises at least one parallel lambda function which has a function body. A lambda function is an anonymous function, i.e. a function or subroutine which is defined and possibly called without being bound to an identifier. Anonymous functions are often passed as arguments to higher-order functions. In some programming languages, anonymous functions are introduced by the keyword lambda, so that anonymous functions can be referred to as lambda functions. Anonymous functions are mostly used to contain functionality that does not need to be named. Many programming languages support anonymous functions, for instance C++ since the 2011 standard, called C++11. C++11 provides anonymous functions but no parallel lambda functions. The parallel lambda expression as used by the method according to the present invention has the syntax form:
[[capture list]] (parameters) {body}
The lambda function refers to identifiers declared outside the lambda function. A set of these variables is commonly called a closure. Closures are defined between the square brackets of the lambda function in the declaration of the lambda expression. The mechanism allows these variables to be captured by value or by reference. The capture list indicates which variables or objects declared outside the lambda function are visible inside the lambda function. The parameter list in round brackets specifies parameters, and the third part of the lambda function is the function body. The body of a parallel lambda function can be called by a task and can access data structures which are specified in the capture list of the parallel lambda function. The processing of the source code is performed to automatically derive data location information of the data of the specified data structures. The data structures can comprise several dimensions. A data structure can be for instance a data array or a data matrix. Further, the data structures can be user-defined data structures or user-defined data objects.
After code for deriving the data location information has been generated in step S1, the task calling the function body is assigned to and executed on a processor core of the parallel computing system which is associated to the memory unit of the parallel computing system which stores the data of the data structures specified in the capture list of the parallel lambda function. The memory unit is determined on the basis of the derived data location information. The capture list of the parallel lambda function forming part of the source code indicates external data structures which are used in the function body of the parallel lambda function. In a possible embodiment, the processing of the source code is performed by a compiler unit. The compiler unit generates code which automatically derives the data location information from the capture list and the parameter list of the parallel lambda function whose function body is called by the task. The derived data location information indicates the storage location of the data of the specified data structures. It is possible that the parallel lambda function of the source code is activated or invoked by a library function, wherein the library function can be read from a library of the parallel computing system. An example of such a library function is the so-called spawn function.
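As a point of reference for the capture mechanism described above, the following is a minimal sketch of a conventional C++11 lambda (not a parallel lambda function) that captures outside variables by value and by reference and is passed to a higher-order function. The names run_now and sum_range are hypothetical and chosen only for this illustration; run_now simply calls its argument and does not create a task.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a task-creating library function: it runs the
// callable immediately instead of scheduling a task.
inline void run_now(const std::function<void()>& body) { body(); }

// Sums the elements data[left..right] (inclusive).
// 'data' and 'total' are captured by reference, 'left' and 'right' by value.
inline int sum_range(const std::vector<int>& data, int left, int right) {
    assert(0 <= left && left <= right &&
           right < static_cast<int>(data.size()));
    int total = 0;
    run_now([&data, &total, left, right]() {
        for (int i = left; i <= right; ++i) total += data[i];
    });
    return total;
}
```

The capture list names exactly the outside variables the body uses, which is the information the parallel lambda function makes available to the compiler unit.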
In a possible embodiment of the method according to the present invention, during processing of the source code by the compiler unit a localize operation "localize" is inserted automatically. This localize operation determines or specifies the storage location of the data stored in the data structures which have been specified in the capture list of the parallel lambda function. In a possible implementation, the compiler unit when processing the source code automatically inserts this localize operation into the argument list of a library function, for instance into the argument list of a spawn function. In a possible implementation, the inserted localize operation determines as the storage location the memory unit of the parallel computing system which is assigned to the processor core which has been the last processor core to access the data of the specified data structures, for instance the specified data array.
In a possible embodiment, the localize operation acquires information by reading the content of a locality vector. The locality vector can be an array that records locality information for blocks of data. Each entry in the locality vector can represent a certain data block. The content of the array points to a processor, indicating in a possible implementation that the corresponding data is located in the processor's cache.
In a further possible embodiment, the compiler unit when processing the source code further inserts automatically an update operation "update" in the parallel lambda function which forms part of the source code. This update operation updates the stored data location information with respect to the storage location of the stored data of the specified data structures. In a possible implementation, during processing of the source code the compiler unit automatically inserts the update operation into the function body of the parallel lambda function. In a possible embodiment, the update operation stores the number or identifier of the processor core which has been the last processor core which had access to the data of the specified data structures in a management list or management table which can be used by the localize operation.
Fig. 2 shows a block diagram of a possible embodiment of a multiprocessor having several processor cores P. In the exemplary embodiment shown in Fig. 2, the multiprocessor comprises two levels of cache memories which can be placed on a chip. A first level cache L1C can be private, whereas a second level cache L2C and a last level cache LLC can be shared among multiple processor cores. The cache memories can be used transparently, which means that a program can access data as if it resided in the main memory only. In the shown example of Fig. 2, the multiprocessor comprises four processor cores P. To each processor core, a corresponding memory unit can be associated or assigned. This memory unit can be for instance formed by one of the cache memories integrated on the same chip as the processor core.
Fig. 3 illustrates scheduling in a parallel computing system. The scheduling of dynamic multi-tasking computations requires scheduling steps including processor mapping and execution ordering. Besides the mechanism to map the tasks to processors and to determine the execution order, a scheduler implementation requires a mechanism for resource allocation. A runtime environment is provided to allocate resources such as processors and data structures and to provide a task interface. A parallel task runtime environment can be provided for parallel execution of dynamic multi-tasking computations.
A task runtime environment TRE can be provided to manage the necessary resources, i.e. processor allocation, thread management and memory allocation for the execution of a multi-tasking application. For that purpose, the task runtime environment TRE can create as many worker threads as there are processors that can be used and pin each worker thread to exactly one processor. Each worker thread can perform an execution loop that fetches and executes a task in each iteration. The task runtime environment TRE can provide a task-based interface, for instance with a spawn and a sync operation.
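The execution loop of the worker threads can be sketched as follows. This is a single-threaded simulation under stated assumptions, not the runtime of the source: the names Worker, make_workers and run_until_idle are hypothetical, and the creation and pinning of real threads is deliberately left out so that the loop logic stays visible.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// One worker per processor core. In a real runtime each worker would run its
// loop on its own thread pinned to that core (omitted in this sketch).
struct Worker {
    explicit Worker(int id) : core_id(id) {}

    // One iteration of the execution loop: fetch a task and execute it.
    // Returns false when no task was available.
    bool step() {
        if (queue.empty()) return false;
        Task task = std::move(queue.front());
        queue.pop_front();
        task();
        ++executed;
        return true;
    }

    int core_id;
    std::deque<Task> queue;  // tasks waiting for this worker
    int executed = 0;        // number of tasks executed so far
};

// Creates one worker per usable core; the core count is supplied by the caller.
inline std::vector<Worker> make_workers(int num_cores) {
    assert(num_cores > 0);
    std::vector<Worker> workers;
    for (int c = 0; c < num_cores; ++c) workers.emplace_back(c);
    return workers;
}

// Drives all workers round-robin until every queue is empty.
inline void run_until_idle(std::vector<Worker>& workers) {
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (Worker& w : workers) progressed = w.step() || progressed;
    }
}
```

A caller would push tasks into the queue of the chosen worker and then call run_until_idle; with real threads, each worker would instead repeat step() on its own pinned thread.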
A dynamic multi-tasking application can be provided on top of the task runtime environment TRE having a task-based interface. The spawn and sync operations allow dynamic task creation and synchronization.
Further, an underlying queuing system QS as illustrated in Fig. 3 can provide an interface to the task runtime environment TRE with an enqueue and a dequeue operation to store and fetch tasks for execution. Internally, the queuing system QS can implement in a possible embodiment a scheduling mechanism and data structures to store and schedule tasks. The queuing system QS is capable of obtaining runtime information from the task runtime environment to acquire scheduling information, e.g. the number of worker threads.
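A queuing system with an enqueue and a dequeue operation could look like the following minimal sketch. It assumes one queue per worker thread and, as an additional assumption not taken from the source, lets a worker with an empty queue take a task from another worker's queue. The class name QueuingSystem and the parameter preferred_worker are hypothetical, and locking for concurrent access is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Task = std::function<void()>;

class QueuingSystem {
public:
    // The number of worker threads is the runtime information obtained from
    // the task runtime environment.
    explicit QueuingSystem(std::size_t num_workers) : queues_(num_workers) {
        assert(num_workers > 0);
    }

    // Stores a task in the queue of the worker it should preferably run on.
    void enqueue(Task task, std::size_t preferred_worker) {
        assert(preferred_worker < queues_.size());
        queues_[preferred_worker].push_back(std::move(task));
    }

    // Fetches the next task for 'worker': first from its own queue, then from
    // the other queues in cyclic order. Returns false if all queues are empty.
    bool dequeue(std::size_t worker, Task& out) {
        assert(worker < queues_.size());
        for (std::size_t k = 0; k < queues_.size(); ++k) {
            std::deque<Task>& q = queues_[(worker + k) % queues_.size()];
            if (!q.empty()) {
                out = std::move(q.front());
                q.pop_front();
                return true;
            }
        }
        return false;
    }

    std::size_t num_workers() const { return queues_.size(); }

private:
    std::vector<std::deque<Task>> queues_;  // one task queue per worker
};
```

Keeping one queue per worker is what allows a task to be stored close to the core whose memory unit holds its data; the fallback to other queues only keeps idle workers busy.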
In the following, the operation of a possible implementation of the method for scheduling tasks to processor cores of a parallel computing system is described by a simple example of a quicksort algorithm. A conventional quicksort algorithm which sorts recursively an array can be expressed as follows:
void quicksort(int array[], int left, int right) {
  if (left < right) {
    int pivot = partition(array, left, right);
    spawn([array, left, pivot]() { quicksort(array, left, pivot); });
    spawn([array, pivot, right]() { quicksort(array, pivot+1, right); });
    sync();
  }
}
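To try the conventional quicksort listing without a task runtime, the following sketch supplies serial stand-ins for spawn and sync and a partition function. All three are assumptions: the source does not define partition, and a real runtime would schedule the spawned tasks on worker threads instead of running them immediately. The Hoare-style partition is chosen because it matches the recursion on the ranges left..pivot and pivot+1..right.

```cpp
#include <cassert>
#include <utility>

namespace sketch {

// Serial stand-ins (assumptions): spawn runs the task body immediately, so
// sync has nothing to wait for.
template <typename F>
void spawn(F body) { body(); }
inline void sync() {}

// Hoare-style partition (an assumption; not defined in the source).
// Returns an index p with left <= p < right such that every element of
// array[left..p] is <= every element of array[p+1..right].
inline int partition(int array[], int left, int right) {
    assert(left < right);
    int pivot_value = array[left + (right - left) / 2];
    int i = left - 1, j = right + 1;
    while (true) {
        do { ++i; } while (array[i] < pivot_value);
        do { --j; } while (array[j] > pivot_value);
        if (i >= j) return j;
        std::swap(array[i], array[j]);
    }
}

// The quicksort from the listing above; the array pointer and the bounds are
// captured by value.
inline void quicksort(int array[], int left, int right) {
    if (left < right) {
        int pivot = partition(array, left, right);
        spawn([array, left, pivot]() { quicksort(array, left, pivot); });
        spawn([array, pivot, right]() { quicksort(array, pivot + 1, right); });
        sync();
    }
}

}  // namespace sketch
```

Running the two spawned bodies one after the other is a valid schedule of the fork-join program, since the two recursive calls work on disjoint parts of the array.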
The library function spawn takes the lambda function as an argument and generates a new task. After having generated the task, the library function executes the generated task in parallel with the current task. The function sync waits until all generated child tasks have finished. By means of the scheduler, the tasks are distributed during runtime to the different processor cores. The conventional source code of a quicksort algorithm is modified by the use of a parallel lambda function instead of a conventional lambda function. Accordingly, the quicksort algorithm as shown above is implemented using parallel lambda functions as illustrated below:
void quicksort(int array[], int left, int right) {
  if (left < right) {
    int pivot = partition(array, left, right);
    spawn([[array[left:pivot], left, pivot]]() {
      quicksort(array, left, pivot);
    });
    spawn([[array[pivot+1:right], pivot, right]]() {
      quicksort(array, pivot+1, right);
    });
    sync();
  }
}
The illustrated source code calls the library function spawn twice, and both calls take a parallel lambda function as argument, where the capture list is specified within two square brackets. An expression of the form x[i:j] indicates that the body of the lambda function only accesses the array x within the interval ranging from element i to element j.
A call of the library function spawn with a parallel lambda function as an argument, spawn([[x[i:j], ...]]() {body}), corresponds to the following fragment and can be transformed accordingly by the compiler unit:
spawn(localize(x, i, j), [x, ...]() {
  body;
  update(x, i, j);
});
As can be seen, the compiler unit automatically inserts a localize operation and an update operation. The localize operation indicates the storage location of the data stored in the data structures specified in the capture list of the parallel lambda function. As can be seen, the localize operation is inserted automatically into the argument list of the library function "spawn". The inserted localize operation determines the storage location of the memory unit which is associated to the processor core which has been the last processor core which had access to the data of the specified data structures. Moreover, the compiler unit inserts automatically during the processing of the source code an update operation into the parallel lambda function of the source code. The update operation updates the data location information with respect to the storage location of the data of the specified data structures. As can be seen, the update operation is inserted automatically into the function body of the parallel lambda function. In a possible embodiment, the update operation stores the number of the processor core which has been the last processor core having access to the data of the data structures in a management list or management table which can be used by the localize operation.
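The localize and update operations can be sketched on top of a locality vector as follows. Several details are assumptions made for this illustration only: the block size, the marker kNoCore for untouched blocks, the explicit core argument of update (in the source the executing core is implicit), and the majority vote used when an interval spans blocks last accessed by different cores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

const int kNoCore = -1;            // block has not been accessed yet
const std::size_t kBlockSize = 4;  // elements per block (assumed granularity)

// Locality vector: entry b holds the core that last accessed block b.
struct LocalityVector {
    explicit LocalityVector(std::size_t num_elements)
        : last_core((num_elements + kBlockSize - 1) / kBlockSize, kNoCore) {}
    std::vector<int> last_core;
};

// localize: returns the core that last accessed most blocks of the interval
// [i, j], or kNoCore if none of these blocks has been accessed yet.
inline int localize(const LocalityVector& lv, std::size_t i, std::size_t j,
                    int num_cores) {
    assert(i <= j && j / kBlockSize < lv.last_core.size());
    std::vector<int> votes(num_cores, 0);
    for (std::size_t b = i / kBlockSize; b <= j / kBlockSize; ++b) {
        assert(lv.last_core[b] < num_cores);
        if (lv.last_core[b] != kNoCore) ++votes[lv.last_core[b]];
    }
    int best = kNoCore;
    for (int c = 0; c < num_cores; ++c)
        if (votes[c] > 0 && (best == kNoCore || votes[c] > votes[best]))
            best = c;
    return best;
}

// update: records 'core' as the last core that accessed the blocks of [i, j].
inline void update(LocalityVector& lv, std::size_t i, std::size_t j,
                   int core) {
    assert(i <= j && j / kBlockSize < lv.last_core.size());
    for (std::size_t b = i / kBlockSize; b <= j / kBlockSize; ++b)
        lv.last_core[b] = core;
}

}  // namespace sketch
```

In the transformed spawn call, the result of localize would tell the scheduler on which core to place the task, and the update call at the end of the body would record the core that actually executed it.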
The mechanism shown above can be used in an analogous way for data structures having several data dimensions such as data matrices. It is also possible to use user-defined objects in the capture list of the parallel lambda function instead of data arrays. In a preferred implementation, the functions "localize" and "update" are implemented as methods of a class. This means for instance for one-dimensional data structures that the class has to implement the following interface:
interface Localizable {
  localize(i, j);
  update(i, j);
};
The result of the above call can be for instance:
spawn(x.localize(i, j), [x, ...]() {
  body;
  x.update(i, j);
});
The same is true for higher-dimensional user-defined data structures.
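In C++, the Localizable interface could be expressed as an abstract class, as in the following sketch. The return type of localize, the explicit core argument of update and the class TrackedArray with its simple "owner of the first element" policy are assumptions for illustration; the source only names the two methods and their interval arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Interface for one-dimensional data structures, modelled on the Localizable
// interface of the source. Return and parameter types are assumptions.
class Localizable {
public:
    virtual ~Localizable() {}
    // Returns the core that last accessed elements i..j, or -1 if unknown.
    virtual int localize(std::size_t i, std::size_t j) const = 0;
    // Records that 'core' accessed elements i..j.
    virtual void update(std::size_t i, std::size_t j, int core) = 0;
};

// Hypothetical user-defined array that tracks the last core per element.
class TrackedArray : public Localizable {
public:
    explicit TrackedArray(std::size_t n) : values(n, 0), owner_(n, -1) {}

    int localize(std::size_t i, std::size_t j) const override {
        assert(i <= j && j < owner_.size());
        return owner_[i];  // simple policy: owner of the first element
    }

    void update(std::size_t i, std::size_t j, int core) override {
        assert(i <= j && j < owner_.size());
        for (std::size_t k = i; k <= j; ++k) owner_[k] = core;
    }

    std::vector<int> values;  // payload of the data structure

private:
    std::vector<int> owner_;  // last core per element, -1 if untouched
};

}  // namespace sketch
```

Because the scheduler only depends on the interface, any user-defined data structure that implements the two methods can appear in the capture list in the same way as an array.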
The parallel lambda functions can be used for instance in recursive algorithms as illustrated above. The parallel lambda function can be used for other algorithms as well, for instance for parallel loops or pipeline processing.
With the method according to the present invention, data location information during execution of parallel programs or tasks is derived and utilized automatically. The parallel lambda functions used in the source code allow the compiler unit to generate code that automatically extracts information about data location and memory accesses. In a possible embodiment, the capture lists can be extended to specify relevant regions within regular data structures such as arrays by means of intervals. Further, it is possible to use predefined interfaces for user-defined data structures. With the method according to the present invention, it is possible to automatically insert function and method calls to obtain the required data location information or to update them.
With the method and apparatus according to the present invention, the complexity of the required source code is reduced significantly so that the system is less prone to failures. Moreover, it is simpler to read and maintain the used source code because of the reduced complexity of the source code. Further, the method according to the present invention can be used for any parallel computation of different types such as loops, fork-join, divide-and-conquer etc. By supporting user-defined data structures, the flexibility of the method and apparatus is further increased. By using the method and apparatus according to the present invention, the performance of the computing system is increased significantly because of the use of the data location information.

Claims

1. A method for scheduling tasks to processor cores of a parallel computing system comprising the steps of:
(a) processing a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body of said parallel lambda function and used to derive data location information;
(b) executing the task calling said function body on the processor core which is associated to a memory unit of the parallel computing system where the data of the data structures specified by said capture list is stored, wherein the memory unit is selected or localized on the basis of the derived data location information.
2. The method according to claim 1, wherein the capture list of the parallel lambda function indicates external data structures which are used by the function body of said lambda function.
3. The method according to claim 1 or 2, wherein the parallel lambda function comprises besides the capture list and the function body a parameter list.
4. The method according to one of the preceding claims 1 to 3, wherein the processing of the source code is performed by a compiler unit which generates code to derive the data location information from the capture list and the parameter list of the parallel lambda function whose function body is called by said task.
5. The method according to one of the preceding claims 1 to 4, wherein the data location information derived by the code generated by said compiler unit indicates a storage location of the data stored in the specified data structures.
6. The method according to one of the preceding claims 1 to 5, wherein the parallel lambda function is used by a library function of the parallel computing system.
7. The method according to one of the preceding claims 1 to 6, wherein upon processing of the source code by the compiler unit a localize operation is automatically inserted, wherein the localize operation localizes the storage location of the data stored in the data structures which are specified in the capture list of said parallel lambda function.
8. The method according to claim 7, wherein upon processing of said source code the localize operation is inserted into the argument list of the library function.
9. The method according to claim 7 or 8, wherein the localize operation determines as the storage location the memory unit of said parallel computing system which is associated to the last processor core which had access to the data of the specified data structures.
10. The method according to one of the preceding claims 1 to 9, wherein upon processing of said source code by said compiler unit an update operation is automatically inserted in said parallel lambda function which updates the stored data location information with respect to the storage location of the data stored in said specified data structures.
11. The method according to claim 10, wherein upon processing of said source code the update operation is inserted into the function body of said lambda function.
12. The method according to claim 10 or 11, wherein the update operation stores the number of the last processor core which had access to the data of the specified data structures in a management list or management table to which the localize operation has access.
13. An apparatus for scheduling of tasks to processor cores of a parallel computing system comprising a compiler unit which processes automatically a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body of said parallel lambda function and used to derive data location information, wherein the calling task is executed on the processor core associated to a memory unit which is selected on the basis of the derived data location information and which stores the data of the data structures specified in said capture list of said parallel lambda function.
14. The apparatus according to claim 13, wherein the memory unit comprises a cache memory of a processor within said parallel computing system comprising at least one processor core.
15. A computing system comprising an apparatus according to claim 13, several processors each having at least one processor core and distributed memory units each being associated to a corresponding processor core.
PCT/EP2014/051193 2013-04-18 2014-01-22 Method and apparatus for exploiting data locality in dynamic task scheduling WO2014170036A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP14703041.5A EP2943877B1 (en) 2013-04-18 2014-01-22 Method and apparatus for exploiting data locality in dynamic task scheduling

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13/865,856 2013-04-18
US13/865,856 US9176716B2 (en) 2013-04-18 2013-04-18 Method and apparatus for exploiting data locality in dynamic task scheduling

Publications (1)

Publication Number Publication Date
WO2014170036A1 true WO2014170036A1 (en) 2014-10-23

Family

ID=50068972

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/051193 WO2014170036A1 (en) 2013-04-18 2014-01-22 Method and apparatus for exploiting data locality in dynamic task scheduling

Country Status (3)

Country Link
US (1) US9176716B2 (en)
EP (1) EP2943877B1 (en)
WO (1) WO2014170036A1 (en)

Also Published As

Publication number Publication date
EP2943877A1 (en) 2015-11-18
EP2943877B1 (en) 2020-12-16
US9176716B2 (en) 2015-11-03
US20140317636A1 (en) 2014-10-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14703041

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2014703041

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE