WO2014170036A1

WO2014170036A1 - Method and apparatus for exploiting data locality in dynamic task scheduling

Info

Publication number: WO2014170036A1
Application number: PCT/EP2014/051193
Authority: WO
Inventors: Sebastian MATTHEIS; Tobias SCHÜLE
Original assignee: Siemens Aktiengesellschaft
Priority date: 2013-04-18
Filing date: 2014-01-22
Publication date: 2014-10-23
Also published as: EP2943877A1; EP2943877B1; US9176716B2; US20140317636A1

Abstract

A method for scheduling tasks to processor cores of a parallel computing system comprising the steps of processing a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body of said parallel lambda function and used to derive data location information; executing the task calling said function body on the processor core which is associated to a memory unit of the parallel computing system where the data of the data structures specified by said capture list is stored, wherein the memory unit is selected or localized on the basis of the derived data location information.

Description

Method and apparatus for exploiting data locality in dynamic task scheduling

TECHNICAL BACKGROUND

The invention relates to a method and apparatus for scheduling of tasks in a parallel computing system having several processors each comprising at least one processor core.

Scheduling is the act of time-sharing resources between multiple resource requesters. In a computer system, tasks are scheduled to utilize processing time on available computing resources. Scheduling can be driven by various decision-making constraints. For example, tasks have to be scheduled to meet certain deadlines or have to use processing resources efficiently to increase the throughput. The emergence of parallel computing systems introduces additional challenges in scheduling. In a uniprocessor system, scheduling comprises the sequencing of tasks to utilize a single processor, whereas in a multiprocessor system, tasks have to be distributed to multiple processors to speed up the execution of the program. However, in a parallel computer system, the processors can have non-uniform memory access times. As a consequence, the execution time of a task can depend on the utilized processor and its memory access time to the data used by the task. For instance, the memory access times can be higher if the execution of a task is mapped to a processor which is located remote from the used data than if it is mapped to a processor that is near the used data.

By means of a scheduler, the tasks are distributed to different processor cores of processors at runtime. To achieve a high performance, tasks may be executed on processor cores which already hold the data used by the task in their respective cache memory. Otherwise, the used data first has to be loaded, which takes additional time. This is particularly relevant in a multiprocessor system with distributed memory, e.g. a non-uniform memory access (NUMA) system. In such a system, the data has to be loaded under certain circumstances via a communication network from a remote memory, which can lead to a significant reduction of performance of the respective system.

A conventional way to avoid such performance losses is to use heuristics in the scheduler which, for instance, make sure that child tasks are executed on the same processor cores as the respective parent tasks, as described for instance in Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou, "An Efficient Multithreaded Runtime System", Symposium on Principles and Practice of Parallel Programming (PPOPP), ACM, 1995. However, these kinds of heuristics fail, for instance, when the same data is accessed several times by sequential loops. The reason is that these heuristics do not have any information about the location of the data in the cache memories or in the main memory of the computer system.

To overcome this problem, some libraries offer mechanisms which consider the data location for simple loops and use this information for scheduling. An example of such a concept is "affinity partitioners" used in Threading Building Blocks, which is a library from Intel for parallel programming in C++, as described under http://threadingbuildingblocks.org. However, these kinds of mechanisms cannot be used, for instance, for recursive calculations or algorithms.

Another conventional approach is to explicitly use data location information in the source code of the application to influence the scheduling. For this, the software developer has to indicate where specific data is read or changed. Obviously, a significant disadvantage of this conventional approach is that the developer has to encode the necessary operations within the source code. This increases the complexity of the source code and makes it more difficult to maintain the developed software code.

Accordingly, there is a need for a method and apparatus for scheduling of tasks of a parallel computing system with several processor cores to increase the performance or throughput of the computing system without increasing the complexity of the source code.

SUMMARY OF THE INVENTION

The invention provides according to a first aspect a method for scheduling tasks to processor cores of a parallel computing system comprising the steps of:

processing a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body and used to derive data location information; and

executing the task calling said function body on the processor core which is associated to a memory unit of the parallel computing system where the data of the data structures specified by said capture list is stored, wherein the memory unit is selected or localized on the basis of the derived data location information.

In a possible embodiment of the method according to the present invention, the capture list of the parallel lambda function indicates external data structures which are used in the function body of the parallel lambda function.

In a further possible embodiment of the method according to the present invention, the parallel lambda function comprises besides the capture list and the function body a parameter list.

In a further possible embodiment of the method according to the present invention, the processing of the source code is performed by a compiler unit which generates automatically code to derive data location information on the basis of the capture list and the parameter list of the parallel lambda function whose function body is called by said task.

In a further possible embodiment of the method according to the present invention, the data location information derived by the code generated by said compiler unit indicates a storage location of the data stored in the specified data structures.

In a further possible embodiment of the method according to the present invention, the parallel lambda function is used by a library function to create a task, wherein the library function is read from a function library of said computing system.

In a further possible embodiment of the method according to the present invention, during processing of the source code by the compiler unit a localize operation is inserted automatically, wherein the localize operation localizes data which is stored in the data structures which are specified by the capture list of the parallel lambda function.

In a further possible embodiment of the method according to the present invention, during processing of the source code, the localize operation is inserted into the argument list of the library function that creates a task executing the function body.

In a further possible embodiment of the method according to the present invention, the localize operation localizes as the storage location the memory unit of the parallel computing system which is associated to the last processor core which had access to the data of the specified data structures.

In a further possible embodiment of the method according to the present invention, during processing of the source code by the compiler unit an update operation is automatically inserted which updates the stored data location information with respect to the storage location of the stored data of the specified data structures.

In a further possible embodiment of the method according to the present invention, during processing of the source code the update operation is inserted into the function body of the parallel lambda function.

In a further possible embodiment of the method according to the present invention, the update operation stores the number of the last processor core which had access to the data of the specified data structures in a management list or management table used by the localize operation.

According to a second aspect of the present invention, an apparatus for scheduling of tasks to processor cores of a parallel computing system is provided comprising a compiler unit which processes a source code comprising at least one parallel lambda function having a function body called by a task and which accesses data structures specified by a capture list of said parallel lambda function to derive data location information, wherein the calling task is executed on the processor core which is associated to a memory unit of the parallel computing system which stores the data of the data structures specified in said capture list of said parallel lambda function, wherein the memory unit is selected or localized on the basis of the derived data location information.

In a possible embodiment of the apparatus according to the present invention, the memory unit is a cache memory of a processor of said parallel computing system which comprises several processor cores.

According to a further aspect of the present invention, a computing system is provided which comprises a scheduling apparatus according to the second aspect of the present invention and which comprises several processors each having at least one processor core and distributed memory units each being associated to a corresponding processor.

BRIEF DESCRIPTION OF FIGURES

In the following, possible embodiments and implementations of the method and apparatus for scheduling tasks according to the present invention are described in more detail with reference to the enclosed figures:

Fig. 1 shows a flow chart of a possible embodiment of a method for scheduling of tasks in a parallel computing system;

Fig. 2 shows a diagram of a multi-core processor within a parallel computing system for illustrating the operation of a possible embodiment of the method and apparatus according to the present invention;

Fig. 3 shows a diagram of a scheduling mechanism for illustrating the operation of possible embodiments of the method and apparatus according to the present invention.

DETAILED DESCRIPTION OF POSSIBLE EMBODIMENTS

As can be seen in Fig. 1, in a possible implementation of the method for scheduling tasks to processor cores of a parallel computing system there are two main steps. In a first step S1, a source code is loaded and processed, for instance by a compiler unit of the computer system. The loaded source code comprises at least one parallel lambda function which has a function body. The lambda function is an anonymous function, i.e. a function or a subroutine which is defined and possibly called without being bound to an identifier. Anonymous functions are often used to pass an argument to a higher-order function. In some programming languages, anonymous functions are introduced by the keyword lambda, so that anonymous functions can be referred to as lambda functions. Anonymous functions are mostly used to contain functionality that does not need to be named. Many programming languages support anonymous functions, for instance C++ since the 2011 standard, called C++11. C++11 provides anonymous functions but no parallel lambda functions. The parallel lambda expression as used by the method according to the present invention has the syntax form:

    [[capture list]] (parameters) {body}

The lambda function can refer to identifiers declared outside the lambda function. A set of these variables is commonly called a closure. Closures are defined between the square brackets in the declaration of the lambda expression. The mechanism allows these variables to be captured by value or by reference. The capture list indicates which variables or objects declared outside the lambda function are visible inside the lambda function. The parameter list in round brackets specifies parameters, and the third part of the lambda function is its function body.

The body of a parallel lambda function can be called by a task and can access data structures which are specified in the capture list of the parallel lambda function. The processing of the source code is performed to automatically derive data location information of the data of the specified data structures. The data structures can comprise several dimensions. A data structure can be, for instance, a data array or a data matrix. Further, the data structures can be user-defined data structures or user-defined data objects.

After the code for deriving the data location information has been generated in step S1, the task calling the function body is, in the second step, assigned to and executed on a processor core of the parallel computing system which is associated to the memory unit of the parallel computing system which stores the data of the data structures specified in the capture list of the parallel lambda function. The memory unit is determined on the basis of the derived data location information. The capture list of the parallel lambda function forming part of the source code indicates external data structures which are used in the function body of the parallel lambda function.

In a possible embodiment, the processing of the source code is performed by a compiler unit. The compiler unit generates code which automatically derives the data location information from the capture list and the parameter list of the parallel lambda function whose function body is called by the task. The derived data location information indicates the storage location of the data of the specified data structures. It is possible that the parallel lambda function of the source code is activated or invoked by a library function, wherein the library function can be read from a library of the parallel computing system. An example of such a library function is the so-called spawn function.
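The capture-list terminology can be made concrete with an ordinary C++11 lambda. The following is only a minimal sketch: it uses the standard single-bracket capture list, since the double-bracket parallel lambda syntax is not part of standard C++, and the helper name make_sum_task is hypothetical. The capture list names exactly the external objects the body touches, which is the information a compiler unit could use to derive data location information.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper: builds a task body as a standard C++11 lambda.
// The capture list [&data, left, right] states which external objects the
// body uses: `data` is captured by reference, `left` and `right` by value.
// The caller must keep `data` alive for as long as the task may run.
inline std::function<long()> make_sum_task(const std::vector<int>& data,
                                           int left, int right) {
    return [&data, left, right]() {
        long sum = 0;
        // The body only accesses data[left..right] (inclusive bounds).
        for (int i = left; i <= right; ++i) {
            sum += data[i];
        }
        return sum;
    };
}
```

Because `data` is captured by reference, a later change to the vector is visible when the task runs, whereas `left` and `right` are frozen at the time the lambda is created.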

In a possible embodiment of the method according to the present invention, during processing of the source code by the compiler unit a localize operation "localize" is inserted automatically. This localize operation determines or specifies the storage location of the data which is stored in the data structures which have been specified in the capture list of the parallel lambda function. In a possible implementation, the compiler unit, when processing the source code, automatically inserts this localize operation into the argument list of a library function, for instance into the argument list of a spawn function. In a possible implementation, the inserted localize operation determines as the storage location the memory unit of the parallel computing system which is assigned to the processor core which has been the last processor core to access the data of the specified data structures, for instance the specified data array. In a possible embodiment, the localize operation acquires information by reading the content of a locality vector. The locality vector can be an array that records locality information for blocks of data. Each entry in the locality vector can represent a certain data block. The content of the array points to a processor, indicating in a possible implementation that the corresponding data is located in the processor's cache.

In a further possible embodiment, the compiler unit, when processing the source code, further automatically inserts an update operation "update" into the parallel lambda function which forms part of the source code. This update operation updates the stored data location information with respect to the storage location of the stored data of the specified data structures. In a possible implementation, during processing of the source code, the compiler unit automatically inserts the update operation into the function body of the parallel lambda function. In a possible embodiment, the update operation stores the number or identifier of the processor core which has been the last processor core which had access to the data of the specified data structures in a management list or management table which can be used by the localize operation.

Fig. 2 shows a block diagram of a possible embodiment of a multiprocessor having several processor cores P. In the exemplary embodiment shown in Fig. 2, the multiprocessor comprises two levels of cache memories which can be placed on a chip. A first level cache L1C can be private, whereas a second level cache L2C and a last level cache LLC can be shared among multiple processor cores. The cache memories can be used transparently, which means that a program can access data as if it resided in the main memory only. In the shown example of Fig. 2, the multiprocessor comprises four processor cores P. To each processor core, a corresponding memory unit can be associated or assigned. This memory unit can be formed, for instance, by one of the cache memories integrated on the same chip as the processor core.

Fig. 3 illustrates a scheduling in a parallel computing system. The scheduling of dynamic multi-tasking computations requires scheduling steps including processor mapping and execution ordering. Besides the mechanism to map the tasks to processors and to determine the execution order, a scheduler implementation requires a mechanism for resource allocation. A runtime environment is provided to allocate resources such as processors and data structures and to provide a task interface. A parallel task runtime environment can be provided for parallel execution of dynamic multi-tasking computations.

A task runtime environment TRE can be provided to manage the necessary resources, i.e. processor allocation, thread management and memory allocation for the execution of a multi-tasking application. For that purpose, the task runtime environment TRE can create as many worker threads as there are usable processors and pin each worker thread to exactly one processor. Each worker thread can perform an execution loop that continuously fetches and executes tasks. The task runtime environment TRE can provide a task-based interface, for instance with a spawn and a sync operation.

A dynamic multi-tasking application can be provided on top of the task runtime environment TRE having a task-based interface. The spawn and sync operations allow dynamic task creation and synchronization.

Further, an underlying queuing system QS as illustrated in Fig. 3 can provide an interface to the task runtime environment TRE with an enqueue and a dequeue operation to store and fetch tasks for execution. Internally, the queuing system QS can, in a possible embodiment, implement a scheduling mechanism and data structures to store and schedule tasks. The queuing system QS is capable of obtaining runtime information from the task runtime environment to acquire scheduling information, e.g. the number of worker threads.
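The enqueue and dequeue interface of such a queuing system can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the described system: the class name QuetingSystem is avoided in favor of the hypothetical QueuingSystem, the one-queue-per-core layout, the locality hint passed to enqueue and the fallback to other queues are all assumptions, and the sketch is single-threaded (a real runtime would need locks or lock-free queues).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

using Task = std::function<void()>;

// Hypothetical queuing system: one FIFO queue per processor core.
// Not thread-safe; `cores` must be greater than zero.
class QueuingSystem {
public:
    explicit QueuingSystem(std::size_t cores) : queues_(cores) {}

    // Store a task in the queue of the core named by the locality hint
    // (assumed to be the core whose memory unit holds the task's data).
    void enqueue(std::size_t preferred_core, Task task) {
        queues_[preferred_core % queues_.size()].push_back(std::move(task));
    }

    // Fetch a task for worker `core`: its own queue is tried first, then
    // the remaining queues in round-robin order.
    // Returns false when no task is available anywhere.
    bool dequeue(std::size_t core, Task& out) {
        const std::size_t n = queues_.size();
        for (std::size_t k = 0; k < n; ++k) {
            std::deque<Task>& q = queues_[(core + k) % n];
            if (!q.empty()) {
                out = std::move(q.front());
                q.pop_front();
                return true;
            }
        }
        return false;
    }

private:
    std::vector<std::deque<Task>> queues_;
};
```

Trying the worker's own queue first keeps a task near its data when possible, while the fallback to other queues keeps idle cores busy at the cost of locality.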

In the following, the operation of a possible implementation of the method for scheduling tasks to processor cores of a parallel computing system is described by a simple example of a quicksort algorithm. A conventional quicksort algorithm which recursively sorts an array can be expressed as follows:

    void quicksort(int array[], int left, int right) {
        if (left < right) {
            int pivot = partition(array, left, right);
            spawn([array, left, pivot]() { quicksort(array, left, pivot); });
            spawn([array, pivot, right]() { quicksort(array, pivot+1, right); });
            sync();
        }
    }

The library function spawn takes the lambda function as an argument and generates a new task. After having generated the task, the library function executes the generated task in parallel with the current task. The function sync waits until all generated child tasks have finished. By means of the scheduler, the tasks are distributed during runtime to the different processor cores. The conventional source code of a quicksort algorithm is modified by the use of a parallel lambda function instead of a conventional lambda function. Accordingly, the quicksort algorithm as shown above is implemented using parallel lambda functions as illustrated below:

    void quicksort(int array[], int left, int right) {
        if (left < right) {
            int pivot = partition(array, left, right);
            spawn([[array[left:pivot], left, pivot]]() {
                quicksort(array, left, pivot);
            });
            spawn([[array[pivot+1:right], pivot, right]]() {
                quicksort(array, pivot+1, right);
            });
            sync();
        }
    }

In the illustrated source code, the library function spawn is called twice, and both calls take a parallel lambda function as argument, where the capture list is specified within two square brackets. An expression of the form x[i:j] indicates that the body of the lambda function only accesses the array x within the interval ranging from element i to element j.

A call of the library function spawn with a parallel lambda function as an argument, of the form spawn([[x[i:j], ...]]() {body}), corresponds to the following fragment and can be transformed accordingly by the compiler unit:

    spawn(localize(x, i, j), [x, ...]() {
        body;
        update(x, i, j);
    });

As can be seen, the compiler unit automatically inserts a localize operation and an update operation. The localize operation indicates the storage location of the data which is stored in the data structures specified in the capture list of the parallel lambda function. As can be seen, the localize operation is inserted automatically into the argument list of the library function "spawn". The inserted localize operation determines the storage location of the memory unit which is associated to the processor core which has been the last processor core which had access to the data of the specified data structures. Moreover, the compiler unit automatically inserts during the processing of the source code an update operation into the parallel lambda function of the source code. The update operation updates the data location information with respect to the storage location of the data of the specified data structures. As can be seen, the update operation is inserted automatically into the function body of the parallel lambda function. In a possible embodiment, the update operation stores the number of the processor core which has been the last processor core having access to the data of the data structures in a management list or management table which can be used by the localize operation.
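How localize and update could cooperate through a management table built from locality vectors can be sketched in C++. Everything below is an assumption made for illustration: the class name LocalityTable, the block size of 64 elements, keying the table by the array's base address, the explicit core argument of update (a real runtime would presumably query the executing core itself), and the simplification that localize only inspects the block containing element i.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

constexpr std::size_t kBlockSize = 64;  // elements per tracked block (assumed)
constexpr int kUnknownCore = -1;        // no core recorded for this block yet

// Hypothetical management table: for each array, identified by its base
// address, a locality vector with one "last accessing core" entry per block.
class LocalityTable {
public:
    // localize: returns the core that last accessed the block containing
    // element i, or kUnknownCore if nothing has been recorded for it.
    // Simplification: the end of the interval (j) is not inspected.
    int localize(const void* base, std::size_t i, std::size_t /*j*/) const {
        auto it = table_.find(base);
        if (it == table_.end()) return kUnknownCore;
        const std::size_t block = i / kBlockSize;
        return block < it->second.size() ? it->second[block] : kUnknownCore;
    }

    // update: records `core` as the last core that accessed elements i..j
    // (inclusive, i <= j), growing the locality vector when needed.
    void update(const void* base, std::size_t i, std::size_t j, int core) {
        std::vector<int>& blocks = table_[base];
        const std::size_t last = j / kBlockSize;
        if (blocks.size() <= last) blocks.resize(last + 1, kUnknownCore);
        for (std::size_t b = i / kBlockSize; b <= last; ++b) blocks[b] = core;
    }

private:
    std::unordered_map<const void*, std::vector<int>> table_;
};
```

In terms of the transformed spawn call, the value returned by localize would serve as the scheduling hint passed to spawn, and the update call at the end of the function body would refresh the table once the task has touched its interval.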

The mechanism shown above can be used analogously for data structures having several data dimensions, such as data matrices. It is also possible to use user-defined objects in the capture list of the parallel lambda function instead of data arrays. In a preferred implementation, the functions "localize" and "update" are implemented as methods of a class. For one-dimensional data structures, this means for instance that the class has to implement the following interface:

    interface Localizable {
        localize(i, j);
        update(i, j);
    }

The result of the above call can be for instance:

    spawn(x.localize(i, j), [x, ...]() {
        body;
        x.update(i, j);
    });

The same is true for high-dimensional user-defined data structures.
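The one-dimensional Localizable interface could be rendered in C++ as an abstract class that a user-defined data structure implements. The sketch below is hypothetical in several respects: the class TrackedArray, the per-element owner bookkeeping, the majority vote used by localize, and the extra core argument of update (the interface in the text is update(i, j)) are all assumptions chosen to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// C++ rendering of the one-dimensional Localizable interface.
// Assumption: update receives the accessing core explicitly.
class Localizable {
public:
    virtual ~Localizable() = default;
    virtual int localize(std::size_t i, std::size_t j) const = 0;
    virtual void update(std::size_t i, std::size_t j, int core) = 0;
};

// Hypothetical user-defined container that remembers, per element, the
// core that accessed it last (-1 means "not recorded").
class TrackedArray : public Localizable {
public:
    explicit TrackedArray(std::size_t n) : values(n, 0), owner_(n, -1) {}

    // Returns the core recorded for most elements in [i, j];
    // -1 if no element of the interval has a recorded core.
    int localize(std::size_t i, std::size_t j) const override {
        std::map<int, std::size_t> votes;
        for (std::size_t k = i; k <= j && k < owner_.size(); ++k) {
            ++votes[owner_[k]];
        }
        int best = -1;
        std::size_t best_count = 0;
        for (const auto& v : votes) {
            if (v.first != -1 && v.second > best_count) {
                best = v.first;
                best_count = v.second;
            }
        }
        return best;
    }

    // Records `core` as the last core that accessed elements i..j.
    void update(std::size_t i, std::size_t j, int core) override {
        for (std::size_t k = i; k <= j && k < owner_.size(); ++k) {
            owner_[k] = core;
        }
    }

    std::vector<int> values;  // payload data of the container

private:
    std::vector<int> owner_;  // last accessing core per element
};
```

Tracking one owner per element is wasteful and only serves readability here; a block-wise locality vector would be the more economical choice for large containers.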

The parallel lambda functions can be used for instance in recursive algorithms as illustrated above. The parallel lambda function can be used for other algorithms as well, for instance for parallel loops or pipeline processing.

With the method according to the present invention, data location information during execution of parallel programs or tasks is derived and utilized automatically. The parallel lambda functions used in the source code allow the compiler unit to generate code that automatically extracts information about data location and memory accesses. In a possible embodiment, the capture lists can be extended to specify relevant regions within regular data structures such as arrays by means of intervals. Further, it is possible to use predefined interfaces for user-defined data structures. With the method according to the present invention, it is possible to automatically insert function and method calls to obtain the required data location information or to update them.

With the method and apparatus according to the present invention, the complexity of the required source code is reduced significantly so that the system is less prone to failures. Moreover, it is simpler to read and maintain the used source code because of the reduced complexity of the source code. Further, the method according to the present invention can be used for any parallel computation of different types such as loops, fork-join, divide-and-conquer etc. By supporting user-defined data structures, the flexibility of the method and apparatus is further increased. By using the method and apparatus according to the present invention, the performance of the computing system is increased significantly because of the use of the data location information.

Claims


1. A method for scheduling tasks to processor cores of a parallel computing system comprising the steps of:

(a) processing a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body of said parallel lambda function and used to derive data location information;

(b) executing the task calling said function body on the processor core which is associated to a memory unit of the parallel computing system where the data of the data structures specified by said capture list is stored, wherein the memory unit is selected or localized on the basis of the derived data location information.

2. The method according to claim 1, wherein the capture list of the parallel lambda function indicates external data structures which are used by the function body of said lambda function.

3. The method according to claim 1 or 2, wherein the parallel lambda function comprises besides the capture list and the function body a parameter list.

4. The method according to one of the preceding claims 1 to 3, wherein the processing of the source code is performed by a compiler unit which generates code to derive the data location information from the capture list and the parameter list of the parallel lambda function whose function body is called by said task.

5. The method according to one of the preceding claims 1 to 4, wherein the data location information derived by the code generated by said compiler unit indicates a storage location of the data stored in the specified data structures.

6. The method according to one of the preceding claims 1 to 5, wherein the parallel lambda function is used by a library function of the parallel computing system.

7. The method according to one of the preceding claims 1 to 6, wherein upon processing of the source code by the compiler unit a localize operation is automatically inserted, wherein the localize operation localizes the storage location of the data stored in the data structures which are specified in the capture list of said parallel lambda function.

8. The method according to claim 7, wherein upon processing of said source code the localize operation is inserted into the argument list of the library function.

9. The method according to claim 7 or 8, wherein the localize operation determines as the storage location the memory unit of said parallel computing system which is associated to the last processor core which had access to the data of the specified data structures.

10. The method according to one of the preceding claims 1 to 9, wherein upon processing of said source code by said compiler unit an update operation is automatically inserted in said parallel lambda function which updates the stored data location information with respect to the storage location of the data stored in said specified data structures.

11. The method according to claim 10, wherein upon processing of said source code the update operation is inserted into the function body of said lambda function.

12. The method according to claim 10 or 11, wherein the update operation stores the number of the last processor core which had access to the data of the specified data structures in a management list or management table to which the localize operation has access.

13. An apparatus for scheduling of tasks to processor cores of a parallel computing system comprising a compiler unit which processes automatically a source code which comprises at least one parallel lambda function having a function body called by a task and having a capture list specifying the data structures accessed in the function body of said parallel lambda function and used to derive data location information, wherein the calling task is executed on the processor core associated to a memory unit which is selected on the basis of the derived data location information and which stores the data of the data structures specified in said capture list of said parallel lambda function.

14. The apparatus according to claim 13, wherein the memory unit comprises a cache memory of a processor within said parallel computing system comprising at least one processor core.

15. A computing system comprising an apparatus according to claim 13, several processors each having at least one processor core and distributed memory units each being associated to a corresponding processor core.