WO2018197695A1 - A computer-implemented method, a computer-readable medium and a heterogeneous computing system - Google Patents
- Publication number
- WO2018197695A1 (PCT/EP2018/060932)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- runtime
- processing unit
- data structure
- segment
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5044—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/52—Program synchronisation; Mutual exclusion, e.g. by means of semaphores
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/48—Indexing scheme relating to G06F9/48
- G06F2209/486—Scheduler internals
Definitions
- Embodiments of the present invention relate to computer-implemented methods, in particular to computer-implemented methods for executing numerical algorithms on heterogeneous computing systems, to a heterogeneous computing system and to computer-readable media.
- General purpose programming languages (GPL) include C/C++, C#, FORTRAN, Pascal and Java.
- Such languages generally work on a low granularity of data, dealing with scalar data as basic data elements and arbitrary control instructions.
- While compiler technology has been successfully improved for single-processor execution, compilation of (GPL) programs for heterogeneous computing systems remains challenging.
- the compiler has to make certain decisions: which processing units to target, how to instruct the processing units to process the data and how to provide the data to the processing units.
- Challenges include the broad diversity of (rapidly changing) computing hardware, the complexity of arbitrary GPL language elements and instruction graphs, the dynamic nature of the data to be processed, and the times at which all constraining information eventually becomes available.
- the compiler often requires significant decision making support from the programmer.
- the parallelization potential of many software programs, e.g. most GPL-based programs, is typically only partly accessible and can only partly be transformed efficiently to heterogeneous systems.
- the automated parallelization approaches used so far may only be worthwhile for relatively large data sizes and large code segments, respectively.
- the method includes initializing a first processing unit of a heterogeneous computing system with a first compute kernel and a second processing unit of the heterogeneous computing system with a second compute kernel. Both the first compute kernel and the second compute kernel are configured to perform a numerical operation derived from a program segment which is configured to receive a first data structure storing multiple elements of a common data type.
- the program segment includes a function meta information including data related to a size of an output of the numerical operation, a structure of the output, and/or an effort for generating the output.
- the function meta information and a data meta information of a runtime instance of the first data structure are used to calculate first expected costs of executing the first kernel on the first processing unit to perform the numerical operation with the runtime instance and to calculate second expected costs of executing the second kernel on the second processing unit to perform the numerical operation with the runtime instance.
- the data meta information includes a runtime size information of the runtime instance and a runtime location information of the runtime instance.
- the method further includes either executing the first compute kernel on the first processing unit to perform the numerical operation with the runtime instance if the first expected costs are lower than or equal to the second expected costs, or executing the second compute kernel on the second processing unit to perform the numerical operation with the runtime instance if the first expected costs are higher than the second expected costs.
- the data meta information may further include a runtime synchronization information of the runtime instance and/or a runtime type information of the runtime instance.
- the term "processing unit" intends to describe a hardware device, in particular a processor, which is capable of performing operations and/or calculations according to instructions typically stored in a unit-specific compute kernel.
- the processing unit may be a single core CPU (central processing unit), a multicore CPU, a CPU with vector extension (such as SSE/AVX), a GPU (graphics processing unit), a CPU integrating at least one GPU, a coprocessor supporting numerical computations, a SIMD accelerator (single instruction, multiple data accelerator), a microcontroller, a microprocessor such as a DSP (digital signal processor), and an FPGA (field-programmable gate array) or a combination of one or more of those devices typically forming a so-called virtual processor.
- the term "effort for generating the output" intends to describe a numerical effort (purely computational costs) for generating the output.
- the respective function meta information may include a numerical effort factor or even several numerical effort factors.
- prior to runtime, the function meta information may be updated in accordance with hardware properties (properties of the processing units).
- the term "expected costs" intends to describe expected overall costs for performing the numerical operation with a runtime instance on a given processing unit at runtime.
- the expected costs typically refer to computational costs on the given processing unit at runtime and may include transfer costs (when the processing unit does not yet have access to the runtime instance).
- the expected costs may be calculated using the function meta information, e.g.
- the term "compute kernel" intends to describe compiled software code which is capable of performing a specific numerical operation with a runtime instance of a data structure capable of storing multiple elements of a common data type when executed by a processing unit.
- the first data structure and/or the runtime instance of the first data structure may be capable of storing a variable number of elements of the respective common data type such as integer, float, complex and the like.
- the first data structure and/or the runtime instance of the first data structure are typically n-dimensional rectilinear data structures.
- the first data structure and/or the runtime instance of the first data structure may be a collection (in the sense of the .NET framework), a vector, an array, a list, a queue, a stack, a graph, a tree or a pointer to such a structure.
- the data structure is assumed to represent a multidimensional array (a vector, a matrix or higher dimensional mathematical data), and the numerical operation is assumed to be an array operation in the detailed description if not stated otherwise.
- rectilinear array and matrix are used synonymously. Rectilinear arrays are often used in scientific or technical algorithms.
- the pixel elements of a display screen or a camera image sensor arranged in rows and columns, respectively, may be stored in a corresponding 2D or 3D rectilinear array.
- the runtime instance of the first data structure stores data representing technical information in various stages of completion being processed throughout the program, such as measurement data, images, sound data, simulation data, control data, state data, and so forth, as well as intermediate results.
- the numerical operation can be executed by the processing unit which is at the particular (runtime) configuration best suited for this purpose. Compared to other approaches this can be achieved with lower overhead. As a result, the heterogeneous computing system can handle data structures of different size (on average) more efficiently compared to other approaches.
- the costs of performing the numeric operation of calculating the sum of the square of all elements of a matrix A can be expressed as a function of the size of A. If the statement is to be executed multiple times, individual invocations may associate individual sizes of A.
- the operation sum(A^2) is efficiently performed on a GPU if A is sufficiently large, in particular for element types supported on the GPU.
- if A contains only a few or even only a single element, a CPU may be better suited, in particular if the CPU but not the GPU already has direct access to or even stores the matrix A at the time the operation is about to be performed.
- the expected costs typically depend on the numeric operation, the size of the runtime instance, the storage place of the runtime instance of the data structure and the (computational) properties of the available processing units (CPU, GPU, ...).
- the approach described herein allows performing the desired actions (in particular collecting information and making decisions) at particularly opportune times, namely at compile time, at start-up time and at runtime.
- the decision about the execution path is made, in particular only made, at runtime and thus at a time when all desired information is available, both for the runtime instance of the data structure(s) and for the hardware (processing units) and the current hardware load. This allows reducing the overhead and automatically adapting to a broad variety of data sizes and hardware configurations.
- a time-consuming simulation (typically several hours) for determining execution times for a given numerical operation on given hardware with different trial data falling into different size classes is not required before deciding at runtime on an execution path based on the size class of the workload data. While such an approach may be efficient for some large data and some small (low-complexity) algorithms, it cannot guarantee the best execution path, as the (multi-dimensional) size grading has to be coarse to avoid even longer simulation times. Further, this approach can take into account neither the hardware load nor the actual storage locations of the workload data at runtime. Even further, a completely new simulation is typically required in this approach if a user amends the program code or the processing hardware is changed.
- the approach described herein is also able to scale up to algorithms of any complexity or even whole programs.
- No a-priori simulation or optimization pass is required for anticipating the optimal execution path for several subsequent program segments at runtime and for the runtime data. While such a-priori simulation or optimization is commonly very time and/or energy consuming, the quality of the solutions (execution path configurations) found this way is limited.
- the number of subsequent segments has to be kept small for simulation / optimization (for a reasonably small size of the parameter space) which leads to coarse-grained segmentation and/or limits the overall program size.
- a fine-grained segmentation is often required for reacting to changed execution paths due to changed workload data values.
- the entangled system "processor and instruction" is unbound and replaced by functions (instructions), which are capable of handling arrays instead of scalar data, are abstract, i.e. not assigned to a concrete processor, and are able to provide cost relevant information (function meta data) ahead of their execution.
- main processor is replaced with the entirety of suitable execution devices.
- a user generated program code may be parsed to identify program segments with respective (known) numerical operations in the program code at compile time. Further, function meta information of the respective numerical operation may be determined and a respective runtime segment including a respective first compute kernel, a respective second compute kernel, and the respective function meta information may be formed at compile time.
- the program code may include several typically subsequent program segments forming a program sequence.
- the term "sequence" intends to describe a user program or a section of the user program comprising program segments.
- the term "program segment" intends to describe software code referring to instructions, typically on the abstract level, for handling input and/or output arguments of one or more data structures capable of storing multiple elements of a respective common data type.
- the "program segment" may include simple language entities such as array functions, array statements, array operators and/or combinations thereof which are identifiable by a compiler.
- the expression 'sin(A) + linsolve(B + 3 * cos(A))', with matrices A and B may be a segment.
- 'sin(A)' and 'linsolve(B + 3 * cos(A))' may be respective segments.
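As a rough illustration of how a parser might locate known array operations in the expression above, the following Python sketch walks the syntax tree of the expression and collects calls to functions from a known set. The set `KNOWN_ARRAY_FUNCTIONS` and the helper `find_array_calls` are hypothetical names introduced here; the source does not prescribe a parsing technique, and Python's `ast` module merely stands in for a real compiler front end.

```python
import ast

# Hypothetical reduced instruction set the compiler is assumed to know.
KNOWN_ARRAY_FUNCTIONS = {"sin", "cos", "linsolve"}


def find_array_calls(expression):
    """Return the names of known array functions called in the expression."""
    tree = ast.parse(expression, mode="eval")
    names = []
    for node in ast.walk(tree):
        # Only plain calls such as sin(A) are considered in this sketch.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in KNOWN_ARRAY_FUNCTIONS:
                names.append(node.func.id)
    return names


print(find_array_calls("sin(A) + linsolve(B + 3 * cos(A))"))
```

Each call found this way is a candidate for its own segment, or the calls may be merged into one segment covering the whole expression, as the text allows either granularity.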
- the term "runtime segment" intends to describe a compiled representation of a program segment.
- the compiled representation is executable on a heterogeneous computing system and typically includes respective compute kernels for each processing unit of the heterogeneous computing system and control code for calculating the expected cost for runtime instances of the data structure(s) and for selecting a processing unit for executing its compute kernel and/or for distributing the workload over two or more processing units using the expected costs. Additionally, further information such as current workload of the processing unit and/or locality of data may be taken into account for selecting the processing unit and/or for distributing the workload.
- Runtime segments may be created, merged and/or revised or even optimized by the compiler and instantiated and updated at runtime.
- Function meta data are typically stored in the runtime segment. Often, function meta information is stored with and may be derived from program segments. The compiler may update and maintain the information when segments are merged, revised and/or optimized and when concrete device capabilities are known at runtime.
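One way to picture a runtime segment is as a small record that bundles one compute kernel per processing unit with the function meta information and a piece of control code. The Python sketch below is only an assumed layout: the class `RuntimeSegment`, its field names and the `dispatch` method are hypothetical, and plain Python functions stand in for compiled kernels.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class RuntimeSegment:
    """Hypothetical compiled representation of one program segment."""
    name: str
    kernels: Dict[str, Callable]      # one compute kernel per processing unit
    function_meta: Dict[str, float]   # e.g. numerical effort factors (assumed form)

    def dispatch(self, expected_costs, data):
        """Control code: run the kernel of the unit with the lowest expected cost."""
        unit = min(expected_costs, key=expected_costs.get)
        return unit, self.kernels[unit](data)


def sum_of_squares(values):
    # Stand-in for a compiled kernel; both units share it in this sketch.
    return sum(v * v for v in values)


segment = RuntimeSegment(
    name="sum(A^2)",
    kernels={"cpu": sum_of_squares, "gpu": sum_of_squares},
    function_meta={"cpu": 1.0, "gpu": 0.1},
)
print(segment.dispatch({"cpu": 3.0, "gpu": 7.0}, [1, 2, 3]))
```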
- the numerical operations in the program code may correspond to items of a reduced set of computing instructions for data structures storing multiple elements of a common data type. This may facilitate efficient generating and/or executing of compute kernels and/or function meta information.
- the program code may also include intermediate portions without numerical operations identifiable by the compiler. These portions may be handled in a traditional way.
- the program segments may first be translated into an intermediate representation, in particular a byte code representation or into a respective representation facilitating the execution on the processing units, e.g. using a compiler infrastructure such as LLVM (Low Level Virtual Machine) or Roslyn.
- the intermediate representations may be compiled into directly executable compute kernels comprising byte code and/or machine instructions.
- suitable processing units of the particular heterogeneous computing system may be determined, and the runtime segments may be updated in accordance with the determined processing units.
- This may include updating the compute kernels, the intermediate representation of the compute kernels and/or the function meta information.
- a CPU kernel template and a GPU kernel template may be updated in accordance with properties of the actual devices.
- numerical effort factors in the function meta information and/or lengths and/or types of vector data types in the intermediate representation of the compute kernels for executing a specific numerical operation on the actual processing units may be updated in the templates.
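The start-up update of a kernel template could look roughly like the following Python sketch, which adjusts a template effort factor once the number of parallel execution lanes of the actual device is known. Dividing the factor by the lane count is an assumption made here for illustration; the source only states that the factors may be updated in accordance with device properties. The function name `update_effort_factor` is hypothetical.

```python
import os


def update_effort_factor(template_factor, parallel_lanes):
    """Scale a template effort factor by the unit's parallel capacity.

    Assumption: effort per element shrinks linearly with the lane count.
    """
    return template_factor / max(1, parallel_lanes)


# Query the actual CPU at start-up; fall back to 1 if the count is unknown.
cpu_cores = os.cpu_count() or 1
cpu_effort_factor = update_effort_factor(1.0, cpu_cores)
print(cpu_cores, cpu_effort_factor)
```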
- processing units may be initialized with the respective compute kernels.
- the function meta information and the data meta information of runtime instances of data structures are used to determine at runtime respective expected costs of executing the compute kernels on the respective processing units.
- a lazy execution scheme may be implemented. This involves delaying potentially time consuming initialization tasks to the time when the result of the initialization is required for execution for the first time.
- the compute kernels of a specific runtime segment for a specific processing unit may be allocated and initialized on the processing unit only once it has been decided for the corresponding runtime segment to execute the corresponding compute kernel on the corresponding processing unit.
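The lazy execution scheme described above can be sketched in Python as a wrapper that builds its kernel only on first use. The class `LazyKernel` and the `build` callback are hypothetical names; a real system would allocate and load compiled code on the device at that point.

```python
class LazyKernel:
    """Defers kernel initialization until the kernel is first executed."""

    def __init__(self, unit, build):
        self.unit = unit
        self._build = build      # potentially expensive initialization step
        self._kernel = None
        self.init_count = 0      # kept only to make the laziness observable

    def __call__(self, data):
        if self._kernel is None:
            # First execution on this unit: initialize now, not earlier.
            self._kernel = self._build()
            self.init_count += 1
        return self._kernel(data)


gpu_kernel = LazyKernel("gpu", lambda: (lambda xs: sum(x * x for x in xs)))
print(gpu_kernel.init_count)   # nothing initialized yet
print(gpu_kernel([1, 2]), gpu_kernel([3]), gpu_kernel.init_count)
```

A unit that is never selected for a segment therefore never pays the initialization cost for that segment's kernel.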
- both computational costs and transfer costs (referring to possible costs related to a transfer of the runtime instance to another processing unit) are taken into account for determining the expected costs.
- a polynomial cost function may be used for determining the expected costs.
- a bi-linear cost function has been found to be suitable for determining the expected costs in many cases.
- Many simple numerical operations have a numerical effort (computational costs) which depends only or substantially on the number of stored elements, such as calculating a function result for all elements individually. Accordingly, the number of elements may simply be multiplied with a respective numerical effort factor representing the parallel compute capabilities of the processing unit to calculate expected computational costs.
- expected transfer costs may be calculated by multiplying the number of elements to be copied with a transfer factor (or zero if the processing unit has direct access to the stored elements) and added to the expected computational costs to calculate the expected (total) costs for performing the respective numerical operation under the current circumstances on each of the suitable processing units.
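As a minimal sketch of the linear cost model described above, the following Python snippet multiplies the element count by a per-unit numerical effort factor and adds a transfer term that is zero when the unit already has direct access to the data. The unit names and all factor values are hypothetical and chosen only for illustration.

```python
# Hypothetical per-element factors; real values would come from the
# function meta information and the hardware properties.
EFFORT_FACTOR = {"cpu": 1.0, "gpu": 0.1}
TRANSFER_FACTOR = 2.0  # assumed cost of copying one element between units


def expected_costs(n_elements, location):
    """Return the expected total costs per unit for an element-wise operation.

    `location` names the unit that currently holds the runtime instance.
    """
    costs = {}
    for unit, effort in EFFORT_FACTOR.items():
        compute = n_elements * effort
        # No transfer cost if the unit already has direct access to the data.
        transfer = 0.0 if unit == location else n_elements * TRANSFER_FACTOR
        costs[unit] = compute + transfer
    return costs


for location in ("cpu", "gpu"):
    costs = expected_costs(1000, location)
    print(location, costs, min(costs, key=costs.get))
```

With these assumed factors the cheaper unit is simply the one that already holds the data, which shows how the storage location can outweigh raw compute capability.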
- the expected computational costs may also depend on the size of the output of the numerical operation, the structure or shape of the output or a value, the structure and/or the shape of the runtime instance of the input.
- different numerical algorithms may be used depending on properties of the input data (sparse matrix vs. dense matrix, single vs. double floating point precision, matrix symmetry, size etc.)
- information related to the output of the numerical operation may be derived by a compiler from meta data of the program segment.
- the program segments may be further configured to receive a second data structure capable of storing multiple elements of the common data type or another common data type.
- a program segment for calculating a matrix-vector product may be configured to receive a matrix and a vector.
- a program segment for adding two matrices may be configured to receive two matrices.
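The output-size part of the function meta information for the two examples above might be expressed as small shape functions, as in this Python sketch. The function names are hypothetical, shapes are represented as tuples, and broadcasting is deliberately left out.

```python
def matvec_output_shape(matrix_shape, vector_shape):
    """Output shape of a matrix-vector product, derived from the input shapes."""
    rows, cols = matrix_shape
    (length,) = vector_shape
    if cols != length:
        raise ValueError("inner dimensions do not match")
    return (rows,)


def add_output_shape(a_shape, b_shape):
    """Output shape of an element-wise matrix addition (no broadcasting)."""
    if a_shape != b_shape:
        raise ValueError("shapes must match")
    return a_shape


print(matvec_output_shape((3, 4), (4,)))
print(add_output_shape((2, 2), (2, 2)))
```

Knowing the output shape before execution lets the cost calculation account for the size of the result without running the operation.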
- the computer-readable medium includes instructions which, when executed by a computer comprising a first processing unit and a second processing unit, cause the computer to carry out the method as described herein.
- the computer is typically a heterogeneous computer having two or more different processing units (i.e. at least the first and second processing units), such as a CPU and a GPU.
- the different processing units are different to each other with respect to computational properties and/or capabilities, typically resulting from different physical properties.
- two different processing units may also have the same computational properties and/or capabilities but are differently utilized (have a different workload) at the time when a program segment is to be executed.
- the computer-readable medium is typically a non-transitory storage medium.
- the heterogeneous computing system includes at least two different processing units and a memory storing a runtime segment of a program segment configured to receive a first data structure capable of storing multiple elements of a common data type.
- the program segment includes, provides and/or allows determining (by a compiler) a function meta information including data related to a size of an output of a numerical operation of the program segment, a structure of the output, and/or a numerical effort for generating the output.
- Each of the at least two different processing units has access to and/or forms at least a part of the memory storing a respective compute kernel of the runtime segment for each of the at least two processing units.
- Each compute kernel implements the numerical operation.
- the runtime segment includes or refers to executable code for determining a data meta information of a runtime instance of the first data structure.
- the data meta information includes at least one of a runtime size information of the runtime instance, a runtime location information of the runtime instance, a runtime synchronization information of the runtime instance and a runtime type information of the runtime instance.
- the runtime segment further includes or refers to executable code which is configured to use the function meta information and the data meta information to calculate for each of the at least two processing units respective expected costs of executing the respective compute kernel to perform the numerical operation with the runtime instance.
- the runtime segment further includes or refers to executable code for selecting one of the at least two different processing units for executing the respective compute kernel so that the determined expected costs of the selected one of the at least two processing units correspond to the lowest value of the determined expected costs.
- a sequence of runtime segments may be stored in the memory.
- the executable code for calculating the expected cost and the executable code for selecting the one of the at least two different processing units may be included in a control kernel of the runtime segment.
- the heterogeneous computing system includes a base processor that may form a processing unit and is coupled with a first processing unit and a second processing unit of the at least two processing units.
- the base processor is typically configured to perform control functions, in particular to determine properties of the processing units, to determine processing units of the heterogeneous computing system which are suitable for performing the numerical operation, to update the function meta information in accordance with the determined properties of the processing units such as number and/or computational properties of cores of a CPU and shaders of a GPU, to load the respective compute kernels to the processing units, to (numerically) determine the expected costs (at runtime), to select the (desired) processing unit based on the determined expected costs, and/or to initiate executing the respective kernel on the selected processing unit.
- the base processor may host and execute a control kernel.
- the base processor is also referred to as control processor.
- one, more or even all of the control functions may be performed by one of the processing units.
- the control functions may also be distributed among the processing units and the base processor.
- the control kernel may have sub-kernels running on different hardware units.
- the base processor may be one of the processing units and act as one of the processing units, respectively, even if the computational power is low compared to the (other) processing units of the heterogeneous computing system.
- the base processor may perform the numerical operation when the number of elements stored in the runtime instance of the (first) data structure is comparatively low.
- Each of the processing units may be formed by a single core CPU (central processing unit), a multicore CPU, a CPU with vector extension (such as SSE/AVX), a GPU (graphics processing unit), a CPU integrating at least one GPU, a coprocessor supporting numerical computations, a SIMD accelerator (single instruction, multiple data accelerator), a microcontroller, a microprocessor such as a DSP (digital signal processor), and an FPGA (field-programmable gate array) or by a combination of one or more of those devices typically forming a so-called virtual processor.
- the heterogeneous computing system is typically a heterogeneous computer including a host controller having a host processor typically forming the base processor.
- the heterogeneous computer further includes a CPU and one or more GPUs each forming a respective processing unit.
- the heterogeneous computing system may also be a workstation, i.e. a computer specially designed for technical, medical and/or scientific applications and/or for numeric and/or array-intensive computations.
- the workstation may be used to analyze measured data, simulated data and/or images, in particular simulation results, scientific data, scientific images and/or medical images.
- the heterogeneous computing system may however also be implemented as grid or cloud computing system with several interconnected nodes each acting as a processing unit, or even as an embedded computing system.
- Figure 1 illustrates a computer-implemented method according to an embodiment
- Figure 2 illustrates a computer-implemented method according to an embodiment
- Figure 3 illustrates a computer-implemented method according to embodiments
- Figure 4 illustrates method steps of a computer-implemented method and a heterogeneous computing system according to an embodiment
- Figure 5 illustrates method steps of a computer-implemented method according to embodiments
- Figure 6 illustrates method steps of a computer-implemented method according to embodiments
- Figure 7 illustrates method steps of a computer-implemented method according to an embodiment
- Figure 8 illustrates method steps of a computer-implemented method according to an embodiment
- Figure 9 illustrates method steps of a computer-implemented method according to an embodiment
- Figure 10 illustrates method steps of a computer-implemented method according to an embodiment.
- processing units of a heterogeneous computing system are initialized with respective compute kernels of a runtime segment compiled from a (higher level) program segment configured to receive a first data structure capable of storing multiple elements of a common data type. This may include copying the compute kernel to a memory of the respective processing unit or a shared memory of the heterogeneous computing system, and/or allocating the compute kernels on the respective processing unit.
- Each compute kernel may be configured to perform a numerical operation derived from the program segment.
- the runtime segment typically includes or has access to a function meta information including data related to and/or representing a size of an output of the numerical operation, a structure of the output, and/or an effort for generating the output.
- the function meta information and a data meta information of a runtime instance of the first data structure are used to determine, typically to calculate, respective expected costs of executing the compute kernels on the processing units.
- the data meta information may include a runtime size information of the runtime instance, a runtime (storage) location information of the runtime instance, and/or a runtime type information of the runtime instance.
- a runtime instance of the data structure stored in a memory of one of the processing units or a shared memory of the heterogeneous computing system may be accessed by the control code of the runtime segment.
- the expected costs may be used to decide on which of the processing units the respective compute kernel is executed to perform the numerical operation with the runtime instance.
- the runtime instance may be copied to the processing unit selected for execution if the selected processing unit does not yet have access to the runtime instance.
- the compute kernel may be executed on the selected processing unit.
- Method 2000 is similar to the method 1000 explained above with regard to Fig. 1.
- a compile phase I may be used prior to the start-up phase II for identifying program segments in a program code and to compile runtime segments for the identified program segments in a block 2100.
- processing units which are suitable for performing the numerical operation derived from the identified program segments are typically determined in a block 2200 of the start-up phase II of method 2000.
- in a subsequent block 2300 of the start-up phase II, the determined processing units are initialized with respective compute kernels of the runtime segments.
- data meta information of a runtime instance of the first data structure is determined. This may include determining a dimension of the first data structure, determining a number of stored elements in each of the dimensions of the first data structure, determining a number of stored elements in the first data structure, determining the data type of the elements, determining a type information of the runtime instance, and/or determining a runtime location information of the runtime instance.
- the latter may be facilitated if the runtime location information is stored within the runtime instance of the first data structure.
- data meta information may be determined for all runtime instances in block 2400.
- the function meta information and the data meta information of the runtime instance(s) are used to determine respective expected costs of executing the kernels on the processing units.
- the expected costs may be used to select a processing unit for executing the respective kernel to perform the numerical operation with the runtime instance(s).
- the (respective) compute kernel may be executed on the selected processing unit in a block 2700.
- the blocks 2400 to 2700 may be executed in a loop if several runtime segments are to be executed.
- out of order execution schemes may be implemented.
- Method 3000 is similar to the method 2000 explained above with regard to Fig. 2.
- Method 3000 starts with a block 3110 for identifying a sequence with program segments in a program code.
- a parser may be used to search for numerical operations in the program code.
- several subsequent runtime segments may be created.
- Each runtime segment may be configured to receive one or more data structures storing multiple elements of a respective common data type that may be the same or not.
- each runtime segment may include a respective function meta information comprising data related to a size of one or more outputs of the numerical operation, a structure or shape of the output(s), and/or an effort for generating the output(s).
- each program segment may be translated into an intermediate representation. This may be done using a compiler infrastructure.
- Each runtime segment may include the respective intermediate representation and respective function meta information referring to a respective numerical operation to be performed with the runtime instances of one or more data structures.
- the intermediate representation may also be created from or using meta data stored with or referring to the sequence or the corresponding program segments.
- processing units which are suitable for performing the numerical operation derived from the identified program segments are typically determined.
- the intermediate representations may be compiled into respective machine language representations (compute kernels). This may be achieved using a just-in-time compiler, which may or may not be specialized for the available corresponding processing unit and which may facilitate better customizing the machine language and/or the intermediate representation to the properties of the available processing units.
- the determined processing units are initialized with respective compute kernels formed in block 3270.
- data meta information of a runtime instance(s) of data structure(s) may be determined.
- the function meta information and the data meta information of the runtime instance(s) may be used to determine respective expected costs of executing the kernels on the processing units.
- the expected costs determined in block 3500 may be used to select a processing unit for executing the respective kernel to perform the numerical operation with the runtime instance(s).
- the compute kernel may be executed on the selected processing unit in a block 3700.
- the dashed arrow in Fig. 3 indicates that the blocks 3400 to 3700 may be executed in a loop if several runtime segments are to be executed.
- block 3600 is replaced by a block in which the expected costs are used to distribute the workload of executing the compute kernel and the numerical operation of the compute kernels, respectively, over at least two processing units which may perform the respective numerical operations on respective portions of the runtime instance in a replacement block for block 3700.
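- As a non-limiting illustration, distributing the workload over at least two processing units may be sketched in Python as follows. The function names `split_workload` and `run_distributed`, the unit names and the per-element cost figures are hypothetical; the split that is inversely proportional to the per-element cost is an assumed policy, not one prescribed by the method.

```python
def split_workload(numel, cost_per_element):
    """Split `numel` elements over units, inversely proportional to each unit's per-element cost."""
    speeds = {unit: 1.0 / c for unit, c in cost_per_element.items()}
    total = sum(speeds.values())
    shares = {unit: int(numel * s / total) for unit, s in speeds.items()}
    # Elements lost to rounding are assigned to the fastest unit.
    fastest = max(speeds, key=speeds.get)
    shares[fastest] += numel - sum(shares.values())
    return shares


def run_distributed(data, kernel, cost_per_element):
    shares = split_workload(len(data), cost_per_element)
    result, start = [], 0
    for unit, count in shares.items():
        # Each unit would process its own portion; here the kernel is simply called in turn.
        result.extend(kernel(data[start:start + count]))
        start += count
    return shares, result


shares, result = run_distributed(
    list(range(10)),
    kernel=lambda part: [x * 2 for x in part],
    cost_per_element={"cpu": 4.0, "gpu": 1.0},  # assumed expected costs
)
```

- With these assumed costs the slower unit receives two elements and the faster unit eight, so that both portions are expected to finish at about the same time.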
- Figure 4 illustrates exemplary components of a heterogeneous computing system implemented as a heterogeneous computer 70 and typical processes of executing a sequence of runtime segments at a runtime phase III of a computer-implemented method 4000.
- the heterogeneous computer 70 is equipped with a CPU 71 also forming a main or host processor of a host controller 73 and a GPU 72.
- a complex user program is partitioned into sequences which are segmented into segments.
- the segments may be continuously lined up in the order derived from the sequence.
- each processing unit 71, 72, 73 found to be capable of handling a sequence has been initialized for handling all runtime segments 30 to 34 of the sequence.
- the processing units are also referred to as devices (device 3, device 2, and device 1 respectively).
- compute kernels 150 to 154 of the same runtime segments were compiled and allocated on the GPU 72 and form an executable representation 112 of the original sequence on the GPU 72.
- scalar compute kernels 130 to 134 representing a scalar version of the code of the segments may be prepared for each runtime segment.
- Scalar compute kernels 130-134 mostly run on the CPU, using the same host language the program was provided in.
- the scalar code kernels 130 to 134 form a representation 100 of the original sequence on the CPU but are adapted, typically optimized for scalar input data and/or sequential processing of - typically relatively small - instances of input data structures.
- Each of the runtime segments 30-34 may store a reference to the corresponding compiled and allocated compute kernels 130-154 (at least one for each of the determined devices) which can be used to trigger its execution at any time.
- all segments of a sequence may store one reference to an executable version of the program segments for each device supported.
- each segment is able to execute its code in three different ways with very low overhead: one version for scalar data (for device 1 and the host controller 73, respectively), one array version for running on the CPU (71 and device 2) and one version for running on the GPU (72 and device 3).
- Each segment 30 - 31 may implement a control code 50 which is used to switch the execution path at runtime to select one of the prepared execution paths.
- the execution path is illustrated in Fig. 4 as thick dashed arrow.
- the decision which path is to be used for execution is typically based on multiple factors, like the data size, data shape, and data locality, the computational costs of the segments' individual execution paths, the properties of the devices, the current utilization of the devices, heuristics or measurement data about additional overhead associated with each device, heuristics about fixed or dynamic thresholds associated with data sizes.
- the decision which path is to be used may further depend on previous and/or successive segments, and the data locality configuration corresponding to a lowest execution time for these segments. Associated information may also be obtained by means of historical usage data of the segment and/or by speculative processing analysis and/or evaluation.
- the best execution path may be found to be the one which finishes the execution of the segment code earliest or which uses lowest power during the execution. This is explained in more detail below.
- the execution path of the sequence may switch quickly between any of the prepared devices.
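- As a non-limiting illustration, the selection of an execution path that is expected to finish earliest may be sketched in Python as follows. The cost model (fixed overhead plus optional transfer plus compute time), the dictionary keys and all numbers are assumptions made for this sketch only; only data size, data locality and device overhead are considered here.

```python
def finish_time(device, numel, located_on):
    """Assumed model: launch overhead + optional data transfer + compute time."""
    needs_copy = device["name"] not in located_on
    transfer = device["transfer_per_element"] * numel if needs_copy else 0.0
    return device["overhead"] + transfer + device["compute_per_element"] * numel


def choose_path(devices, numel, located_on):
    # Pick the prepared execution path that is expected to finish earliest.
    return min(devices, key=lambda d: finish_time(d, numel, located_on))["name"]


devices = [
    {"name": "cpu", "overhead": 1.0, "transfer_per_element": 0.0, "compute_per_element": 1.0},
    {"name": "gpu", "overhead": 50.0, "transfer_per_element": 0.5, "compute_per_element": 0.5},
]

small = choose_path(devices, numel=10, located_on={"cpu"})
large_remote = choose_path(devices, numel=1000, located_on={"cpu"})
large_local = choose_path(devices, numel=1000, located_on={"cpu", "gpu"})
```

- In this sketch the same large input is routed to the GPU only when its data already resides there, which shows why data locality enters the decision alongside data size.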
- Each array object (runtime instance) A may track the locations (device memory) its data has been copied to.
- its data are typically copied to the device memory and a reference 20 - 22 to the array A is typically stored with the array A. In doing so, data transfer between the devices can be minimized when the array A is used several times as input argument.
- When an array is used as result storage for a segment on a device, this device may be marked in the array as the only device the array is currently stored on. Any other device locations the array may have been copied to are cleared from a list of array locations of the array.
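- As a non-limiting illustration, the location tracking of an array object may be sketched in Python as follows. The class `TrackedArray`, its methods and the device names are hypothetical and serve only to show the bookkeeping; no actual device memory is involved.

```python
class TrackedArray:
    """Hypothetical array object that records which devices hold a valid copy of its data."""

    def __init__(self, data, home="host"):
        self.data = list(data)
        self.locations = {home}   # devices currently holding valid data
        self.transfers = 0        # number of copies performed, for illustration

    def use_as_input(self, device):
        # Copy only if the device does not already hold the data.
        if device not in self.locations:
            self.transfers += 1
            self.locations.add(device)

    def use_as_output(self, device, new_data):
        # The writing device becomes the only valid location; all others are cleared.
        self.data = list(new_data)
        self.locations = {device}


a = TrackedArray([1, 2, 3])
a.use_as_input("gpu")   # first use: one transfer
a.use_as_input("gpu")   # second use: data already present, no transfer
locations_after_reads = set(a.locations)
a.use_as_output("gpu", [2, 4, 6])
```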
- a user program 200 is analyzed. Sequences of the program 200 are identified which allow a determination of execution cost at runtime. For sake of clarity, the exemplary user program has only one sequence 210.
- Sequences may include or even consist of simple instructions dealing with scalar data 205.
- a compiler may derive cost information based solely on the executable code generated for the instruction.
- the sequence 210 may have more complex instructions, potentially dealing with more complex numerical operations and/or one or more data structures 10 capable of storing multiple elements of respective common data type.
- data structures may be any predefined array data type available to users of the language used to write the program 200 in.
- the data structure may originate from a library or a stand-alone module or an external module (dynamic link library/ statically linked library) which is imported to the program.
- Such data structures are typically able to store multiple elements of the same or similar data type and accompanying information about the data structure, like the number of elements, the number and lengths of individual dimensions in the data structure (as the 'shape' for arrays and rectilinear arrays), the type of elements for such languages supporting types and locality information (storage place(s)).
- the data structure 10 may encapsulate a more general data structure to enable adding necessary information about the general data structure and making it available as a common data structure 10 if the general data structure does not provide this information naturally.
- At least parts of the user program 200 are composed out of instructions performing numerical operations on data structure(s) 10. Such instructions are either known to the compiler or provide a way to become recognized by the compiler in an automated way.
- One example for such instructions is the set of functions in technical languages such as Matlab (The MathWorks Inc.), numpy (numpy.org), ILNumerics (ILNumerics GmbH), Julia (julialang.org), Fortran, Scilab (scilab.org), Octave, FreeMat (freemat.sf.net), and Mathematica (Wolfram Research) working on array data.
- Such languages may provide their own compiler and development infrastructure.
- the language might be realized as a domain specific language extension (DSL) of another, typically more general language (GPL), which may or may not exploit, in full or in part, the compiler and development tools of the GPL.
- the data structure 10 may be realized as a rectilinear array of arbitrary size and dimensionality. Some of these functions take one or more arrays of elements (multivalued input parameter) and perform their operation on many or even all elements of the parameter(s), as 'map' (performing a function on the elements), 'sin' (computing the sine of the elements), 'add' (adding elements of two arrays). Other functions may create one or more new array data based on zero or more array or scalar input parameters, like 'zeros' (creating an array of zero-valued elements), 'ones' (creating arrays of 1-valued elements), 'rand' (creating arrays of random valued elements), 'clone' (creating a copy of another array).
- Typical functions perform aggregation, like 'reduce' (performing an aggregation function on a dimension of an array), 'sum' (summing the elements along a specific dimension), 'max' (computing the maximum value along a dimension).
- Other functions may create subarrays from data structures or perform concatenation, like, for example, 'vertcat' and 'concat' (concatenating arrays along a specific dimension). Further, more complex operations may be used, e.g.
- Functions may utilize additional supplementary parameters, e.g. to control the operations within.
- One example is the 'sum(A,1)' function, where '1' is a supplementary scalar parameter determining the index of the dimension of the input array A along which the contained elements are summed.
- Some functions produce a single output array of the same type and the same size as the input array(s).
- Other functions may produce multiple outputs, and some of the output arguments may differ in size or type, e.g. the broadcasting binary function 'add', 'find' (locating nonzero elements), 'max(A,1,I)' (locating elements with the maximum value along a dimension in A and also giving the indices of such elements).
- Yet other functions may generate new arrays from scratch or by extracting parts of and/or modifying parts of input array(s).
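- As a non-limiting illustration, the different categories of array functions and their effect on the output size may be sketched in Python with nested lists standing in for two-dimensional arrays. The helper names and the zero-based dimension index are assumptions of this sketch and do not reproduce the interface of any of the languages named above.

```python
def shape(a):
    """Shape of a rectangular 2-D nested list as (rows, columns)."""
    return (len(a), len(a[0]) if a else 0)


def map_abs(a):
    # Map-type function: the output has the same shape as the input.
    return [[abs(x) for x in row] for row in a]


def sum_along(a, dim):
    # Aggregating function: the summed dimension collapses to length 1.
    if dim == 0:
        return [[sum(col) for col in zip(*a)]]
    if dim == 1:
        return [[sum(row)] for row in a]
    raise ValueError("dim must be 0 or 1 for a 2-D array")


def zeros(rows, cols):
    # Creational function: a new array is built from scalar parameters only.
    return [[0] * cols for _ in range(rows)]


A = [[1, -2, 3], [-4, 5, -6]]
```

- Here `map_abs(A)` keeps the shape (2, 3), whereas `sum_along(A, 1)` yields an array of shape (2, 1), so the supplementary scalar parameter determines the size of the output.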
- Another example of identifiable instructions in the user program 200 is the set of functions similarly provided as runtime supplement for languages like FORTRAN, C/C++, Perl, C#, Visual Basic, F#, python, Javascript.
- One such exemplary set of functions is described by the static functional members of the .NET CLI class System.Math. It contains, but is not limited to, the range of functions whose names start with: Abs, Acos, Asin, Atan, Atan2, Ceiling, Cos, Cosh and end with Sin, Sinh, Sqrt, Tan, Tanh, Truncate.
- the compiler of the method described may detect common language looping construct patterns in the user code (for, while, goto loops) in order to transform iterations over elements of arrays into single array function calls, suitable to be processed into program segments according to the method described.
- One embodiment of the disclosed method may be implemented in a dynamically typed language (e.g. duck typing) such as Python, Julia or MATLAB. However, better runtime performance is expected from an implementation utilizing a statically typed language such as C# or Java. Embodiments implemented in a statically typed language and embodiments implemented in a dynamically typed language are both applicable to both categories of programs: statically typed programs and dynamically typed programs. If the program is statically typed and the disclosed method is implemented in a dynamically typed language, type information about the data and functions of the program may be omitted.
- a compiler may inspect the program code and infer the type of data structure elements and program segments from program code expressions (e.g. constants), markups (attributes, type declarations) found in the code or in compiled binary representations of the program segments or in supplementing materials, from a list of default types available to the compiler and/or from consecutive program segments or other program context.
- the sequence 210 of instructions may have been created by the user.
- the compiler may combine individual instructions of the sequence 210 to form program segments 220 - 222 of known instructions.
- the primary goal of such grouping (combining) is to improve execution performance.
- the numeric costs of the program segments 220 may automatically be derived by the compiler based on the cost functions of the individual instructions grouped into the program segment as described below.
- a program segment 220 may contain a single array instruction only.
- the sequence 210 consists of three program segments 220 - 222.
- the current program sequence 210 may be ended and a new program sequence is started with the next identifiable instruction.
- the sequence 210 consists of the full program line of code in the user program 200, with two input arrays A, B and one supplementary scalar input i:
- assignments may be excluded from or multiple assignments may be included in program segments. This allows the program segments to span an arbitrary number of instructions, instruction lines and instruction sections.
- program segments may or may not include control structures and/or looping constructs. In any such cases a sequence boundary may be created once the compiler identifies one of the restricted language constructs.
- marks in the user program 200 may also indicate boundaries of sequences or program segments. Such marks may be created by the user or by an automated process. For example, a pre-processor might analyze the program code 200 and create and store marks with the program to indicate hints to the subsequent compilation steps.
- the user may edit, create, use and/or store marks, for example in the form of language attributes to provide information to the compiler to support or to control the identification, segmentation of and/or the extraction of required metadata about a user provided function or program.
- the compiler may however also ignore some or even all marks.
- the process of dividing the program 200 into sequences 210 and dividing sequences 210 into program segments 220 -222 may involve the creation of a graph data structure representing the sequence to be identified.
- the tree may be created as an abstract syntax tree (AST) in which case it holds all information necessary to identify data dependencies and program flow of the instructions and corresponding data in the sequence.
- This information in the graph together with accompanying information such as metadata of the functions may also be used to decide for program segment boundaries and/or to compute program segment costs, program segment parameter set, and/or program segment output size.
- the compiler may try to increase the size of program segments by iteratively combining neighboring instructions in the program code until at least one certain limit is reached and the addition of further instructions to the new segment is stopped.
- such a limit may be established by a compiler specific rule (heuristic) or compiler specific limit relating to the complexity of the combined operation for the new segments, the number and/or type or structure of input / output parameters of the new segment, the ability of the compiler logic to create / derive required function meta information (as function size and function effort) as a functional of the properties of input / output data structures of the segment based on meta information of the combined instructions, and/or the ability to create an intermediate representation of a compute kernel representing the combined numerical operation from the individual instructions.
- Another way to define program segment borders with combined instructions is to categorize the set of known instructions into instructions which do not cause a change of size for any output parameter in relation to the input parameter sizes (map-type instructions) and further instructions which potentially do cause a change of the size of at least one output parameter in relation to the input parameter sizes (reduce-type instructions, subarray, and creational instructions).
- a program segment includes at most one size changing instruction including any number of known map-type instructions, derived from direct neighbors of the size changing instruction in the user program code.
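- As a non-limiting illustration, placing segment borders so that each program segment holds at most one size changing instruction may be sketched in Python as follows. The two instruction sets and the greedy grouping are assumptions of this sketch; an actual compiler would operate on a graph of instructions rather than on a flat list of names.

```python
MAP_TYPE = {"abs", "sin", "cos", "sqrt"}             # assumed: output size equals input size
SIZE_CHANGING = {"sum", "max", "zeros", "concat"}    # assumed: output size may differ


def segment(instructions):
    """Greedily group instructions; each segment holds at most one size changing instruction."""
    segments, current, has_size_change = [], [], False
    for name in instructions:
        changes_size = name in SIZE_CHANGING
        if name not in MAP_TYPE and not changes_size:
            raise ValueError(f"unknown instruction: {name}")
        if changes_size and has_size_change:
            # A second size changing instruction starts a new segment.
            segments.append(current)
            current, has_size_change = [], False
        current.append(name)
        has_size_change = has_size_change or changes_size
    if current:
        segments.append(current)
    return segments


result = segment(["abs", "sum", "sqrt", "max", "sin", "cos", "zeros"])
```

- For the exemplary instruction list, three segments result: the map-type neighbors 'abs' and 'sqrt' join 'sum', 'sin' and 'cos' join 'max', and 'zeros' forms a segment of its own.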
- a method of supporting the decision for program segment border placement which a person experienced in the art will find especially helpful is to limit the complexity (cost) of determining the effort to execute the computing kernels of a segment with instances of input /output data structures at runtime for each processing unit.
- such effort is computed by functionals derived from the segment function meta information which are created / adopted when new instructions are added to a program segment at compile time during segmentation (see below).
- the complexity of such effort functions is typically not increased when map-type computing instructions are added to the segment. Other types of instructions may lead to a more complex cost function, requiring higher effort for evaluation at runtime.
- the compiler may place a segment border whenever the addition of a new instruction would increase the complexity or the effort for evaluating the cost of a program segment.
- a semantic analysis may be performed on the program or on parts thereof.
- the goal of such analysis may be to determine the type, or other information of the instruction and/or of the data the instruction is handling.
- simple text matching may be performed on tokens of the program 200.
- Semantic analysis may also be used to infer data types when the program 200 does not provide sufficient type information directly.
- One example refers to the collection of dynamically or weakly typed languages and/or structures which are common for technical prototyping languages.
- a runtime segment 240 is typically created for each program segment identified.
- the runtime segment(s) 240 typically store(s) all information required to execute the numerical operation of the program segment 220 on the processing units (71, 72, 73) at runtime.
- the runtime segment 240 is realized as a compiler generated class definition 249 which is going to be instantiated at startup time II.
- the class stores the intermediate representation of the numerical operation for supported device categories (kernels 241, 242, 243), argument adaptors 245 facilitating the determination, caching and loading of array arguments for the segment's computational kernels 241 - 243, the function meta information 248 for the (combined) instructions of the segment and a device selector code 247 which contains code performing the determination of processing costs and the selection of the processing unit at runtime.
- the device selector 247 may serve as the entry point of the runtime segment 240, 249 when the program is executed.
- the device selector 247 may provide a function interface supporting the same number and the same types of input and output arguments as the program segment 220. This enables the compiler to use the device selector 247 as a direct substitution for the instructions the program segment 220 is made out of.
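- As a non-limiting illustration, a runtime segment whose device selector mirrors the interface of the program segment may be sketched in Python as follows. The class `RuntimeSegment`, the device names and the cost functions are hypothetical; for simplicity every "kernel" reuses the same Python function instead of device-specific code.

```python
def program_segment(a, b):
    """Original instructions of the program segment: abs(a) + b, elementwise."""
    return [abs(x) + y for x, y in zip(a, b)]


class RuntimeSegment:
    """Hypothetical runtime segment whose call interface mirrors the program segment."""

    def __init__(self, kernels, cost_functions):
        self.kernels = kernels                # device name -> compiled kernel (callable)
        self.cost_functions = cost_functions  # device name -> cost as a function of numel
        self.last_device = None

    def __call__(self, a, b):
        # Device selector: entry point with the same arguments as the program segment.
        numel = len(a)
        device = min(self.kernels, key=lambda d: self.cost_functions[d](numel))
        self.last_device = device
        return self.kernels[device](a, b)


runtime_segment = RuntimeSegment(
    kernels={"scalar": program_segment, "cpu": program_segment, "gpu": program_segment},
    cost_functions={            # assumed cost models
        "scalar": lambda n: 2.0 * n,
        "cpu": lambda n: 50.0 + 0.5 * n,
        "gpu": lambda n: 5000.0 + 0.01 * n,
    },
)

# The call site keeps its shape: runtime_segment(a, b) replaces program_segment(a, b).
out = runtime_segment([-1, 2, -3], [10, 20, 30])
```

- Because the callable accepts the same arguments and returns the same result as the replaced instructions, the substitution does not require changes to the surrounding program code.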
- substitution may be realized via code injection by modification of the program code 200 or by modification of a compiled assembly (executable binary) derived from the user program 200.
- runtime segments may be realized as a separate module, assembly, library, remote service or local service.
- runtime segments code might be added to the user program 200.
- the startup phase II may be implemented and triggered by instantiating the runtime segment 249 as a static context variable in the program.
- information about the available processing units (73, 71, 72) is collected, the intermediate representations 241, 242, 243 of the program segment 220 are adjusted for each supported processing unit (73, 71, 72), and the computing kernels are compiled and loaded to the processing units (73, 71, 72).
- some of the initialization steps for some of the resources described above may be delayed to the time the utilization of the corresponding resources is required for the first time during the execution.
- the cost of executing each individual segment is determinable at runtime.
- the compiler does not attempt to determine the actual cost of an instruction at compile time or prior to allocating the segment on devices (processing units) of the heterogeneous computing system. Instead, all devices are similarly equipped with executable runtime segment(s) and the expected costs are determined at runtime, typically right before execution.
- the expected cost of a runtime segment is typically evaluated based on a cost function taking into account the intrinsic cost of the instructions' executable representation on a specific processing unit and the concrete data dealt with by the numerical operation derived from the instructions of the program segment.
- Intrinsic segment costs may be derived from the cost of each single instruction in the program segment utilizing the graph of instructions of the corresponding sequence 210.
- the costs of individual instructions may be combined in a suitable manner to represent the intrinsic cost of the whole segment in a way enabling efficient evaluation at runtime.
- Such a representation 248 may be stored with a runtime segment 240 and used at runtime to determine the actual cost of executing the runtime segment compiled from the program segment 220 for the concrete input data structure 10.
- the computational cost of a runtime segment is a function of such input arguments and may be implemented in different ways.
- the compiler may utilize several ways to provide the cost function to the runtime segment, some of which are described below.
- the compiler 250 will pick a cost function implementation which is expected to result in best execution performance or lowest power consumption at runtime. Such a selection often corresponds to the least amount of information that has to be obtained at runtime.
- the cost function can be determined in terms of the number of operations performed to produce a single element C in the output.
- the overall cost may then be the result of multiplying the cost of a single element C with a factor N, where N corresponds to the number of output elements (to be) produced or a certain power thereof, depending on the effort of the instruction.
- N would equal the number of elements in the input array A.
- N would be equal to the square of the number of elements (n) in A.
- Other instructions may have higher or lower efforts, leading to other powers, including non-integer powers.
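- As a non-limiting illustration, the overall cost as the product of a per-element cost C and a power of the number of elements may be sketched in Python as follows. The function name `instruction_cost` and the numerical values are assumptions of this sketch.

```python
def instruction_cost(cost_per_element, n, power=1.0):
    """Overall cost = C * N, with N = n ** power (n: number of elements)."""
    return cost_per_element * n ** power


n = 1000
linear = instruction_cost(2.0, n)            # N equals the number of elements
quadratic = instruction_cost(2.0, n, 2.0)    # N equals the square of the number of elements
fractional = instruction_cost(2.0, n, 1.5)   # non-integer power
```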
- the computational cost for a single element C can be acquired from lookup tables or caches, be determined by measurements, for example on the concrete hardware used to execute the instruction or by analyzing the compute kernel of a runtime segment resulting from a compilation of the instruction for a specific processing unit (CPU, GPU etc.) of a heterogeneous computing system.
- the cost function may depend not only on the number of elements in the output result but also on the number of dimensions and/or the structure (i.e. the lengths of individual dimensions) of at least one input argument (i.e. runtime instance(s) of respective data structures), the value of elements in at least one input argument, the existence, structure, type, value of other input parameters and/or output parameters and/or the existence and value of supplementary parameters.
- the information of the cost function may also be automatically derived from the instruction by the compiler 250. Automatic derivation of the cost may involve analyzing the function and is especially useful for simple instructions.
- the compiler 250 may use at least one of the code, the name, the varying number of parameters, and/or the executable binary representation of the instruction in order to automatically derive the cost of the function.
- supplementary metadata associated with the function suitable to evaluate the computational cost at runtime may be used by the compiler.
- metadata may be created during the authoring of the function.
- the programmer of the program 200 may extend the collection of supported instructions with their own functions and/or operations.
- the compiler 250 may also know a predefined set of instructions and associated cost functions and/or may be able to match instructions in the program code with the cost function of the known instructions.
- The compiler 250 may decide to combine the cost functions of corresponding instructions into a single segment cost function or to replace the cost function for individual or for all combined instructions with a replacement cost function similarly representing the effort of the segment for a certain instance of the input data structures.
- the segment cost function may be implemented and/or evaluated by considering the cost functions of all individual instructions in the segment. This may be realized by walking the nodes of the sequence graph 210 and evaluating the cost function of each node representing an instruction.
- the compiler 250 may apply optimizations in order to simplify the resulting cost function(s). Such optimizations may include the inlining of instruction cost functions, the omission of such parameters which do not (significantly) contribute to improving the quality / exactness of the cost value for a segment, partial or complete aggregation of instructions in a segment and other methods which reduce the computational overhead of evaluating a segment cost function at runtime. For example, if the compiler 250 identifies for a program segment that the cost relies solely on the number of elements in its input parameter 10, other information such as the number of dimensions and/or the structure of the input parameter 10 may be omitted from the resulting cost function.
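- As a non-limiting illustration, combining instruction cost functions into a segment cost function and simplifying the result may be sketched in Python as follows. The function names, the argument list (`numel`, `numdim`) and the coefficients are hypothetical; the simplification shown is only valid under the stated assumption that every instruction cost is linear in the number of elements and independent of all other parameters.

```python
def make_segment_cost(instruction_costs):
    """Segment cost evaluated by visiting every instruction cost function in turn."""
    def segment_cost(numel, numdim):
        return sum(f(numel, numdim) for f in instruction_costs)
    return segment_cost


def simplify_linear(instruction_costs):
    """Assumed optimization: all costs are linear in numel and ignore numdim,
    so they can be folded into one coefficient ahead of time."""
    coefficient = sum(f(1, 0) for f in instruction_costs)
    return lambda numel: coefficient * numel


costs = [
    lambda numel, numdim: 1.0 * numel,
    lambda numel, numdim: 4.0 * numel,
    lambda numel, numdim: 0.5 * numel,
]
full = make_segment_cost(costs)     # evaluates three functions per call
fast = simplify_linear(costs)       # one multiplication per call, numdim omitted
```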
- instruction information is provided by metadata attributes.
- Each instruction may provide a way to query metadata for the instruction to the compiler.
- every suitable ILNumerics function is decorated with a certain attribute that supports the compiler in identifying the function as an instruction.
- the attribute type is known to the compiler and can be used to provide comprehensive metadata of the function via a common interface implemented by the attribute type.
- the function metadata typically contain data for determining the size and the structure of each generated instance of output data structures that may be an array, data for determining the 'effort' the instruction (numerical operation) spends for generating an instance of an output data structure, and optional data for deriving kernel code for a specific processing unit and/or a generic processing unit type, the processing units being suitable for performing the numerical operation on a set of instances of input data structures.
- the above supporting function metadata is typically provided individually for each output the instruction is able to produce and for each type of processing unit the framework supports.
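- As a non-limiting illustration, attaching function metadata to an instruction and querying it may be sketched in Python with a decorator. The decorator `accelerator_metadata`, the metadata keys and the effort values are hypothetical and do not reproduce the attribute type of any existing framework.

```python
def accelerator_metadata(size, effort):
    """Hypothetical decorator attaching function meta information to an instruction."""
    def decorate(func):
        func.meta = {"size": size, "effort": effort}
        return func
    return decorate


@accelerator_metadata(
    size=lambda numel: numel,           # the output has as many elements as the input
    effort={"cpu": 1.0, "gpu": 0.1},    # assumed normalized effort per element and unit type
)
def elementwise_abs(values):
    return [abs(v) for v in values]


def plain_helper(values):
    return values


def is_instruction(func):
    # A compiler-like tool may recognize instructions by the presence of metadata.
    return hasattr(func, "meta")


def estimated_effort(func, numel, unit):
    return func.meta["effort"][unit] * func.meta["size"](numel)
```

- In this sketch `is_instruction(elementwise_abs)` holds while `is_instruction(plain_helper)` does not, and the effort is stated separately for each type of processing unit.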
- the high-level ILNumerics function abs(In0) for calculating absolute or magnitude element values of instances of input data In0 is decorated with an ILAcceleratorMetadata-attribute which contains a link to a class ILMetadata_abs001. Both are provided by the author of the abs function of the ILNumerics framework.
- the compiler can instantiate the class and query its interface to gather the following information: size information of instances of the output data structures Out0 produced by the abs(In0) instruction in terms of instances of input data In0.
- Individual size information may be provided, including but not limited to (1) the total number of elements stored in Out0 (Numel(In0)), (2) the number of dimensions used to describe the structure of Out0 (Numdim(In0)), and (3) the length (number of elements) for each dimension of Out0 (Size(In0, 0) to Size(In0, 2)) as illustrated in table I typically representing a segment size vector or list:
- Such information is typically provided in terms of a function receiving the instance of the input data structure In0 and producing a vector (size descriptor) with the size information stored as consecutive elements. Since there is only one output produced by abs(In0), the exemplary ILMetadata_abs001 class provides a single size descriptor only.
- the size descriptor function may require further arguments, corresponding to the number of arguments defined for the instruction. For example, the add(In0, In1) instruction adding the elements of two input arrays requires two array arguments, which may or may not be required for, provided to and used by the size descriptor function.
- function meta information of the abs(In0) instruction may be the 'effort' of the instruction in terms of instances of its input data In0, on a (generalized) CPU platform, relating individually to (1) the number of elements in In0, (2) the number of dimensions in In0, (3) the length of each dimension of In0.
- Such effort information may be provided as normalized values or constants forming an effort vector of floating point or integer values.
- Each element of the effort vector corresponds to an element of the size descriptor vector. The effort vector typically describes the normalized effort for the instruction (i.e. effort per element of the input data structure). Computing the scalar product of the effort vector and the size descriptor vector of an instance of the input data structure therefore yields the effort of executing the instruction with that instance.
- the 'effort' of the abs(In0) function in terms of its input data In0 on a (generalized) GPU and/or any other processing unit may be provided.
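- The scalar product described above can be illustrated with a minimal Python sketch. The function names, the maximum of three dimensions, the padding of unused dimensions with length 1, the element ordering (number of dimensions, number of elements, dimension lengths) and the effort value of 1 per element are assumptions made for illustration; they are not taken from the ILNumerics framework.

```python
from math import prod
from typing import Sequence

MAX_DIMS = 3  # assumption: matches Size(In0, 0) .. Size(In0, 2)


def size_descriptor(shape: Sequence[int]) -> list[int]:
    """Build [numdim, numel, len(dim 0), ..., len(dim MAX_DIMS - 1)]."""
    if len(shape) > MAX_DIMS:
        raise ValueError("too many dimensions")
    # Assumption: unused trailing dimensions are reported with length 1.
    padded = list(shape) + [1] * (MAX_DIMS - len(shape))
    return [len(shape), prod(shape), *padded]


def instruction_effort(effort_vector: Sequence[float],
                       descriptor: Sequence[int]) -> float:
    """Scalar product of a normalized effort vector and a size descriptor."""
    if len(effort_vector) != len(descriptor):
        raise ValueError("vectors must have equal length")
    return sum(e * s for e, s in zip(effort_vector, descriptor))


# Hypothetical effort vector for abs() on a CPU: one unit of work per element.
ABS_EFFORT_CPU = [0.0, 1.0, 0.0, 0.0, 0.0]

descriptor = size_descriptor((1000, 2000))
effort = instruction_effort(ABS_EFFORT_CPU, descriptor)
```

- In this sketch only the element at index 1 of the effort vector is non-zero, so the resulting effort scales with the total number of elements of the input instance.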
- Other embodiments may use other methods of providing the effort information of the instruction at runtime. Such methods can be simpler than the one described above, omitting information from the computation or substituting some information with suitable estimates. They can also be more complex, taking the same and/or other properties of related entities into account.
- the effort information may be provided as a constant instead of a function.
- the computation may take into account the values and/or the types of elements of the input data structure instances, state information of the processing unit, of the host controller and/or of other system components, further information provided by the input/output data structure instances, the function meta information and/or the compute kernels.
- the function meta information may further include a template kernel code (compute kernel) suitable to be used as a template for a specialized kernel function body on a concrete CPU platform.
- the template is typically completed later in the program run when the concrete processing units are known.
- the template may contain placeholders which are going to be replaced later, for example at startup time, for example with properties of the processing unit. Such properties may include the number of processors, cores, cache size, cache line size, memory size and speed, bitrate, features and/or instruction set supported by the processing unit, power consumption and/or frequency of the processing unit.
- kernel code suitable to be used as a template for a specialized kernel function body on a concrete GPU and/or any other processing unit including a host controller may be provided as part of the function meta information.
- Such template kernel code may be equipped with placeholders which are replaced at start-up time with properties of the concrete processing unit, like the number of processors, number of cores, cache size, cache line size, memory size and speed, bitrate, features and/or instruction set supported by the processing unit, power consumption and/or frequency of the processing unit.
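- A template kernel with placeholders may be sketched as follows, here using Python's standard `string.Template` to fill an OpenCL-style kernel source at start-up. The placeholder names `work_group_size` and `element_type`, the kernel name and the property values are hypothetical and only serve to show the substitution step.

```python
from string import Template

# Kernel template for abs(); $-placeholders are resolved at start-up time,
# once the properties of the concrete processing unit are known.
KERNEL_TEMPLATE = Template("""\
#define WORK_GROUP_SIZE $work_group_size
typedef $element_type elem_t;

__kernel __attribute__((reqd_work_group_size(WORK_GROUP_SIZE, 1, 1)))
void abs_kernel(__global const elem_t* in0, __global elem_t* out0, const int n)
{
    int i = get_global_id(0);
    if (i < n) {
        out0[i] = fabs(in0[i]);
    }
}
""")


def specialize(template: Template, device_properties: dict) -> str:
    """Replace every placeholder; raises KeyError if a property is missing."""
    return template.substitute(device_properties)


# Hypothetical properties detected for one device at start-up.
source = specialize(KERNEL_TEMPLATE,
                    {"work_group_size": 64, "element_type": "float"})
```

- Using `substitute` rather than `safe_substitute` makes a missing device property fail loudly at start-up instead of producing an incomplete kernel.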
- the compiler 250 uses the gathered information to create a runtime segment.
- the runtime segments may contain executable code in the same language the user program was made of.
- the compiler 250 may modify the program code in a way which causes the execution of the runtime segment instead of the original program part the segment was derived from.
- the compiler may create executable instructions compatible with the execution system and/or modify an executable resource created from the user program 200 to cause the execution of the runtime segment instead of the corresponding part of the user program 200 at runtime.
- a runtime segment may span a single instruction identified.
- the segment effort (explained below) corresponds to the instruction effort and the segment size (explained below) corresponds to the instruction size.
- a segment may span multiple instructions and the segment size and effort are computed from the set of multiple instruction efforts and sizes. See below for a description of the creation of segment effort and size.
- the runtime segment may implement multiple execution paths for the original intent of the instructions.
- One path executes the runtime segment on the CPU, potentially utilizing the same runtime system the program code was made for.
- Other paths may use low-level accelerating interfaces, like OpenCL or CUDA or OpenGL etc. to execute the runtime segment on an accelerating device, like a GPU or similar.
- a fast track is provided for scalar or small input structures. All paths are readily initialized by preloading respective compute kernels to the corresponding processing units, even though such loading may be performed in a lazy manner at the time when a compute kernel is to be executed for the first time.
- the decision which path (which device) is to be used for fastest / most efficient result creation often only depends on dynamic runtime data, for example the data meta information, in particular the size, structure, and/or values of runtime instances of the input data structures. All other conditional information has been 'baked' into the prepared segment effort by the compiler at compile time and/or at start-up time already. This is typically especially true for such information which is known at compile- or start-up time and is not a matter of dynamic changes at runtime.
- each runtime segment implements a fast switch for the dynamic decision of the execution path, based on the segment size and the segment effort.
- the segment effort corresponds to the number of operations associated with the computation of a single result element by the runtime segment. Multiple efforts can exist for a runtime segment if the runtime segment is able to produce multiple output results.
- Segment efforts are typically realized and/or implemented as functions of the size of the input data structure instance.
- Segment efforts may carry a measure of the cost of the instructions implemented in the runtime segment and accumulated from the instructions assembling the segment. Since the way the instruction is implemented on individual platforms may differ, a segment effort is typically computed from the assembling instructions for each supported platform individually.
- the segment effort is implemented as a segment effort vector of length n corresponding to the size description of a multi-dimensional array, where n corresponds to the maximum number of dimensions supported by the system, incremented by 2.
- Each segment effort vector SEV0, SEV1 may include or consist of the following elements:
- Element #0 effort of the function by means of the number of input dimensions
- Element #1 effort by means of the number of input elements
- Elements of the segment effort vector can be numeric constants or scalar numeric functions.
- the compiler may be configured to decide to fill and maintain all elements of the segment effort vector, and/or to improve and/or even optimize performance by leaving out some elements from the computations depending on the properties (complexity) of the functions in the segment.
- the segment size corresponds to the size description of an n-dimensional array and may be implemented as a segment size vector of length (n + 2), where n corresponds to the maximum number of dimensions supported by the system.
- the segment size vector may include or consist of the following elements:
- Element #1 Number of elements
- the elements of the segment size vector can be numeric constants or scalar numeric functions. Typically, the elements comprise scalar functions of instances of the input data structures. Again, the compiler may (typically prior to runtime) decide to adjust and/or even optimize the segment size vector for better performance.
- the desired processing unit for executing the runtime segment is selected by computing the expected costs for the execution of respective compute kernels on each available device.
- segment size vector SSV and the segment effort vectors SEV0, SEV1 may be determined.
- the segment size vector SSV is implemented as explained above with regard to table I and determined in accordance with the array instance A.
- For any element of the segment size vector SSV which relates to a function, the function is evaluated to obtain the element as a number. This is to be performed only once, since the size of the segment outcome does not change between devices. See below on how this may be achieved.
- the segment effort vector SEV0, SEV1 is typically determined for each processing unit (CPU - index 0, GPU - index 1). Any related element functions may be evaluated (see below for further detail).
- the computational costs for each device may be determined by computing a scalar product between the segment size vector and the segment effort vector.
- the expected (total) costs can be determined by adding computational costs and costs for potential data transfers to each device based on the segment size vector and the current data locality information as maintained in the array object A.
- the device corresponding to the lowest expected cost may be selected for execution.
- more information may be taken into account when determining the expected costs. For example, additional device properties (cache sizes, SIMD lane length on CPU vector extensions etc.), a factor corresponding to an overhead required for initiating data transfers (next to the actual data transfer costs), an overhead for triggering the execution of preloaded compute kernels, processor frequency, utilization of processing units and others may be considered.
- the device selection for execution may be based on less information than described above.
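- The selection steps above (scalar product per device, plus transfer costs where the data are not yet present, then choosing the minimum) can be sketched in Python as follows. All names and numeric values are illustrative assumptions; in particular, the transfer cost is modelled simply as a per-element factor multiplied by the number of elements at index 1 of the segment size vector.

```python
from typing import Optional, Sequence


def expected_cost(ssv: Sequence[float], sev: Sequence[float],
                  transfer_factor: float, buffer: Optional[object]) -> float:
    """Computational cost (scalar product) plus transfer cost if data is absent."""
    compute = sum(e * s for e, s in zip(sev, ssv))
    # No transfer is needed when the array already has a buffer on the device.
    transfer = 0.0 if buffer is not None else transfer_factor * ssv[1]
    return compute + transfer


def select_device(ssv, sevs, transfer_factors, location_vector):
    """Return (index of the cheapest device, list of expected costs)."""
    costs = [expected_cost(ssv, sev, factor, buffer)
             for sev, factor, buffer in zip(sevs, transfer_factors, location_vector)]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs


# Hypothetical values: a 100 x 100 array that currently lives on the CPU only.
ssv = [2, 10000, 100, 100, 1]
sevs = [[0, 1.5, 0, 0, 0],      # device 0 (CPU)
        [0, 0.0625, 0, 0, 0]]   # device 1 (GPU)
transfer_factors = [0.0, 0.2]
location_vector = [object(), None]

best, costs = select_device(ssv, sevs, transfer_factors, location_vector)
```

- With these assumed numbers the GPU is selected even though the data must be transferred first, because the lower computational cost outweighs the transfer cost.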
- a transfer cost table TCT is typically created at start-up time (startup phase) once all devices suitable for performing numerical operations (e.g. based on a predefined version of OpenCL) are identified.
- the table TCT may be cached for decreased startup overhead.
- array data may live on any device (not only on the host).
- the costs of the transfer may be estimated based on a transfer factor corresponding to the normalized transfer costs from the current device to the other device.
- the transfer factor may be acquired by measurement at start-up or based on certain heuristics delivered with the systems.
- the transfer factor may be multiplied with the actual array size at runtime.
- Further, the transfer factor may be refined during the execution of the system based on actual measurement data. This can help to account for resource state changes of the runtime computing system.
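- A transfer cost table with runtime refinement might look like the following Python sketch. The class name, the initial factor and the use of an exponential moving average for the refinement are assumptions; the description above does not prescribe a particular refinement rule.

```python
class TransferCostTable:
    """Normalized (per-element) transfer costs between pairs of devices."""

    def __init__(self, num_devices: int, initial_factor: float):
        # Transfers from a device to itself are free.
        self.factors = [[0.0 if src == dst else initial_factor
                         for dst in range(num_devices)]
                        for src in range(num_devices)]

    def estimate(self, src: int, dst: int, numel: int) -> float:
        """Transfer factor multiplied with the actual array size."""
        return self.factors[src][dst] * numel

    def refine(self, src: int, dst: int, numel: int,
               measured_cost: float, weight: float = 0.25) -> None:
        """Blend a measured per-element cost into the stored factor.

        Assumption: an exponential moving average is used for refinement.
        """
        observed = measured_cost / numel
        old = self.factors[src][dst]
        self.factors[src][dst] = (1.0 - weight) * old + weight * observed


table = TransferCostTable(num_devices=2, initial_factor=0.2)
before = table.estimate(0, 1, 1000)
table.refine(0, 1, numel=1000, measured_cost=400.0, weight=0.5)
after = table.estimate(0, 1, 1000)
```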
- location vectors LV are used. Runtime instances of data structures (arrays) may co-exist on multiple storage locations. A location vector of references to such locations is typically maintained by each array operation.
- the length of the location vector LV may correspond to the number of supported devices and is typically of constant length during the execution of the program, but may vary between individual execution runs and/or computing systems.
- Each of the elements of the location vector LV typically corresponds to one (fixed) device index 0, 1 for all runtime segments.
- Device indices may be assigned at startup time of the program as part of the determination of the processing units.
- the element values of the location vector LV may point to a reference to the data buffer on the respective device.
- the elements of the location vector LV may be non-zero, if the array data are (currently) available on the device corresponding to the element index, and zero otherwise.
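- A location vector as described above may be sketched in Python like this. The class and method names are hypothetical, and `None` stands in for the "zero" entry of a device that does not currently hold the array data.

```python
from typing import List, Optional


class LocationVector:
    """One slot per supported device; the slot index is the fixed device index."""

    def __init__(self, num_devices: int):
        self._slots: List[Optional[object]] = [None] * num_devices

    def __len__(self) -> int:
        return len(self._slots)

    def set(self, device_index: int, buffer_ref: object) -> None:
        """Record that the array data is available on the given device."""
        self._slots[device_index] = buffer_ref

    def clear(self, device_index: int) -> None:
        """Record that the array data is no longer available on the device."""
        self._slots[device_index] = None

    def is_available(self, device_index: int) -> bool:
        return self._slots[device_index] is not None

    def available_devices(self) -> List[int]:
        return [i for i, ref in enumerate(self._slots) if ref is not None]


lv = LocationVector(num_devices=2)
lv.set(0, bytearray(16))  # data currently exists on device 0 (e.g. the CPU)
```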
- exemplary transfer costs of 0.2 per element of the input array A have to be considered for the GPU as illustrated in Fig. 6 by the transfer cost vector TCV1 , while no transfer costs have to be considered for the CPU resulting in a corresponding transfer cost vector TCV0 filled with zeros.
- the expected costs for executing the runtime segment on the CPU are much higher than the expected costs for executing the runtime segment on the GPU (90000 compared to only 16980).
- the GPU may be selected for execution.
- Figure 6 only illustrates cost determination at runtime for a numerical operation (3+sin(A)) which is to be performed element by element resulting (for the exemplary processing units) in segment effort vectors SEV0, SEV1 each having only one non-zero element (namely the second one referring to costs depending on the total number of stored elements).
- When the compiler starts building the segment at compile time, it may first identify the code tree or abstract syntax tree (AST) of the segment shown on the right of Figure 7.
- the tree is relatively simple. More complex trees are supported as well. Typically, each node of the tree corresponds to an instruction of the program or to data arguments (constants, arrays). The compiler starts by walking along the nodes of the tree and querying the metadata of every instruction involved.
- the compiler may build segment effort vectors, segment size vectors, and segment kernel templates for every supported processing unit category such as generic GPU, generic CPU, CPU host, generic DSP, scalar device and so forth. Note that at this stage the effort vectors created for individual processing unit categories do not carry any information specific to concrete devices yet. Rather, the individual effort vectors account for individual ways to implement and execute the instructions on the various processing unit categories. For example, some device categories require the kernel code to also implement looping structures and/or thread management overhead in order to execute some instructions efficiently while other device categories don't.
- The segment size vector has to be built only once since it does not change on individual processing units.
- Every output slot of the segment may get its own segment size vector assigned.
- depending on the complexity of the instructions dealt with in the AST, results produced by individual output slots of the segment may correspond to individual computational costs.
- Some embodiments take such differences into account by maintaining individual effort vectors for individual outputs of a segment. However, for the sake of clarity this description assumes the segments to be sufficiently simple so that all outputs are created for every execution of the segment at once, hence a single effort vector for each processing unit (category) is able to represent the actual computational costs for all outputs.
- Instructions in a segment which do not change the size of their inputs may be aggregated into a single size function. Segments comprising only such instructions typically get a size function vector which reflects the size descriptor of the corresponding input assigned to the output slot.
- the final segment size of each output slot may be realized as a tree of "size-changing nodes" corresponding to the segment syntax tree. Every node size function may modify the size of its input(s) accordingly and generate a new size which serves as the input for the size function of its parent node.
- the size tree described above is typically stored in the segment for later evaluation.
- the size information may be adjusted and/or even optimized by the compiler and/or stored in a specialized format in order to facilitate more efficient segment size evaluation at runtime.
- the segment 3+sin(A) implements two instructions: Add() and Sin().
- Add() is a binary operator, adding the elements of two input arrays. Since one of the two arguments is the scalar constant '3' the size of the output generated by Add() solely depends on the size of the second input (array) argument to Add(), according to common dimension broadcasting rules. Therefore, Add() is considered a non-size changing instruction here.
- the Sin() instruction computes the sine of all elements in the input array.
- the size of the output generated by Sin() equals the size of the input array. Therefore, both instructions in the segment are non-size changing and the segment size corresponds to the input size.
- the size functions may have to be evaluated recursively at runtime, based on the tree of instructions and corresponding metadata.
- the compiler may decide to limit the span of functions for a segment accordingly, in order to keep the segment size evaluation simple.
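- The recursive evaluation of size functions over the instruction tree can be sketched in Python for the segment 3+sin(A). The node class, the helper names and the additional size-changing reduction node are assumptions made for illustration; the broadcasting helper follows common right-aligned broadcasting rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]
SizeFn = Callable[[List[Shape], Dict[str, Shape]], Shape]


@dataclass
class SizeNode:
    """A node of the size tree; its size function maps child sizes to a new size."""
    size_fn: SizeFn
    children: List["SizeNode"] = field(default_factory=list)

    def evaluate(self, inputs: Dict[str, Shape]) -> Shape:
        child_shapes = [child.evaluate(inputs) for child in self.children]
        return self.size_fn(child_shapes, inputs)


def array_input(name: str) -> SizeNode:
    return SizeNode(lambda _shapes, inputs: inputs[name])


def scalar_constant() -> SizeNode:
    return SizeNode(lambda _shapes, _inputs: ())


def keep_size(shapes: List[Shape], _inputs) -> Shape:
    """Non-size-changing unary instruction such as Sin()."""
    return shapes[0]


def broadcast_size(shapes: List[Shape], _inputs) -> Shape:
    """Binary element-wise instruction such as Add() with broadcasting."""
    a, b = shapes
    result = []
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and 1 not in (da, db):
            raise ValueError("shapes cannot be broadcast")
        result.append(da if db == 1 else db)
    return tuple(reversed(result))


def reduce_dim0(shapes: List[Shape], _inputs) -> Shape:
    """Hypothetical size-changing instruction: reduction along dimension 0."""
    return (1,) + shapes[0][1:]


# 3 + sin(A): Add(constant 3, Sin(A))
segment = SizeNode(broadcast_size,
                   [scalar_constant(), SizeNode(keep_size, [array_input("A")])])
segment_shape = segment.evaluate({"A": (1000, 2000)})
reduced_shape = SizeNode(reduce_dim0, [segment]).evaluate({"A": (1000, 2000)})
```

- Because both instructions of 3+sin(A) are non-size changing, a compiler could collapse this tree into a single size function that returns the size of A; the recursion is only needed once size-changing nodes are involved.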
- the segment effort vectors SEV0 to SEV1 may be created in a similar way. Metadata of every instruction are queried by the compiler for each specific processing unit category individually. The effort for each instruction may be provided as a simple vector of scalars or of scalar functions. Typically, the segment effort vectors SEV0 to SEV1 correspond to the respective normalized instruction effort of computing a single element of the corresponding output. Note that individual processing units PU0 to PU2 may associate different efforts to compute a single output element, depending on, for example, different instruction sets being supported by the processing unit, different implementations of the intended numerical operation on the processing units, and/or different support for the required numerical element data type or numerical precision.
- the effort data vectors for non-size changing array instructions are typically aggregated into a single effort vector by addition of the effort vectors of non-size changing instructions involved.
- Figure 8 illustrates one embodiment, where the effort of computing a single element of the output produced by the segment 3+sin(A) is shown. Both instructions are non-size changing instructions. Hence, every element of the output requires the computation of one addition and one evaluation of the (intrinsic) sine function. The efforts for both operations may be or are provided by the author of the instruction in terms of instruction metadata and are added to form the segment effort vector.
- Often, the effort of computing a single element of the output is associated with the overall number of elements to be computed.
- segment effort vector is sparse in that only a single element is non-zero, namely the element corresponding to the index of the information of the number of elements in a size descriptor.
- the effort to create a single element is stored in the segment effort vectors SEV0 to SEV2 at index 1 (2nd index value), corresponding to the position where the number of elements of an array is stored in an array size descriptor and the SSV, respectively.
- the stored value 6 is the result of adding the single element effort 1 of the Add() instruction and the single element effort 5 of the Sin() instruction.
- Other elements of the segment effort vector are 0, indicating that the segment effort can solely be computed in terms of the overall number of elements in the input data structure at runtime.
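- The aggregation of instruction efforts for 3+sin(A) can be reproduced with a short Python sketch. The per-element efforts 1 for Add() and 5 for Sin() are the values stated above; the vector length of 5 and the function name are assumptions.

```python
from typing import List, Sequence


def aggregate_efforts(instruction_efforts: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise sum of the effort vectors of non-size-changing instructions."""
    return [sum(column) for column in zip(*instruction_efforts)]


# Index 1 holds the effort per output element; all other entries are zero.
ADD_EFFORT = [0, 1, 0, 0, 0]
SIN_EFFORT = [0, 5, 0, 0, 0]

segment_effort = aggregate_efforts([ADD_EFFORT, SIN_EFFORT])
```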
- Other segments may implement other instructions, for example the aggregating instruction Sum().
- the corresponding instruction effort may be computed based on other size information, for example the number of dimensions or the length of individual dimensions of the input data structures.
- the effort evaluation tree can be used at runtime to gather the segment effort vector of the segment for a certain set of input data structure instances. To this end, the instruction effort functions and the instruction size functions stored in the nodes of the tree are evaluated appropriately (e.g. recursively or by walking the nodes of the tree), taking into account the concrete sizes of the data structure instances (input and intermediate argument sizes).
- This way the effort evaluation tree allows the anticipation of the effort required to compute the numerical operation of the segment on a specific processing unit with specific data structure instances based on associated metadata - without actually having to carry out the operation on the processing unit.
- the instruction effort vectors are maintained by the instruction author for each device category individually. This takes into account the fact that individual devices may implement the same operation in a different manner, or the same operation causes different execution costs on individual device categories.
- segment effort vectors after the compiler stage may represent the effort of creating a single element
- the segment effort vectors may be further refined by taking further data into account.
- Figure 9 illustrates the adaptation of the segment effort vector with information about the number of cores available for two processing units, a CPU and a GPU.
- the number of cores of the available processing units may be determined and the predetermined segment effort vectors SEV0 to SEV2 updated accordingly.
- the detected 4 cores for the CPU allow the execution of 4 instructions at the same time.
- the elements of the predetermined corresponding segment effort vector SEV1 are divided by 4.
- the GPU may have 96 cores and the segment effort vector SEV2 for the GPU is divided by 96 (or multiplied by approximately 0.01).
- the predetermined segment effort vector SEV0 (for reasons of clarity not shown in Fig. 9) representing the effort to execute the segment in a sequential / scalar manner on the CPU may or may not be updated here.
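- The start-up refinement by core count can be sketched in Python as follows. Using the same base effort of 6 per element for both device categories is an assumption made to keep the sketch short; in practice each category has its own predetermined effort vector.

```python
from typing import List, Sequence


def scale_by_cores(effort_vector: Sequence[float], num_cores: int) -> List[float]:
    """Divide every effort element by the number of cores working in parallel."""
    if num_cores < 1:
        raise ValueError("num_cores must be positive")
    return [e / num_cores for e in effort_vector]


# Assumed predetermined per-element effort for the segment.
base_effort = [0.0, 6.0, 0.0, 0.0, 0.0]

sev_cpu = scale_by_cores(base_effort, 4)    # 4 detected CPU cores
sev_gpu = scale_by_cores(base_effort, 96)   # 96 detected GPU cores
```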
- further and/or other data are taken into account in order to create and/or update the segment effort information and/or further and/or other factors than the number of elements in the size descriptor of data structure instances may be used.
- further device specific properties like kernel execution trigger overhead, length of CPU vector extension units, memory bus transfer rates between devices, core grouping information, and/or actual or intended device utilization may be considered.
- further size information may be used to relate effort information to the actual data structures instantiated for execution of a segment.
- constant factors may be included in a size descriptor or added to the computed effort.
- quantities such as the number of dimensions, the number of elements, the length of dimensions, integer or fractional powers of these, or further numbers may be used to describe the size of a data structure instance.
- Any information related to the influence of the data structure instance on the effort of executing the segment may be used inside a segment effort evaluation, including, for example, values of elements of the data structure, symmetry information, data range information, and boolean or integer flags describing further properties of the data structure.
- At compile stage, one kernel template is typically created for each device category.
- the kernel templates implement the numerical operation intended by the instructions of the segment.
- Such templates may be used to efficiently implement and adapt the kernel code at start-up stage when the actual execution units are known.
- Examples of adaptation include the consideration of device properties like cache sizes, cache line lengths, SIMD vector extension unit lengths, supported feature set, precision supported by the device, frequency, number of cores, etc.
- Such information may be used to adapt the kernel and/or the kernel templates for faster execution on the processing units.
- the overhead of the method is desired to be kept as small as possible. This down-shifts the break-even point, more particularly the size of array data instances above which the advantages of the method by execution on supplementary devices exceed the overhead added by the method itself.
- the arrays may store the locations of device buffers, where their elements have been copied to. This way, the information stored in an array may exist on multiple device memories at the same time. All such devices can access the array with little overhead.
- the compiler may decide to support memory management on the processing units by optimizing segment implementations for the intent of its arguments.
- 'intent' refers to at least one of the lifetime and the mutability of an argument. Therefore, a distinction between the intents of segment arguments can be built into the language (distinct InArray, OutArray, LocalArray and ReturnArray types). Utilizing immutability (read-only input arrays) and volatility (self-destructing return array types) saves array copies and realizes an early, deterministic disposal for array storages - even for languages which do not support deterministic object destruction from the start.
- Device buffer storage may be allocated for input array arguments on demand and gets the same lifetime as the array's storage itself. Once copied to the device the buffer stays there until the array is disposed or changed. This way a cheap re-utilization of the array (buffer) in a potential subsequent segment invocation is enabled.
- Segment output arguments typically invalidate (dispose / release) all other storage locations for the array and a new buffer becomes the only storage location for the array.
- Released buffers may also stay on the device for later reuse (pooling).
- Array data storage is typically maintained separately from the array size descriptors. This allows sharing (large) data storages among subarrays of the same source array, differing only by individual size descriptors (sometimes referred to as 'views').
- Arrays may create lazy copies of their storages only ('lazy copy on write').
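- The 'lazy copy on write' behaviour may be sketched in Python as below. The class names and the owner counter are hypothetical; the sketch only shows that a copy shares the storage until the first write and that the size descriptor is kept separately from the storage.

```python
class Storage:
    """Element storage, kept separately from any size descriptor."""

    def __init__(self, values):
        self.values = list(values)
        self.owners = 1  # number of arrays currently sharing this storage


class Array:
    def __init__(self, storage: Storage, shape: tuple):
        self._storage = storage
        self.shape = shape  # size descriptor, independent of the storage

    @classmethod
    def from_values(cls, values, shape: tuple) -> "Array":
        return cls(Storage(values), shape)

    def lazy_copy(self) -> "Array":
        """Create a copy that shares the storage until one side writes."""
        self._storage.owners += 1
        return Array(self._storage, self.shape)

    def shares_storage_with(self, other: "Array") -> bool:
        return self._storage is other._storage

    def get(self, index: int):
        return self._storage.values[index]

    def set(self, index: int, value) -> None:
        if self._storage.owners > 1:
            # Detach: copy the elements only now, on the first write.
            self._storage.owners -= 1
            self._storage = Storage(self._storage.values)
        self._storage.values[index] = value


a = Array.from_values([1, 2, 3, 4], shape=(2, 2))
b = a.lazy_copy()
shared_before_write = a.shares_storage_with(b)
b.set(0, 99)
```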
- runtime segments may be executed asynchronously on devices supporting it.
- one device at a time (often the host controller) is responsible for determining the best suited processing unit for a runtime segment and/or for queueing the runtime segment to the determined processing unit asynchronously. I.e.: the controlling device does not wait for the execution of the runtime segment's compute kernel to have finished until the method proceeds with the next runtime segment.
- Figure 10 illustrates method steps associated with the asynchronous execution of multiple runtime segments and instructions involving the creation, accessing and releasing of, and the computation with, data structures capable of storing multiple elements of a common data type such as array objects.
- two additional items are stored and maintained in array objects for each supported device, namely a buffer handle and a sync handle.
- the buffer handle stores a reference to the buffer of the array data (data structure instance) on the device specific memory, if applicable.
- the type of the buffer handle may vary with the device type and / or the interface used to access the devices.
- the synch handle may be used to synchronize access to the buffer storage for a specific device.
- Synch handles may either be obtained from device specific interfaces (e.g. OpenCL, MPI) or created and maintained by the host application codes.
- a CPU host device implementing memory as managed heap (device 0) and a GPU (device 2) with dedicated GPU memory.
- the location indices assigned to the devices are a matter of choice. While the CPU was assigned the index 0 and the GPU was assigned the index 2 in Figure 10, the (device) location indices in the set of supported devices can be different without affecting this scheme.
- the supported devices are identified and a fixed index is typically assigned to each device. This index is used throughout the whole program execution to locate the device in the location array.
- Figure 10 shows, with the focus on memory management and synchronization, a schematic view of the actions used to asynchronously compute the following array instruction for an input array A:
- the lines [1] and [7] establish a code block which associates a life time scope to the array A.
- Array scope is included into the following consideration since it allows for deterministic destruction, release, and reuse of the array storages. This is expected to outperform non-deterministic memory management via, e.g. garbage collectors (GC) in most situations, since it allows for releasing and reusing memory in a timely, deterministic manner and saves the overhead of garbage collection and associated costs.
- explicit scoping simplifies the memory management by allowing the host to work in single-threaded mode, saving synchronization overhead otherwise introduced by locking or other synchronization schemes. Some GC implementations may interfere here due to their multithreaded, non- deterministic object destruction and/or finalization. In this description it is assumed that the language used is able to provide scoping and lifetime information of its objects, respectively.
- the array A is created as a matrix with exemplary 1000 rows and 2000 columns filled with random numbers on the host controller.
- the compiler may identify this array instruction as a (single instruction) segment and at runtime it may be decided to compute the random numbers on the GPU (device 2).
- the synch handle may be a simple shared, atomic (universal or main) counter. Alternatively, it might be a more sophisticated synchronization object provided by the host framework and / or supported by the operating system or other related execution layers of the system.
- a pointer to the buffer may be provided and stored into the same device slot of the array A that corresponds to the GPU device.
- the buffer handle is used to access the actual storage on the device. Its type may depend on the technology used to access / interface with the device (e.g. OpenCL, MPI, OpenMP).
- the buffer handle may be provided in the same call which provided the synch handle. Or it may be provided in a later call using the synch handle to query the buffer pointer from the device. Or the creation of the buffer may be implemented in a synchronous manner altogether.
- the execution of the selected compute kernel is triggered. Such triggering may be performed by queueing a corresponding command on a command queue or similar, as provided by the low-level interface used to access the processing unit.
- Both the synch handle and the buffer handle are typically provided to the kernel.
- the call, again, is performed asynchronously. It returns without waiting for the completion of any (potentially time consuming) action associated with the call.
- the compute kernel waits on the synch handle in case that the last operation was not finished and the buffer may not be accessible yet. This waiting and any potentially expensive and/or time consuming computations in the compute kernel are performed without delaying the host processor thread.
- the call which triggers the asynchronous execution of the compute kernels provides a second synch handle to the host.
- the second synch handle is used to identify the kernel operation later on. Note how both asynchronous operations are chained: the kernel execution waits for the buffer creation. Hence the second synch handle returned is suitable to identify a ready state of the device buffer after completion of all previous commands. Therefore, it is sufficient to store the second synch handle by replacing the first one - which is no longer needed at this point in the program.
- the synch handles may not be chainable. Each operation may require a new handle to be generated. Synchronization may be performed without support by the low-level device interface.
- After the computation of rand(1000,2000) (seg_001 in Figure 10) was triggered, the host stores the synch handle returned by the call in the array storage location slot for device 2 and immediately continues with subsequent operations.
- the existence of the synch handle in the storage location slot informs other consumers of the array that an operation utilizing the array's storage is still in progress. Operations which may have to consider sync handle existence include accessing the buffer as output buffer (i.e.: modifying the buffer by a kernel) and the release of the buffer.
- the host synchronizes the new segment's execution with the previous execution by providing the buffer handle and the synch handle stored in the array location slot for device 2.
- the kernel immediately returns a new synch handle and waits for the synch handle provided (if any) before starting execution.
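The chaining of synch handles described above can be sketched in Python. This is a minimal illustration, not the patented implementation: `DeviceQueue`, `enqueue`, `fill`, `double` and the `slots` dictionary are hypothetical names, and a single-worker thread pool stands in for an in-order device command queue, with a `Future` playing the role of the synch handle.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


class DeviceQueue:
    """Hypothetical stand-in for an in-order device command queue."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=1)

    def enqueue(self, kernel: Callable[[List[int]], None], buffer: List[int],
                wait_for: Optional[Future] = None) -> Future:
        def run() -> None:
            if wait_for is not None:
                wait_for.result()  # the worker waits here, not the host thread
            kernel(buffer)
        # submit() returns at once; the Future acts as the new synch handle
        return self._pool.submit(run)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def fill(buf: List[int]) -> None:
    for i in range(len(buf)):
        buf[i] = i


def double(buf: List[int]) -> None:
    for i in range(len(buf)):
        buf[i] *= 2


queue = DeviceQueue()
buffer = [0] * 4
slots: Dict[int, Future] = {}  # storage location slots, keyed by device index

slots[2] = queue.enqueue(fill, buffer)                       # first synch handle
slots[2] = queue.enqueue(double, buffer, wait_for=slots[2])  # replaces the first

slots[2].result()  # explicit barrier: the host waits only when it needs the data
result = list(buffer)
queue.shutdown()
```

Because the second operation waits on the first, keeping only the most recent handle in the slot is enough to know when the buffer is ready.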
- synch handles in general are optional. The whole scheme or arbitrary combinations of individual devices may also work synchronously or implement other schemes for synchronization. Note further that the output of seg_0042 is omitted here for clarity.
- the next operation requests a local host array representing the array elements as a 1D system array object. Since A's data so far exist only on the GPU (device 2), elements of A need to be copied to the host (device 0). This copy, again, is performed asynchronously. A new synch handle is returned immediately from the copy command to device 0, this time corresponding to a new buffer on device 0. Similar to the array buffer creation above, a buffer handle may be returned immediately or requested / stored later on. Alternatively, the copy of the data may be preceded by an explicit buffer creation command. Note that the buffer handle is read from the buffer on device 2 only. Since no modifications are performed to the buffer on device 2, no new synch handle for device location 2 is needed in this step.
- a memory barrier may be introduced which allows waiting for the completion of the copy operation by means of the synch handle in the device storage location 0.
- Such a memory barrier works in the same way as known from common synchronization methods and can be used to wait synchronously for the completion of operations on any device. Similar wait mechanisms may be used, for example, when attempting to release a buffer.
- array A leaves the scope of the current code block. Since A is not referenced anymore afterwards, the storages associated with A may be disposed as soon as possible to free the memory areas consumed on the (limited) device memory. Such disposals take the existence of any synch handles into account.
- the method may use implicit memory barriers (as with the WaitSynchHandle() method in Fig. 10) before releasing the memory. Alternatively, asynchronous release operations may be used if the device interface supports them.
- the method may or may not implement a 1 :1 relation between array objects and device buffers.
- reference counting could be used to decrement a reference counter of the shared storage object. The buffers are released as soon as the reference counter indicates that no further arrays exist referencing this storage.
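A reference-counted storage that honours pending synch handles before disposal might look as follows. This is a sketch under stated assumptions: `SharedStorage`, `add_ref` and `release_ref` are hypothetical names, a `Future` again stands in for a synch handle, and clearing the slot dictionary stands in for freeing device memory.

```python
from concurrent.futures import Future
from typing import Dict, List, Optional, Tuple


class SharedStorage:
    """Hypothetical storage object shared by several array objects."""

    def __init__(self) -> None:
        self.ref_count = 0
        # device index -> (buffer, pending synch handle or None)
        self.slots: Dict[int, Tuple[List[float], Optional[Future]]] = {}
        self.released = False

    def add_ref(self) -> None:
        self.ref_count += 1

    def release_ref(self) -> None:
        self.ref_count -= 1
        if self.ref_count == 0:
            for _buffer, handle in self.slots.values():
                if handle is not None:
                    handle.result()  # implicit memory barrier before disposal
            self.slots.clear()       # stands in for freeing device buffers
            self.released = True


finished = Future()
finished.set_result(None)  # an operation that has already completed

storage = SharedStorage()
storage.slots[2] = ([1.0, 2.0], finished)
storage.add_ref()  # array A references the storage
storage.add_ref()  # array B shares the same storage

storage.release_ref()  # A leaves scope: B still references the buffers
released_after_first = storage.released
storage.release_ref()  # B leaves scope: buffers are disposed
```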
- Asynchronous execution can help to keep the devices of a heterogeneous computing system busy and to utilize the existing computing resources more efficiently and therefore to finish the execution of the sequence earlier or to consume less energy.
- a measure for the cost of the segment is computed anyway. This measure can also be used as a representation of the number of instructions 'waiting' in the queue. The accumulated measure of all segments currently 'waiting' in the queue for execution gives an indication of the operations 'ahead' in the device queue.
- the fact that each item may correspond to an individual number of (numeric) operations can be taken into account.
- the decision for the 'optimal' device may also be based on this cost 'ahead'.
- If a device has many segments in its queue, it can be better to invest the cost of a data buffer copy to another device and perform/queue the segment on the other device instead.
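To illustrate how the cost 'ahead' can shift the decision, the following sketch adds the accumulated cost of queued segments and the cost of copying the data buffer to the cost of the segment itself. The function name `pick_device`, the device labels and all numbers are illustrative assumptions, not values from the source.

```python
from typing import Dict, Tuple


def pick_device(segment_cost: Dict[str, float],
                queued_cost: Dict[str, float],
                copy_cost: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Return the device with the lowest total cost and all totals."""
    totals = {
        dev: queued_cost[dev] + copy_cost[dev] + segment_cost[dev]
        for dev in segment_cost
    }
    return min(totals, key=totals.get), totals


# The data currently live on the GPU, so no copy is needed there,
# but the GPU queue already holds a lot of work.
best, totals = pick_device(
    segment_cost={"cpu": 40.0, "gpu": 10.0},
    queued_cost={"cpu": 0.0, "gpu": 100.0},
    copy_cost={"cpu": 20.0, "gpu": 0.0},
)
```

With these assumed numbers the CPU is chosen even though the segment alone would run faster on the GPU.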
- a low-level interface, e.g. OpenCL
- If the low-level device interface or the higher-level management layer decides that an order different from the one in which the segments were originally lined up in the sequence is advantageous in terms of performance or other aspects, and that exchanging the order does not introduce negative side effects to the sequence results, it may rearrange the order of execution.
- Rearranging the order of execution may even allow executing some runtime segments on multiple devices concurrently. For example, rearranging a sequence of segments may group together such segments which depend on each other's result(s). Also, it can group together such segments, which require the same or mostly the same arguments / data buffers. The grouping can also happen based on the current locality of the data buffers. All this can increase the chance for such groups of segments to be concurrently executable in an efficient manner on multiple devices. For example, a first group of rearranged runtime segments may be asynchronously executed on the first device while subsequent runtime segments forming a second group may be queued to the second device for asynchronous execution. Since data dependencies between the groups are kept low, both runtime segment groups can be executed with low synchronization effort.
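One simple way to form such groups is to put segments that touch a common data buffer into the same group, so that different groups have no buffers in common. The sketch below assumes each segment is described only by the set of buffer names it uses; `group_segments` and the segment and buffer names are hypothetical, and a real scheduler would also consider costs and data locality.

```python
from typing import Dict, List, Set, Tuple


def group_segments(segments: Dict[str, Set[str]]) -> List[List[str]]:
    """Group segments that (transitively) share at least one buffer."""
    order = list(segments)  # original sequence order
    groups: List[Tuple[Set[str], List[str]]] = []
    for name, buffers in segments.items():
        merged_buffers = set(buffers)
        merged_names = [name]
        remaining = []
        for group_buffers, group_names in groups:
            if group_buffers & merged_buffers:
                # shared buffer: fold the existing group into the new one
                merged_buffers |= group_buffers
                merged_names += group_names
            else:
                remaining.append((group_buffers, group_names))
        remaining.append((merged_buffers, merged_names))
        groups = remaining
    # keep the original relative order inside every group
    return [sorted(names, key=order.index) for _, names in groups]


sequence = {
    "seg_1": {"A", "B"},
    "seg_2": {"C"},
    "seg_3": {"B"},
    "seg_4": {"C", "D"},
}
groups = sorted(group_segments(sequence))
```

Here `seg_1` and `seg_3` form one group and `seg_2` and `seg_4` another, so each group could be queued to a different device without synchronization between the groups.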
- the method is able to keep many or even all of the devices busy to an advantageous degree.
- computing resources are better utilized.
- a computer-implemented method includes identifying a program segment in a program code, typically a sequence of program segments.
- the program segment includes a computing instruction, typically a numerical operation, for a first data structure capable of storing multiple elements of a first common data type and a function meta information comprising data related to an output of the computing instruction.
- a first processing unit and a second processing unit of a computing system which are suitable for executing a respective compiled representation (typically a compute kernel) of the program segment are determined.
- the first processing unit is different from the second processing unit.
- the first processing unit is initialized with a first compiled representation of the program segment and the second processing unit is initialized with a second compiled representation of the program segment.
- the function meta information and a data meta information of a runtime instance of the first data structure are used to determine first expected costs of executing the first compiled representation on the first processing unit and to determine second expected costs of executing the second compiled representation on the second processing unit.
- the first expected costs and the second expected costs are used to select either the first processing unit or the second processing unit for executing the respective compiled representation, typically the processing unit with lower expected costs.
- the first and second costs are typically numerically determined (numerically calculated).
- the first expected costs and the second expected costs may be used to select either only one of the first processing unit and the second processing unit for executing the respective compiled representation or to determine respective shares of a workload of executing the respective compiled representation on the first processing unit and on the second processing unit.
- Sharing the workload may be invoked if the computing instruction is suitable for executing on portions of the runtime instance. This is in particular the case for computing instructions that can be performed element by element or on separable regions of the data. Examples include numerical operations such as adding matrices, calculating a function of a matrix A such as sin(A) and smoothing data.
- Sharing the workload is typically achieved by assigning a first portion of the runtime instance to the first processing unit and assigning a second portion of the runtime instance to the second processing unit.
- the workload may be shared according to the expected costs, for example inversely related to the expected costs, and/or according to the expected execution time.
- the expected execution time may be derived from the expected costs and the actual availability or workload of the processing units.
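As a worked example of sharing a workload inversely to the expected costs, the sketch below splits an element-wise sin(A) computation into two portions. The names `split_workload` and `sin_kernel` and the cost values are assumptions for illustration; both portions are computed on the host here, whereas in a real system each portion would be sent to its own processing unit.

```python
import math
from typing import List, Tuple


def split_workload(n_elements: int, cost_first: float,
                   cost_second: float) -> Tuple[int, int]:
    """Split n_elements so that each share is inversely proportional to cost."""
    # share_first = (1/c1) / (1/c1 + 1/c2) = c2 / (c1 + c2)
    n_first = round(n_elements * cost_second / (cost_first + cost_second))
    return n_first, n_elements - n_first


def sin_kernel(portion: List[float]) -> List[float]:
    """Element-wise operation, so it can run on any portion independently."""
    return [math.sin(x) for x in portion]


data = [0.001 * i for i in range(1000)]
n_first, n_second = split_workload(len(data), cost_first=1.0, cost_second=3.0)

# Each portion would go to a different processing unit; results are joined.
result = sin_kernel(data[:n_first]) + sin_kernel(data[n_first:])
```

The unit with one third of the expected cost receives three quarters of the elements.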
- the workload is shared if a runtime size of the runtime instance is sufficiently large and/or if it is found or anticipated that the utilization of at least one processing unit available on the system is low compared to other processing units.
- the compiler at runtime or a scheduling module may decide to share the underutilized processing unit(s) with one or more devices with higher utilization in order to distribute at least some of the workload from the busy device(s) to the underutilized device(s).
- Shared devices may form so called 'virtual devices' and are especially useful if asynchronous execution and out of order execution is not available for the processing units. This may be the case if the underlying low-level interface handling the access of the processing units does not support such features and/or if the instructions show too many dependencies so that a reordering of instructions / segments and/or a distribution of segments over multiple devices is not possible or not feasible.
- virtual devices enable the devices involved in the execution of segments with such distributed instances to be used according to this invention.
- the virtual device is thereby considered a processing unit and handled as described above.
- Another scenario where virtual devices are advantageous for the performance of execution sequences is shared memory devices, for example CPUs sharing memory with GPU devices.
- a virtual device comprising all or parts of the CPU cores and all or parts of the GPU cores may expose higher computational capabilities than the underlying separate devices. Since no transfer costs are involved (due to the shared memory) substituting either one of the underlying devices with the virtual device will lead to higher execution performance.
- the runtime segment includes the function meta information, a compute kernel for each of the determined processing units and executable code for determining expected costs for executing the compute kernels with a runtime instance of the first data structure using the function meta information and a data meta information of the runtime instance of the first data structure.
- the data meta information may include a runtime size information of the runtime instance, a runtime location information of the runtime instance, and a runtime type information of the runtime instance.
- Determining the function meta information may include reading the function meta information if it is already stored in the software code, searching in a look-up table storing function meta information for (already) known numerical operations such as addition, sine, FFT (fast Fourier transform) etc. and/or even deriving the function meta information based on test runs.
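The look-up step can be pictured as a small table keyed by operation name. In the following sketch the class `FunctionMeta`, the table `META_TABLE`, the helper `determine_function_meta` and the effort formulas are all assumptions made for illustration; the source does not specify concrete effort values.

```python
import math
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class FunctionMeta:
    output_size: Callable[[int], int]  # output element count for n inputs
    effort: Callable[[int], float]     # assumed relative effort for n inputs


# Hypothetical table for already known numerical operations.
META_TABLE = {
    "add": FunctionMeta(lambda n: n, lambda n: float(n)),
    "sin": FunctionMeta(lambda n: n, lambda n: 20.0 * n),
    "fft": FunctionMeta(lambda n: n,
                        lambda n: n * math.log2(n) if n > 1 else 1.0),
}


def determine_function_meta(name: str,
                            stored: Optional[FunctionMeta] = None
                            ) -> Optional[FunctionMeta]:
    """Prefer meta information stored with the code, else use the table."""
    if stored is not None:
        return stored
    # None signals that the information would have to be derived otherwise,
    # for example from test runs.
    return META_TABLE.get(name)


fft_meta = determine_function_meta("fft")
unknown_meta = determine_function_meta("my_custom_op")
```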
- the method includes initializing a first processing unit of a heterogeneous computing system with a first compute kernel and a second processing unit of the heterogeneous computing system with a second compute kernel. Both the first compute kernel and the second compute kernel are configured to perform a numerical operation derived from a program segment which is configured to receive a first data structure storing multiple elements of a common data type.
- the program segment includes a function meta information including data related to a size of an output of the numerical operation, a structure of the output, and/or an effort for generating the output.
- the function meta information and a data meta information of a runtime instance of the first data structure are used to determine first expected costs of executing the first kernel on the first processing unit to perform the numerical operation with the runtime instance and to determine second expected costs of executing the second kernel on the second processing unit to perform the numerical operation with the runtime instance.
- the data meta information includes at least one of a runtime size information of the runtime instance, a runtime location information of the runtime instance, a runtime synchronization information of the runtime instance and a runtime type information of the runtime instance.
- the method further includes one of executing the first compute kernel on the first processing unit to perform the numerical operation on the runtime instance if the first expected costs are lower than or equal to the second expected costs, and executing the second compute kernel with the second processing unit to perform the numerical operation with the runtime instance if the first expected costs are higher than the second expected costs.
- the method comprises initializing a heterogeneous computing system comprising a first processing unit and a second processing unit with a runtime segment for performing a numerical operation with a first data structure storing multiple elements of a common data type.
- the runtime segment comprises a first compute kernel configured to perform the numerical operation on the first processing unit, a second compute kernel configured to perform the numerical operation on the second processing unit, and a function meta information comprising a first numerical effort for performing the numerical operation with one element of the common data type on the first processing unit and a second numerical effort for performing the numerical operation with one element of the common data type on the second processing unit.
- a runtime instance of the first data structure is created in the heterogeneous computing system.
- a data meta information of the runtime instance is determined.
- the data meta information comprises a runtime size information of the runtime instance and a runtime location information of the runtime instance.
- the data meta information and the function meta information are used to (numerically) calculate first expected costs of executing the first compute kernel on the first processing unit to perform the numerical operation with the runtime instance and to (numerically) calculate second expected costs of executing the second compute kernel on the second processing unit to perform the numerical operation with the runtime instance.
- the first compute kernel is executed on the first processing unit to perform the numerical operation with the runtime instance if the first expected costs are lower than or equal to the second expected costs. Otherwise, the second compute kernel is executed on the second processing unit to perform the numerical operation with the runtime instance.
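The cost calculation and selection rule can be sketched as follows. The per-element efforts and the runtime size and location come from the description above; the linear cost formula, the per-unit `launch_overhead` and the `transfer_cost_per_element` parameter are assumptions added so that the example shows a size-dependent decision.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataMeta:
    size: int      # runtime size information (number of elements)
    location: int  # runtime location: index of the unit holding the data


@dataclass
class FunctionMeta:
    effort_per_element: Tuple[float, float]  # (first unit, second unit)
    launch_overhead: Tuple[float, float]     # assumed fixed cost per unit


def expected_costs(fm: FunctionMeta, dm: DataMeta,
                   transfer_cost_per_element: float) -> List[float]:
    costs = []
    for unit, effort in enumerate(fm.effort_per_element):
        cost = fm.launch_overhead[unit] + effort * dm.size
        if dm.location != unit:
            cost += transfer_cost_per_element * dm.size  # data must be copied
        costs.append(cost)
    return costs


def select_unit(costs: List[float]) -> int:
    """First unit wins ties: 'lower than or equal to'."""
    return 0 if costs[0] <= costs[1] else 1


fm = FunctionMeta(effort_per_element=(4.0, 1.0), launch_overhead=(0.0, 5000.0))

small = expected_costs(fm, DataMeta(size=1000, location=0), 2.0)
large = expected_costs(fm, DataMeta(size=10000, location=0), 2.0)

unit_for_small = select_unit(small)
unit_for_large = select_unit(large)
```

Under these assumed numbers a small array stays on the first unit, where it already resides, while a large array justifies the copy to the second unit.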
- the function meta information may further comprise data related to a size of an output of the numerical operation, and/or data related to a structure of the output.
- Prior to calculating the first and second expected costs, the function meta information, in particular the first and second numerical efforts, is typically updated in accordance with determined properties of the first processing unit and determined properties of the second processing unit.
- the first compute kernel and/or the second compute kernel may be updated in accordance with determined properties of the first processing unit and determined properties of the second processing unit, respectively, prior to calculating the first and second expected costs.
- the runtime segment typically includes control code for calculating the first and second costs and/or dynamically deciding based on the first and second costs on which of the processing units the respective compute kernel is to be executed.
- the runtime segment may be derived from a program segment defining the numerical operation.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Devices For Executing Special Programs (AREA)
- Complex Calculations (AREA)
- Stored Programmes (AREA)
Priority Applications (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US16/607,263 US11144348B2 (en) | 2017-04-28 | 2018-04-27 | Heterogeneous computing system and method including analyzing expected costs of compute kernels |
| CN201880016139.1A CN110383247B (zh) | 2017-04-28 | 2018-04-27 | 由计算机执行的方法、计算机可读介质与异构计算系统 |
| JP2019547267A JP7220914B2 (ja) | 2017-04-28 | 2018-04-27 | コンピュータに実装する方法、コンピュータ可読媒体および異種計算システム |
| EP18720259.3A EP3443458B1 (en) | 2017-04-28 | 2018-04-27 | A computer-implemented method, a computer-readable medium and a heterogeneous computing system |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| DE102017109239.0 | 2017-04-28 | ||
| DE102017109239.0A DE102017109239A1 (de) | 2017-04-28 | 2017-04-28 | Computerimplementiertes verfahren, computerlesbares medium und heterogenes rechnersystem |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2018197695A1 (en) | 2018-11-01 |
Family
ID=62063083
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2018/060932 Ceased WO2018197695A1 (en) | 2017-04-28 | 2018-04-27 | A computer-implemented method, a computer-readable medium and a heterogeneous computing system |
Country Status (6)
| Country | Link |
|---|---|
| US (1) | US11144348B2 |
| EP (1) | EP3443458B1 |
| JP (1) | JP7220914B2 |
| CN (1) | CN110383247B |
| DE (1) | DE102017109239A1 |
| WO (1) | WO2018197695A1 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4227795A1 (en) | 2022-02-15 | 2023-08-16 | ILNumerics GmbH | A computer-implemented method and a computer-readable medium |
| EP4465167A1 (en) | 2023-05-15 | 2024-11-20 | ILNumerics GmbH | A computer-implemented method and a computer-readable medium |
Families Citing this family (19)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11188348B2 (en) * | 2018-08-31 | 2021-11-30 | International Business Machines Corporation | Hybrid computing device selection analysis |
| US11604757B2 (en) * | 2019-07-17 | 2023-03-14 | International Business Machines Corporation | Processing data in memory using an FPGA |
| CN111090508B (zh) * | 2019-11-29 | 2023-04-14 | 西安交通大学 | 一种基于OpenCL的异构协同并行计算中设备间动态任务调度方法 |
| CN113590086B (zh) * | 2020-04-30 | 2023-09-12 | 广东中砼物联网科技有限公司 | 快速开发软件的方法、计算机设备、及存储介质 |
| CN112083956B (zh) * | 2020-09-15 | 2022-12-09 | 哈尔滨工业大学 | 一种面向异构平台的复杂指针数据结构自动管理系统 |
| CN112364053B (zh) * | 2020-11-25 | 2023-09-05 | 成都佳华物链云科技有限公司 | 一种搜索优化方法、装置、电子设备及存储介质 |
| CN112486684B (zh) * | 2020-11-30 | 2022-08-12 | 展讯半导体(成都)有限公司 | 行车影像显示方法、装置及平台、存储介质、嵌入式设备 |
| CN112783503B (zh) * | 2021-01-18 | 2023-12-22 | 中山大学 | 一种基于Arm架构的NumPy运算加速优化方法 |
| CN115390921A (zh) * | 2021-05-21 | 2022-11-25 | 华为技术有限公司 | 一种调度方法、装置、系统和计算设备 |
| CN114003973B (zh) * | 2021-10-13 | 2024-12-24 | 杭州趣链科技有限公司 | 数据处理方法、装置、电子设备和存储介质 |
| CN114138686B (zh) * | 2021-12-03 | 2025-06-17 | 中国航空工业集团公司西安飞行自动控制研究所 | 基于加解锁访问机制的异构处理器数据读取装置及方法 |
| CN114185687B (zh) * | 2022-02-14 | 2022-05-24 | 中国人民解放军国防科技大学 | 一种面向共享内存式协处理器的堆内存管理方法和装置 |
| CN115167637B (zh) * | 2022-09-08 | 2022-12-13 | 中国电子科技集团公司第十五研究所 | 一种易扩展可重构的计算机系统及计算机 |
| KR102625797B1 (ko) * | 2023-03-06 | 2024-01-16 | 주식회사 모레 | 파이프라인 병렬 처리 컴파일링 방법 및 장치 |
| CN116089050B (zh) * | 2023-04-13 | 2023-06-27 | 湖南大学 | 一种异构自适应任务调度方法 |
| US12360805B2 (en) * | 2023-07-10 | 2025-07-15 | Azurengine Technologies Zhuhai Inc. | Vectorized scalar processor for executing scalar instructions in multi-threaded computing |
| CN118939275B (zh) * | 2024-10-10 | 2025-01-24 | 杭州长川科技股份有限公司 | 分类参数的编译方法、调用方法、装置、设备和程序产品 |
| CN119938344A (zh) * | 2025-04-09 | 2025-05-06 | 中国科学院计算技术研究所 | 数据中心的资源调度方法、装置、存储介质及电子设备 |
| CN120950074B (zh) * | 2025-10-14 | 2025-12-26 | 长沙科梁科技有限公司 | 编程语言迁移运行方法和装置 |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017016590A1 (en) * | 2015-07-27 | 2017-02-02 | Hewlett-Packard Development Company, L P | Scheduling heterogenous processors |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8074059B2 (en) * | 2005-09-02 | 2011-12-06 | Binl ATE, LLC | System and method for performing deterministic processing |
| JP4936517B2 (ja) * | 2006-06-06 | 2012-05-23 | 学校法人早稲田大学 | ヘテロジニアス・マルチプロセッサシステムの制御方法及びマルチグレイン並列化コンパイラ |
| US8312346B2 (en) * | 2009-05-01 | 2012-11-13 | Mirics Semiconductor Limited | Systems and methods for communications |
| US8375392B2 (en) | 2010-01-12 | 2013-02-12 | Nec Laboratories America, Inc. | Data aware scheduling on heterogeneous platforms |
| US8522217B2 (en) * | 2010-04-20 | 2013-08-27 | Microsoft Corporation | Visualization of runtime analysis across dynamic boundaries |
| US20150309808A1 (en) * | 2010-12-31 | 2015-10-29 | Morphing Machines Pvt Ltd | Method and System on Chip (SoC) for Adapting a Reconfigurable Hardware for an Application in Runtime |
| US8782645B2 (en) * | 2011-05-11 | 2014-07-15 | Advanced Micro Devices, Inc. | Automatic load balancing for heterogeneous cores |
| US8566559B2 (en) * | 2011-10-10 | 2013-10-22 | Microsoft Corporation | Runtime type identification of native heap allocations |
| EP2812802A4 (en) * | 2012-02-08 | 2016-04-27 | Intel Corp | DYNAMIC CPU GPU LOAD BALANCING USING POWER |
| JP2014102683A (ja) * | 2012-11-20 | 2014-06-05 | Fujitsu Ltd | 情報処理装置の制御プログラム、情報処理装置の制御方法および情報処理装置 |
| US9235801B2 (en) * | 2013-03-15 | 2016-01-12 | Citrix Systems, Inc. | Managing computer server capacity |
| WO2015150342A1 (en) * | 2014-03-30 | 2015-10-08 | Universiteit Gent | Program execution on heterogeneous platform |
2017
- 2017-04-28 DE DE102017109239.0A patent/DE102017109239A1/de not_active Withdrawn
2018
- 2018-04-27 CN CN201880016139.1A patent/CN110383247B/zh active Active
- 2018-04-27 US US16/607,263 patent/US11144348B2/en active Active
- 2018-04-27 JP JP2019547267A patent/JP7220914B2/ja active Active
- 2018-04-27 EP EP18720259.3A patent/EP3443458B1/en active Active
- 2018-04-27 WO PCT/EP2018/060932 patent/WO2018197695A1/en not_active Ceased
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2017016590A1 (en) * | 2015-07-27 | 2017-02-02 | Hewlett-Packard Development Company, L P | Scheduling heterogenous processors |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP4227795A1 (en) | 2022-02-15 | 2023-08-16 | ILNumerics GmbH | A computer-implemented method and a computer-readable medium |
| US12254296B2 (en) | 2022-02-15 | 2025-03-18 | Ilnumerics Gmbh | Computer-implemented method and a computer-readable medium |
| EP4465167A1 (en) | 2023-05-15 | 2024-11-20 | ILNumerics GmbH | A computer-implemented method and a computer-readable medium |
| WO2024235717A1 (en) | 2023-05-15 | 2024-11-21 | Ilnumerics Gmbh | A computer-implemented method and a computer-readable medium |
Also Published As
| Publication number | Publication date |
|---|---|
| JP7220914B2 (ja) | 2023-02-13 |
| DE102017109239A1 (de) | 2018-10-31 |
| EP3443458C0 (en) | 2024-02-28 |
| JP2020518881A (ja) | 2020-06-25 |
| US11144348B2 (en) | 2021-10-12 |
| US20200301736A1 (en) | 2020-09-24 |
| EP3443458B1 (en) | 2024-02-28 |
| CN110383247A (zh) | 2019-10-25 |
| CN110383247B (zh) | 2023-04-28 |
| EP3443458A1 (en) | 2019-02-20 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP3443458B1 (en) | A computer-implemented method, a computer-readable medium and a heterogeneous computing system | |
| Planas et al. | Self-adaptive OmpSs tasks in heterogeneous environments | |
| US11385931B2 (en) | Method, electronic device, and computer program product for processing computing job | |
| Ben-Nun et al. | Memory access patterns: The missing piece of the multi-GPU puzzle | |
| Lalami et al. | GPU implementation of the branch and bound method for knapsack problems | |
| CN112148472A (zh) | 用于提高执行软件的异构系统的利用率的方法和装置 | |
| WO2015099562A1 (en) | Methods and apparatus for data-parallel execution of operations on segmented arrays | |
| US11960982B1 (en) | System and method of determining and executing deep tensor columns in neural networks | |
| JP2019049843A (ja) | 実行ノード選定プログラム、実行ノード選定方法及び情報処理装置 | |
| Fraguela et al. | Optimization techniques for efficient HTA programs | |
| Zhang et al. | Optimizing the Barnes-Hut algorithm in UPC | |
| Papadimitriou et al. | Multiple-tasks on multiple-devices (MTMD): exploiting concurrency in heterogeneous managed runtimes | |
| Liu et al. | swTVM: exploring the automated compilation for deep learning on sunway architecture | |
| US12254296B2 (en) | Computer-implemented method and a computer-readable medium | |
| Brezany | Input/output intensive massively parallel computing: language support, automatic parallelization, advanced optimization, and runtime systems | |
| Lebedev et al. | Automatic parallelization of affine programs for distributed memory systems | |
| Hutter et al. | ParaTreeT: A Fast, General Framework for Spatial Tree Traversal | |
| US20250370768A1 (en) | A computer-implemented method and a computer-readable medium | |
| Planas et al. | Selection of task implementations in the Nanos++ runtime | |
| Schnetter | Performance and optimization abstractions for large scale heterogeneous systems in the cactus/chemora framework | |
| Khan et al. | GPU Native Computation of Scalable Tensor Programs | |
| KR20260001464A (ko) | 프로세싱 장치를 이용한 연산 및 통신을 위한 시스템 및 방법 | |
| Li | Facilitating emerging applications on many-core processors | |
| Reddy | High-Throughput Data Structures for GPU-Accelerated Computing | |
| Hong | Code Optimization on GPUs |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WWE | Wipo information: entry into national phase | Ref document number: 2018720259; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2018720259; Country of ref document: EP; Effective date: 20181116 |
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18720259; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2019547267; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |