WO2010121228A2 - System, methods and apparatus for program optimization for multithreaded processor architectures - Google Patents

System, methods and apparatus for program optimization for multithreaded processor architectures

Info

Publication number
WO2010121228A2
Authority
WO
WIPO (PCT)
Prior art keywords
computing apparatus
execution
memory
program
loop
Prior art date
Application number
PCT/US2010/031524
Other languages
English (en)
Other versions
WO2010121228A3 (fr)
Inventor
Allen K. Leung
Benoit Meister
Nicolas T. Vasilache
David E. Wohlford
Cedric Bastoul
Peter Szilagyi
Richard A. Lethin
Original Assignee
Reservoir Labs, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US12/561,152 external-priority patent/US8572590B2/en
Application filed by Reservoir Labs, Inc. filed Critical Reservoir Labs, Inc.
Publication of WO2010121228A2 publication Critical patent/WO2010121228A2/fr
Publication of WO2010121228A3 publication Critical patent/WO2010121228A3/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/456Parallelism detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation

Definitions

  • The present invention generally concerns computer programming. More particularly, the invention concerns a system, methods, and apparatus for source code compilation.

Background of the Invention
  • compilers are responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena occur and interact simultaneously; this requires the optimizer to combine multiple program transformations. For instance, there is often a tradeoff between exploiting parallelism and exploiting locality to reduce the ever widening disparity between memory bandwidth and the frequency of processors: the memory wall. Indeed, the speed and bandwidth of the memory subsystems have always been a bottleneck, which worsens when going to multi-core. This memory wall is further exacerbated by non-contiguous memory accesses.
  • the polyhedral model is a powerful framework to unify coarse-grained and fine-grained parallelism extraction with locality and communication contiguity optimizations. To date, this promise has not yet been completely fulfilled as no existing affine scheduling, fusion and communication contiguity technique can perform all these optimizations in a unified (i.e., non-phase ordered) and unbiased manner.
  • parallelism optimization algorithms optimize for degrees of parallelism, but cannot be used to optimize both locality and contiguity of communications. In like manner, algorithms used for locality optimization cannot be used both for extracting parallelism and optimizing the contiguity of communications. Additional difficulties arise when optimizing source code for the particular architecture of a target computing apparatus.
  • the present invention provides a system, apparatus and methods for overcoming some of the difficulties presented above.
  • Various embodiments of the present invention provide a method, apparatus, and computer software product for optimization of a computer program on a first computing apparatus for execution on a second computing apparatus.
  • computer program source code is received into a memory on a first computing apparatus.
  • the first computing apparatus' processor contains at least one multi-stage execution unit.
  • the source code contains at least one arbitrary loop nest.
  • the provided method produces program code that is optimized for execution on a second computing apparatus.
  • the second computing apparatus contains at least two multi-stage execution units. With these units there is an opportunity for parallel operations.
  • the first computing apparatus takes into account the opportunity for parallel operations and locality and analyses the tradeoff of execution costs between parallel execution and serial execution on the second computing apparatus.
  • the first computing apparatus minimizes the total costs and produces code that is optimized for execution on the second computing apparatus.
  • a custom computing apparatus contains a storage medium, such as a hard disk or solid state drive, a memory, such as a Random Access Memory (RAM), and at least one processor.
  • the at least one processor contains at least one multi-stage execution unit.
  • the storage medium is customized to contain a set of processor executable instructions that, when executed by the at least one processor, configure the custom computing apparatus to optimize source code for execution on a second computing apparatus.
  • the second computing apparatus in this embodiment, is configured with at least two multi-stage execution units. This configuration allows the execution of some tasks in parallel, across the at least two execution units and others in serial on a single execution unit.
  • the at least one processor takes into account the tradeoff between the cost of parallel operations on the second computing apparatus and the cost of serial operations on a single multi-stage execution unit in the second computing apparatus.
  • a computer software product contains a computer readable medium, such as a CDROM or DVD medium.
  • the computer readable medium contains a set of processor executable instructions, that when executed by a multistage processor within a first computing apparatus configure the first computing apparatus to optimize computer program source code for execution on a second computing apparatus.
  • the second computing apparatus contains at least two execution units. With at least two execution units there is an opportunity for parallel operations.
  • the configuration of the first computing apparatus includes a configuration to receive computer source code in a memory on the first computing apparatus and to optimize the costs of parallel execution and serial execution of tasks within the program, when executed on the second computing apparatus. The configuration minimizes these execution costs and produces program code that is optimized for execution on the second computing apparatus.
  • FIG. 1 is an overview of an exemplary compiler architecture consistent with provided embodiments
  • FIG. 2 illustrates the operational flow of one embodiment of a provided local memory compaction module
  • FIG. 3 illustrates the operational flow of another provided local memory compaction module, in which array references are partitioned into groups and algebraic and geometric data re-indexing functions are computed;
  • FIG. 4 illustrates the operational flow of an additional local memory compaction module in which inefficiencies in memory usage are determined using lattices of integer points
  • FIG. 5 illustrates the operational flow of an additional local memory compaction module for reducing the inefficiencies in local memory usage by extracting representative array references and producing re-indexing functions using Hermite factorizations;
  • FIG. 6 illustrates the operational flow of an additional local memory compaction module for computing data re-indexing functions by producing linear constraints and solving a series of linear programming problems.
  • FIG. 7 illustrates the operational flow of an additional local memory compaction module for computing data re-indexing functions by finding a prism of triangular base that encloses the accessed data set and reducing the memory requirements of the enclosed data region by transforming the data elements that lie within a subset of the prism of triangular base.
  • FIG. 8 illustrates the operational flow of an additional local memory compaction module using data re-indexing information to produce abstract communication commands and schedule computations and communications for the program in such a way that their executions overlap;
  • FIG. 9 illustrates a computing apparatus and computer software product consistent with provided embodiments
  • FIG. 10 illustrates a computer network and a computing apparatus consistent with provided embodiments
  • FIG. 11 illustrates processors with multi-stage execution units
  • FIG. 12 illustrates a processor with multiple multi-stage execution units
  • FIG. 13 illustrates an embodiment of a provided method
  • FIG. 14 illustrates an embodiment of a provided method
  • FIG. 15 illustrates an embodiment of a provided method
  • FIG. 16 illustrates an embodiment of a provided method
  • FIG. 17 illustrates an embodiment of a provided method
  • FIG. 18 illustrates an embodiment of a provided method
  • FIG. 19(a) and 19(b) illustrate an embodiment of a provided method
  • FIG. 20 illustrates an embodiment of a provided method
  • FIGs. 21 (a) and 21 (b) illustrate an embodiment of a provided method
  • FIG. 22 illustrates an embodiment of a provided method
  • FIG. 23 illustrates embodiments of provided methods
  • FIG. 24 illustrates other embodiments of provided methods
  • FIGs. 25a and 25b illustrate other embodiments of provided methods.
  • FIGs. 26a and 26b illustrate other embodiments of provided methods.
  • Compilers are responsible for translating the abstract operational semantics of the source program, i.e., a text description of what the program's execution is supposed to perform, into an executable form that makes efficient use of a highly complex heterogeneous machine.
  • Multiple architectural phenomena occur and interact simultaneously within the targeted computer during the execution of the program; this requires the optimizing compiler to combine multiple program transformations in order to define a program execution that takes advantage of those architectural phenomena.
  • In multi-core computers there is often a trade-off between exploiting more processing elements simultaneously (parallelism) and exploiting data access locality to reduce memory traffic. Indeed, the speed and bandwidth of the memory subsystems are almost always a bottleneck. The problem is typically worse for multi-core computers. Since, in traditional compilers, optimization problems are associated with huge and unstructured search spaces, this combinational task is poorly achieved in general, resulting in poor scalability and disappointing sustained performance of the supposedly optimized program.
  • a compiler performs program transformations to the code with respect to different, sometimes conflicting, performance criteria. Any program transformation must ultimately respect the dependence relations in order to guarantee the correct execution of the program.
  • a class of transformations targeting the loop nests of a program (such as "DO" loops in the FORTRAN language, and “for” and “while” loops in languages derived from the C language) are known to account for the most compute intensive parts of many programs.
  • the polyhedral model is a representation of a program's structure particularly suited for expressing complex sequences of loop nests, complex sequences of loop nest transformations, and other relevant information such as for instance dependences, communications, and array layouts.
  • a polyhedron is defined as a set of points verifying a set of affine inequalities and equalities on a number of variables.
  • There exist alternate but equivalent definitions for polyhedrons, such as the one based on a combination of vertices, rays and lines proposed by Minkowski.
  • There are also alternate representations, often based on the alternate definitions. While the present disclosure teaches using one of those definitions and representations to illustrate the various embodiments, various embodiments are in no way restricted to a particular definition or representation.
  • a polyhedral domain is defined as a finite union of polyhedrons.
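In standard notation (a textbook formulation, not specific to this disclosure), a polyhedron P and a polyhedral domain D can be written as:

    P = \{\, x \in \mathbb{Q}^n \mid A x + b \ge 0 \,\}, \qquad D = \bigcup_{k=1}^{m} P_k

where A is a constant integer matrix and b a constant integer vector.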
  • One of the main interests in using polyhedral domains is that they provide a precise representation of sets and relations among sets, on which many optimization problems can be phrased and solved using a rich set of algorithms, which are mostly available in the literature.
  • Some embodiments of the sets in question represent loop iterations, mono- and multi-dimensional data sets, sets of processing elements, data transfers, synchronizations, and dependences.
  • essential characteristics of the execution of a program can be summarized into compact mathematical objects, polyhedrons, which can be manipulated and transcribed into an executable program that has desired execution properties.
  • the set of iterations of such a loop nest depends directly upon the value of the parameters.
  • the parametric domain that represents the set of iterations is called a "parametric iteration domain".
  • the values of the loop counters are integers.
  • the set of values of i and j also lies on a regular lattice of integer points (the standard lattice Z² in the current example).
  • An exemplary embodiment of such a system is illustrated in FIG. 1, where it is described as being part of a compiler.
  • Flow of the exemplary embodiment starts in block 1, where the compiler is processing a program.
  • the compiler analyzes the program to decide if there are portions of the program that should be optimized and mapped using a polyhedral system. If so, the flow goes to block 2, where the compiler provides the system of optimizing components with a polyhedral representation of a portion of the program to be optimized. If not, the compiler continues to process the program without using the system of optimizing components and completes.
  • the components of the system are in charge of a part of the global optimization of the input program.
  • the polyhedral representation of the input code is analyzed in block 2 to produce dependence information.
  • Flow continues in block 3 where such information is used in a local memory compaction component or module that modifies array layouts in a way that removes some dependencies, schedules loop iterations in a way that exposes loops that scan independent iterations, and schedules the execution of operations that use the same data to be executed within a close time interval.
  • the flow continues to decision block 5, which decides which block is next in the flow as a function of the target machine. If the target machine requires the execution of explicit communication commands to transfer data to and from its processing elements, flow goes to block 6, where the representation of the program thus modified is then processed by a series of optimizing modules which define a new layout for data that is transferred to a processing element's local memory. Otherwise, the flow goes to block 9.
  • the flow continues to block 9, where an optimizing component processes the polyhedral representation of the program obtained from the previous components by inserting a polyhedral representation of synchronization operations, which ensure that the execution of the modified program produces the same results or similar results as the original input program.
  • the flow of the exemplary embodiment then goes to block 11, where an optimizing component partitions the tasks into subtasks whose execution reduces traffic between the processing elements' memories and their registers.
  • the polyhedral representation of the optimized program is transformed by a polyhedral code generation component into a representation (Abstract Syntax Tree, high-level language code, or a compiler's internal representation) that can be either processed by a compiler or processed further by the user.
  • the flow continues back to block 1, where it may cycle again through the whole flow if there is more code to be optimized using the system of optimizing components.
  • the optimizing components or modules comprise processor executable code that when executed by a processor, convert source code into other forms of source code, or in some instances machine code.
  • various modules may be implemented in hardware such as monolithic circuits, Application Specific Integrated Circuits (ASIC), or Field Programmable Gate Arrays (FPGA). These modules may comprise software, hardware, firmware, or a combination of these implementations. It is important to note that while various embodiments are illustrated in specific programming languages, these illustrations are mere examples and the scope is therefore not limited to any particular programming language.
  • FIG. 2 illustrates the flow of a provided method for local memory compaction.
  • Flow begins in block 10 where source code is received into memory.
  • the source code represents loops with arbitrary parametric affine iteration domain and contains at least one array reference.
  • An array reference is an operation that represents an access, typically a read or a write, to an array. Such a reference represents, either explicitly or implicitly, for instance by using programming language conventions, a function to retrieve the memory address of an element of the array.
  • that function is typically a direct or indirect function of the loop indices and of some loop-constant values.
  • arrays are typically referenced through mono- and multi-dimensional affine functions of some input values.
  • Array sizes implicitly define the address of an array element as a function of the input values of references to the array. For example, declaring "char A[100][200]" allocates an array of 20000 elements (100x200), named A, and defines that for any two integer values x and y, the memory address of the element of A referenced through A[x][y] is b+200x+y, where b is a value called the "base address" of array A.
  • b is constant for each array and is determined at some point in the compilation process.
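To make the addressing rule concrete, here is a minimal C sketch (illustrative only; the values of x and y are arbitrary):

    #include <assert.h>

    int main(void) {
        char A[100][200];           /* 20000 elements                      */
        char *b = &A[0][0];         /* the "base address" of array A       */
        int x = 3, y = 7;
        /* The language's A[x][y] and the manual address b + 200*x + y
           designate the same element.                                     */
        assert(&A[x][y] == b + 200 * x + y);
        return 0;
    }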
  • Flow continues to block 20 where inefficiencies in memory usage in the at least one array are identified. In one embodiment, the inefficiencies are related to access and memory footprint of the array. Flow then continues to block 30 where at least one local array is allocated, and in block 40 a portion of the array with inefficient memory usage is mapped into the local array.
  • the mapping portion of the module outputs code that is more efficient than the original code in terms of the memory size requirements of the local array versus the original array.
  • the accessed data is arbitrarily complex.
  • the mapping produces a piecewise affine index function for the local arrays.
  • Other embodiments include the rendering of a visualization of the optimized code on a monitor.
  • Arrays are typically allocated sets of contiguous memory blocks. Some loop operations may access only portions of the allocated memory.
  • 90,000 contiguous memory blocks are allocated, but only 100 are accessed in this operation.
  • access to the array is not contiguous, but contains gaps, and thus will have less than optimal locality.
  • the original data layout (and array size) in a remote processor is extremely inefficient.
  • the local memory cannot hold the entirety of the array and the program cannot be executed properly.
  • One embodiment of a provided method, illustrated in FIG. 2, would map this code fragment into a local array with 100 elements.
  • An exemplary mapping would produce the following pseudo-code fragment, in which the storage requirement of a local array is reduced from 300x300 elements to the optimal 100 elements.
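The original pseudo-code fragment is not reproduced in this text; the following hedged sketch, which assumes a diagonal-like access pattern purely for illustration, shows the shape of such a mapping:

    /* Before compaction: 90,000 elements allocated, only 100 accessed.   */
    double A[300][300];
    for (int i = 0; i < 100; i++)
        use(A[2 * i][3 * i]);       /* hypothetical strided access        */

    /* After local memory compaction: the local array holds exactly the
       100 elements that are accessed; A_local[i] stands for A[2*i][3*i]. */
    double A_local[100];
    for (int i = 0; i < 100; i++)
        use(A_local[i]);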
  • One feature of this embodiment is that it provides a method of compacting local memory in a computing apparatus.
  • This method provides a more efficient memory structure in terms of both access to the elements and the amount of memory occupied by the data that is actually accessed. The memory requirements are reduced from the initial allocation to an allocation that is large enough to contain the data that is actually used in the operations.
  • the provided method handles loops whose iteration domains are non-rectangular, and loops that have a parametric iteration domain. In this document we refer to polyhedral iteration domains that are either non-rectangular or parametric or both as "arbitrary parametric iteration domains".
  • the provided methods handle non-convex accessed data sets.
  • Imaging applications typically utilize significant multidimensional arrays where data representations of physical objects and systems are stored. Many image processing steps, such as discrete wavelet transforms for example, only utilize discrete portions of the stored data. In these situations, various embodiments provide significant optimizations to local data storage.
  • mapping block 40 includes partitioning references to form compatible references in block 50; determining a relation within compatible references in block 60; grouping compatible references based on the relation in block 70; performing algebraic simplification in block 80; and performing geometric arrangement through re-indexing the elements of the local array in block 90.
  • the set of references partitioned are references that access a portion of the array. The following pseudo-code example illustrates this embodiment.
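A hedged illustration (the patent's own example is not reproduced here): the two references below never access the same elements, so each forms its own group and can be laid out independently:

    for (int i = 0; i < 100; i++) {
        A[i] = f(i);                /* footprint {0 .. 99}        */
        g(A[i + 1000]);             /* footprint {1000 .. 1099}   */
    }
    /* The footprints are disjoint, so the two references are never
       forced into the same set of compatible references, and each
       set can receive its own local array.                        */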
  • Performing transformations of the way data are allocated in memory, i.e., transforming the data layouts, has a combinational aspect, since the data sets accessed through each array reference may overlap with one or more data sets accessed by other array references. Since each one of those overlaps entails constraints in the way that data layouts can be transformed, analyzing all the combinations of overlaps for all the references is a source of high computational complexity. Hence, references are grouped into sets in such a way that data accessed through one set of references does not overlap data accessed through another set of references. In this embodiment, references of the same set are called "compatible references". Since there is no overlap among sets of compatible references, the following parts of the memory layout transformation, which consider the overlaps, can be applied independently to each set of compatible references. In particular, they will decide if the overlapping data sets accessed by a set of compatible references should be partitioned further and how.
  • compatible references are identified by overlapping memory footprints during the execution of a particular subset of loop iterations.
  • the provided method identifies array references having overlapping memory footprints; duplicates a portion of the identified references; and associates each of the duplicates with disjoint subsets of the memory footprint.
  • An example pseudo-code illustrates this embodiment.
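The example itself is missing from this text; a hypothetical loop containing the two overlapping references discussed next is:

    /* Hypothetical loop body: for a fixed i, the references A[i][j]
       and A[j][i] overlap only at the center element A[i][i].       */
    for (int j = 0; j < 99; j++)
        B[j] = A[i][j] + A[j][i];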
  • the two references A[i][j] and A[j][i] overlap when i = j.
  • If the references are allocated together, it is impossible to reduce the local memory usage using only affine transformations. This is because the data footprint of the two references is a 2-dimensional set (a cross), while the data footprints of the individual references are both 1-dimensional.
  • one embodiment first estimates how much overlapping is in the references. If the references are read-only, and if the overlapping data set is a small percentage of the overall data set, the embodiment splits the references into two distinct references to one-dimensional data sets. In the above example, the embodiment will generate the following local memory allocation. Note that the center element of the data footprint, A[i][i], has been replicated and put into the locations A_1[i] and A_2[i].
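A sketch of the resulting allocation, under the assumption that the loop body reads row i and column i of A (the names A_1 and A_2 follow the text above):

    /* The cross-shaped footprint is split into two 1-dimensional local
       arrays; the center element A[i][i] is replicated into both
       A_1[i] and A_2[i].                                              */
    for (int j = 0; j < 99; j++) {
        A_1[j] = A[i][j];           /* row i of A     */
        A_2[j] = A[j][i];           /* column i of A  */
    }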
  • the geometric re-arrangements provided by a further exemplary embodiment are defined by a piecewise affine transformation.
  • the transformation applied to the references is defined as a set of functions, each element of the set being valid within a polyhedral domain of the loop values, the parameters and the coordinates of the data accessed through the set of compatible references.
  • the written data subset and a subset of the data set that is only read define a partition for the piecewise affine transformation.
  • the data set accessed by both references to array A forms a two-dimensional set, while the data sets accessed through each reference are one-dimensional.
  • the data accessed through both references overlap in A[i][i].
  • a piecewise transformation of A is applied, which separates A into two subsets, one for each one-dimensional data set, and marks one of them as receiving the updates (let us call it the "writing reference") to the duplicated data.
  • Such a partition of the iteration domain is obtained by defining the iterations accessing duplicate data through "non-writing" references and replacing those accesses with an access through the writing reference.
  • the result of the piecewise affine transformation can be represented by the following pseudo-code, which uses only two arrays as a replacement for the original array A, has quasi-optimal memory requirements (198 memory cells, while the optimal would be 197):
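The pseudo-code itself is not preserved in this text; a hedged reconstruction consistent with the example above and the quoted figures (198 cells used, 197 optimal) might be:

    /* A is replaced by the two arrays A_1 and A_2; the duplicated
       element A[i][i] is read through the writing reference A_1.  */
    double A_1[99], A_2[99];        /* 99 + 99 = 198 cells; the
                                       optimum (the cross) is 197  */
    for (int j = 0; j < 99; j++)
        B[j] = A_1[j] + ((j == i) ? A_1[j] : A_2[j]);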
  • the geometric rearrangement is a piecewise affine transformation that defines a partition of the iteration domain and of the data sets in such a way that the number of references to a local array varies from one element of the partition to another.
  • the possible values of variable i are {0 ≤ i ≤ 99900}.
  • the maximum amount of memory that has to be allocated for any value of i is 200 memory cells (as compared to 10000), and it is 100+i when i is less than 100.
  • the resulting transformation can be represented as pseudo-code as follows:
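The pseudo-code announced above is missing from this text; the following hedged sketch is consistent with the figures given (0 ≤ i ≤ 99900; at most 200 local cells; 100+i cells when i < 100), assuming each task reads A[j] and A[i+j] for j in 0..99:

    if (i < 100) {
        /* the footprints {0..99} and {i..i+99} overlap:
           one local array of 100 + i cells suffices               */
        double A_l[100 + i];        /* A_l[j] stands for A[j]      */
        /* ... tasks read A_l[j] and A_l[i + j] ...                */
    } else {
        /* disjoint footprints: two local arrays of 100 cells each */
        double A_l1[100];           /* A_l1[j] stands for A[j]     */
        double A_l2[100];           /* A_l2[j] stands for A[i+j]   */
        /* ... tasks read A_l1[j] and A_l2[j] ...                  */
    }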
  • the partition of i is obtained by computing the domain in which both data sets intersect, and by projecting the intersection onto the vector space of the parameters (in the current example, the parameter is i and the projected domain is {i < 100}).
  • The operational flow of a further provided embodiment of a local memory compaction module is illustrated in FIG. 4. In this embodiment, flow begins at block 10 where source code is received in memory. Similar to the above embodiment, the source code represents loops with arbitrary parametric affine iteration domains and contains at least one array reference.
  • the identification of inefficiencies includes block 100 where strides in the polyhedral domain that defines the accessed dataset are identified, and block 110 where a lattice of integer points within the domain is extracted from the domain. These integer points indicate that only a regular subset of the accessed data region is accessed. In this manner, more efficient allocation of local arrays is accomplished because portions of the array that are not accessed are identified and excluded from the mapping from the array to the local array.
  • An additional provided embodiment is illustrated in FIG. 5.
  • flow begins at block 10 where source code is received in memory. Similar to the above embodiment, the source code represents loops with arbitrary parametric affine iteration domain and contains at least one array reference.
  • Flow continues to block 20 where inefficiencies in memory usage in the at least one array are identified.
  • Flow then continues to block 30 where at least one local array is allocated, and in block 40 a portion of the array with inefficient memory usage is mapped into the local array.
  • mapping block 40 includes partitioning references to form compatible references in block 50; determining a relation within compatible references in block 60; grouping compatible references based on the relation in block 70; performing algebraic simplification in block 80; and performing geometric arrangement in block 90.
  • the algebraic simplification block 80 includes block 130 where a representative array reference is extracted from a set of references accessing a portion of the array.
  • the representative array reference represents a set of references which access polyhedral datasets whose accessed points all lie on a lattice of integer points that is not the standard lattice, on which any integer point lies.
  • One purpose of Hermite factorization is to reduce the dimension of the reference to the actual geometric dimension of the data footprint.
  • When the access pattern contains strides, i.e., regular intervals between accessed data, using the non-unimodular matrix that results from the Hermite factorization in the transformation removes these strides in the resulting local references. For example, given an affine access function f(x, y) on loop indices x and parameters y, we first decompose it into the sum g(x) + h(y), where g(x) is a linear function of x and h(y) is an affine function of y.
  • Hermite factorizations have many uses in lattice computations. The Hermite factorization of an integer matrix G writes G as the product of two matrices H and U, where H, called the "Hermite normal form", is a canonical representation of the lattice (also) represented by G, and U is a unimodular matrix.
  • A unimodular matrix, when used as a transformation, always transforms any point that has integer coordinates into another point that has integer coordinates. Also, any point that has integer coordinates can be obtained by transforming a point with integer coordinates using a unimodular transformation. This is important since most programming language conventions enforce that data elements, and particularly array elements, must have integer coordinates.
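A one-dimensional illustration of stride removal (hypothetical; here the access matrix is the 1x1 matrix G = (2), whose Hermite normal form is H = (2) with unimodular factor U = (1)):

    for (int i = 0; i < 100; i++)
        use(A[2 * i]);              /* footprint spans 200 cells,
                                       only 100 are accessed       */

    /* after factoring out the stride: */
    for (int i = 0; i < 100; i++)
        use(A_l[i]);                /* A_l[i] stands for A[2*i];
                                       footprint is 100 cells      */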
  • mapping block 40 includes partitioning references to form compatible references in block 50; determining a relation within compatible references in block 60; grouping compatible references based on the relation in block 70; performing algebraic simplification in block 80; and performing geometric arrangement in block 90.
  • Geometric rearrangement 90 contains blocks 150 where linear constraints are formed, block 160 where sets of linear programming problems are formed from the linear constraints and solved, and block 170 where a data reindexing is computed. In some embodiments, the flow goes back to block 150. In such embodiments, geometric rearrangements are applied iteratively until no reindexing function is found that reduces memory requirements.
  • a parallelotope is a finite polyhedron defined by 2d faces that are pair-wise parallel.
  • a canonical rectangular parallelotope is a parallelotope for which the normal vectors to its faces are either a canonical vector or the negation of a canonical vector. Examples of rectangular parallelotopes are a cube (in a 3-dimensional space) and a rectangle (in a 2-dimensional space).
  • the transformation is a unimodular reindexing of the accessed data that minimizes the size of the smallest canonical rectangular parallelotope that encloses the accessed dataset.
  • this is accomplished by formulating a first set of linear constraints through the use of Farkas Lemma.
  • This first set of linear programming constraints is decomposed dimension by dimension to form a set of integer linear programming problems.
  • This set of problems is then solved to provide the data reindexing function which can then be applied to the at least one local array.
  • Unimodular reindexings transform integer points into integer points. Hence, the convention that data elements have integer coordinates is preserved by such a reindexing.
  • the linear part of the transformation can be represented by a unimodular matrix.
  • Farkas lemma is a basic linear algebra theorem which is often used to obtain, from a set of affine constraints (i.e., inequalities and equalities) on variables with unknown coefficients, constraints that apply to the unknown coefficients themselves.
  • it is used to obtain a set of constraints involving the coefficients of the unimodular data reindexing function (which is represented as a matrix) and the width of the enclosing rectangular parallelotope along each dimension. From those obtained constraints, the method embodiment finds values of the coefficients of the unimodular data reindexing function for which the width is minimal, using integer linear programming. For example, the data set accessed through reference B[i+j][j] in the following pseudo-code can be reindexed so as to occupy only 100 memory cells:
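The pseudo-code itself is not reproduced in this text; the following sketch assumes loop bounds chosen purely for illustration (10 iterations in each loop):

    for (int i = 0; i < 10; i++)
        for (int j = n; j < n + 10; j++)
            use(B[i + j][j]);       /* 100 accessed points (x1,x2) = (i+j, j) */

    /* A unimodular reindexing x1' = x1 - x2, x2' = x2 - n maps the
       accessed set onto a 10 x 10 square, i.e., 100 memory cells:  */
    for (int i = 0; i < 10; i++)
        for (int j = n; j < n + 10; j++)
            use(B_l[i][j - n]);     /* B_l[x1-x2][x2-n] stands for B[x1][x2] */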
  • the coordinates (x1, x2) of the elements of array B accessed by that loop node are defined by the constraints D: {n ≤ x2 ≤ n+10; n ≤ x1 ≤ n+10}.
  • the embodiment finds values of the coefficients of a matrix U such that U is unimodular and the coordinates x'1 and x'2 of the reindexed data are defined by x'1 = U11·x1 + U12·x2 + t1·n + t01 and x'2 = U21·x1 + U22·x2 + t2·n + t02.
  • the set of possible values of the coefficients of U, as well as the possible values of t1, t2, t01 and t02, are defined from the set of constraints D and the constraints that the data (x'1, x'2) are enclosed in a rectangular parallelotope of size (s1, s2), using Farkas lemma. Then, a value for those coefficients is computed for which the size of the smallest enclosing rectangular parallelotope (s1, s2 in our example) is minimal. Those values are computed by solving, dimension by dimension of the data set, an integer linear programming problem.
  • An integer linear programming problem defines a linear function of a set of variables, called the "objective function", whose minimal (or, alternatively, maximal) value over a polyhedral domain called the "feasible set" is sought. Solvers for such problems typically return a polyhedral domain, within the feasible set, for which the value of the objective function is minimal. In the running example, the embodiment finds:
  • the data footprint of the re-indexed array B is now reduced to 100 memory cells, instead of n² + 20n + 100 initially.
  • the unimodular nature of the reindexing matrix U is obtained by forcing U to be triangular and forcing the absolute value of the diagonal elements to be one.
  • the unimodular nature of the reindexing matrix is obtained by composition of an upper triangular unimodular and a lower triangular unimodular matrix. The advantage of that other embodiment is that the class of unimodular reindexing functions produced is not limited to the reindexing functions represented by a triangular matrix.
  • Finding those two matrices is equivalent to reindexing data twice, first by finding an upper triangular reindexing matrix as described above and applying the reindexing, and then by finding a lower triangular reindexing matrix for the reindexed set and by applying that second reindexing.
  • Yet another embodiment produces, in the same way, a unimodular reindexing by composition of an upper triangular unimodular matrix, a permutation matrix and a lower triangular unimodular matrix.
  • the advantage of the embodiment is that the class of reindexing function that can be produced is the whole class of integer unimodular matrices.
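The unimodularity of such compositions follows from basic linear algebra (a general fact, not specific to this disclosure): triangular matrices whose diagonal entries have absolute value one, and permutation matrices, all have determinant ±1, so

    \det(U) = \det(U_{\text{upper}}) \cdot \det(P) \cdot \det(U_{\text{lower}}) = (\pm 1)(\pm 1)(\pm 1) = \pm 1,

hence the product U is again unimodular.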
  • FIG. 7 illustrates another embodiment of a provided method. Like the previous embodiments, flow begins in block 10 where source code is received in memory. Similar to the above embodiment, the source code represents loops with arbitrary parametric affine iteration domain and contains at least one array reference. Flow continues to block 20 where inefficiencies in memory usage in the at least one array are identified. Flow then continues to block 30 where at least one local array is allocated, and in block 40 a portion of the array with inefficient memory usage is mapped into the local array. In this illustration, block 40 contains block 180, where a parallelotope of minimal volume is derived, this parallelotope enclosing the domain of the data set accessed by the local arrays. Block 40 additionally contains block 190, where a finite prism of triangular base is derived.
  • a finite prism is a polyhedron defined by a set of translations of a "base" polyhedron, which lies in a subspace of the considered space, by a finite convex set of linear combinations of vectors of the complementary subspace. Since they are finite, it is possible to characterize the maximum extent of a finite prism along the directions of the complementary subspace. In this document, those extents are called the "heights" of the prism (there is one height along every direction of the complementary subspace).
  • a triangular prism is a prism whose base polyhedron is a triangle. In two dimensions, it is just a triangle. In one embodiment, this finite prism has a minimum volume that encloses the data footprint domain.
  • the prism is compared to the parallelotope.
  • the prism is partitioned into two prisms. One of the two is then transformed using a central symmetry such that the union of the transformed prism and the non-transformed prism has a smaller memory footprint than the enclosing parallelotope.
  • One advantage of that embodiment is that it provides data layouts that have smaller memory requirements, for a class of accessed datasets for which methods based on parallelotopes are not optimal. For instance, the dataset accessed by the program represented by the following pseudo-code through reference B is triangular:
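For instance (a hedged sketch matching the 10 x 10 example discussed below):

    double B[10][10];               /* 100 elements allocated          */
    for (int i = 0; i <= 9; i++)
        for (int j = 0; j <= 9 - i; j++)
            use(B[i][j]);           /* triangular footprint:
                                       x1 + x2 <= 9, 55 elements      */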
  • the embodiment finds three constraints that enclose the accessed data set, in a similar way as in the embodiment depicted in FIG. 6, using the Farkas lemma.
  • the minimal volume for a parallelotope that encloses the dataset would be about twice the volume of the triangle.
  • using such a parallelotope to determine the memory allocation of the dataset is bound to be sub-optimal.
  • the current embodiment, depicted in FIG. 7 defines a tighter enclosing polyhedron using three inequalities (it is then a prism of triangular base).
  • the data set is partitioned in two subsets, say A and B, and subset A is re-indexed in such a way that both the array elements in B and the re-indexed elements are enclosed in a smaller parallelotope than the original parallelotope.
  • the volume of the new parallelotope is about the volume of the prism of triangular base. Since there is a parallelotope of smaller volume enclosing the reindexed data set, its memory requirements are smaller. The result is a piecewise affine array reindexing, which typically partitions the loop iterations into the iterations that access A, and the ones that access B.
  • Let x_w be a point in the intersection of (b) and (c) and let w = a·x_w + a_0.
  • the prism is partitioned into A and B as follows:
  • a point x_0 is defined that is in the domain and whose height within the prism is about half of the height of the prism.
  • Array elements that are defined by A are transformed using a central symmetry of center x_0.
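Central symmetry about x_0 is the standard affine map (general mathematics, stated here for reference); it maps integer points to integer points whenever 2·x_0 is an integer point:

    x \mapsto 2 x_0 - x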
  • the tightest enclosing parallelotope, defined by {0 ≤ x1 ≤ 9; 0 ≤ x2 ≤ 9}, where x1 represents the first dimension of array C and x2 its second dimension, includes 100 array elements.
  • the tightest enclosing triangle, defined by {0 ≤ x1; 0 ≤ x2; x1+x2 ≤ 9}, by comparison, includes 55 elements, which is about half the number of elements required for the enclosing parallelotope. Since the number of array elements in the enclosing triangle is less than the number of array elements in the enclosing parallelotope, the embodiment considers the tightest enclosing triangle and partitions the enclosed data into data subsets A: {0 ≤ x1; 5 ≤ x2; x1+x2 ≤ 9} and B: {0 ≤ x1; 0 ≤ x2 ≤ 4; x1+x2 ≤ 9}.
  • FIG. 8 illustrates a further embodiment of a provided method.
  • flow begins in block 10 where source code is received in memory. Similar to the above embodiment, the source code contains loops with arbitrary parametric affine iteration domain and contains at least one array reference.
  • Flow continues to block 20 where inefficiencies in memory usage in the at least one array are identified. Flow then continues to block 30 where at least one local array is allocated, and in block 40 a portion of the array with inefficient memory usage is mapped into the local array. Flow then continues to block 220 where asynchronous communications and wait operations are generated.
  • the exemplary embodiment uses the mapping between local memory elements and the elements of the original arrays, in conjunction with a description of the data that are needed as input and produced as output of the tasks to be executed, to produce an abstract representation of the transfers between the original arrays and the local memory elements.
  • the generation of these communications and wait operations includes the use of multi-buffering for overlapping communication and computation operations.
  • One embodiment computes a local memory mapping, adds a polyhedral representation of the communications and schedules communications and computations in a multi-buffering scheme for the program represented by the following pseudo-code.
  • every iteration of the k loop works on a distinct instance of local memory:
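The pseudo-code for this example is not preserved in this text. The sketch below illustrates the double-buffering idea (a multi-buffering scheme with two buffers); async_get, async_put, and wait_transfer stand in for the abstract communication commands described above and are hypothetical names, not a real API; TILE and NTILES are assumed constants:

    double buf[2][TILE];                     /* two local instances      */
    async_get(buf[0], &A[0], TILE);          /* prefetch the first tile  */
    for (int k = 0; k < NTILES; k++) {
        int cur = k % 2;                     /* distinct local instance
                                                for every k iteration    */
        if (k + 1 < NTILES)                  /* overlap the next transfer
                                                with this computation     */
            async_get(buf[(k + 1) % 2], &A[(k + 1) * TILE], TILE);
        wait_transfer(cur);                  /* data for tile k is local */
        compute(buf[cur]);
        async_put(&A[k * TILE], buf[cur], TILE);
    }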
  • Computing apparatus 720 includes processor 660, memory 670, storage medium 680, and in some embodiments input port 690 and network interface 710.
  • storage medium 680 contains a set of processor executable instructions that when executed by processor 660 configure computing apparatus 720 to implement the modules and methods described herein.
  • storage medium 680, containing the set of processor executable instructions resides in another computing apparatus 720 across network 730.
  • computer software product 700 is a computer readable storage medium containing processor executable instructions sufficient that when executed by processor 660 configure computing apparatus 720 to implement the above described modules and methods.
  • computer software product in some embodiments consists of a physical medium configured to interface with input port 690 to allow its contents to be copied to storage medium 680.
  • computer software product 700 is an internal storage medium, such as 680.
  • An additional embodiment of computing apparatus 720 includes a plurality of processors 660(a-n), a plurality of memories 670(a-n), a storage medium 680 and in some embodiments input port 690 and network connection 710.
  • one or more processors 660(a-n) is a host, while others are modeled in the form of a grid.
  • Other embodiments of the present invention provide a custom computing apparatus, illustrated in FIG. 10, that is configured to optimize computer source code for operation on a second computing apparatus.
  • first custom computing apparatus 1010(a) is configured to communicate with second computing apparatus 1010(b) across network 1020.
  • a further illustration of computing apparatus 1010 is provided in FIG. 10.
  • custom computing apparatus 1010(a) contains at least one processor 1030(a-n), and a communication port 1040 communicating with the at least one processor 1030(a-n).
  • Custom computing apparatus 1010(a) additionally includes memory 1050, which in some embodiments includes dependence analysis module 1220.
  • Custom computing apparatus 1010(a) in some embodiments, additionally includes drive 1070 configured to accept external storage medium 1080.
  • external storage medium 1080 is a CD, in others a DVD. In these embodiments, drive 1070 is configured to accept the appropriate external storage medium 1080.
  • Custom computing apparatus 1010(a) additionally includes storage medium 1060.
  • Storage medium 1060 in some embodiments is a hard-disk drive, and in others is a solid state drive.
  • storage medium 1060 contains a set of processor executable instructions that when executed by the at least one processor 1030(a-n) configure custom computing apparatus 1010(a) to optimize computer code for execution on computing apparatus 1010(b). While custom computing apparatus 1010(a) and computing apparatus 1010(b) are illustrated in FIG. 10 communicating over network 1020, various embodiments of the invention do not require this inter-computer communication.
  • FIG. 11 illustrates exemplary multi-stage execution units 90.
  • the stages may include instruction fetch, instruction decode, operand address generation, operand fetch, instruction execute, and result store.
  • the stages include instruction fetch, instruction fetch & register decode, execute, memory access and register write-back.
  • one instruction in one stage of the pipeline may attempt to read from a memory location while another instruction is writing to that location. This problem is compounded in the instance of multiple processing cores. Additionally, in multiple processor and/or multiple core architectures, the locality of data to the execution unit attempting access can create significant delays in processing.
  • A further illustration of a multiple execution unit system is depicted in FIG. 12.
  • a first execution unit (Execution Unit 1) is attempting to write to a specific memory location while a second execution unit (Execution Unit 2) is attempting to read from that same location.
  • execution unit 1 executes instructions from a specific memory location
  • execution unit 2 executes instructions from that same location.
  • If both the read and the write occur at the same time, this causes a condition known in the art as a conflicting access, which can significantly impact the speed and the correctness of execution.
  • While it might seem that parallel execution of instructions across multiple execution units and/or processors would always produce an optimal result, this is not the case.
  • Optimization of source code for parallelism may result in code that is poor in terms of locality or communications. In the prior approaches to code optimization, the converse is additionally true.
  • optimization of code for locality can result in poor parallelism and under-utilization of computing resources. It is therefore an object of embodiments of the present invention to provide a customized computing apparatus, methods, and computer software product that simultaneously optimizes a computer program for execution on a particular computing device with multiple execution units. It is another object of the invention to provide embodiments of methods which can explore the complete solution space for legal schedules for potential solutions. It is a further object of the invention to provide methods containing new formulations that encode the tradeoffs between locality and parallelism directly in the constraints and the objective functions of an optimization problem. It is a further object of the invention to automatically generate conditional synchronizations between execution units at different levels in the hierarchy of multiple execution units.
  • the following code example illustrates loop fusion. Given the following code:
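The original listing is not reproduced in this text; a hypothetical pair of loops of the kind described is:

    for (int i = 0; i <= 100; i++)
        a[i] = 1;
    for (int i = 0; i <= 100; i++)
        b[i] = 2;

    /* after loop fusion: */
    for (int i = 0; i <= 100; i++) {
        a[i] = 1;
        b[i] = 2;
    }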
  • The effect of loop fusion is to interleave the execution of the first loop with the execution of the second loop.
  • A consequence of loop fusion is that memory locations a[i] and b[i] referenced by the former two loops are now accessed in an interleaved fashion.
  • In the original program, memory locations were accessed in the order a[0], a[1], ... a[100] then b[0], b[1], ... b[100].
  • In the fused program, the memory locations are now accessed in the order a[0], b[0], a[1], b[1], ... a[100], b[100].
  • Loop fusion can lead to better locality when multiple loops access the same memory locations.
  • Loop fusion can change the order in which memory locations of a program are accessed and require special care to preserve original program semantics:
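A hypothetical example of the care required: in the pair below, the second loop reads a[i+1], so naive fusion would read a value before the first loop has produced it; an enabling transformation such as loop shifting (with prologue/epilogue code) would be needed first:

    for (int i = 0; i < 100; i++)
        a[i] = f(i);
    for (int i = 0; i < 99; i++)
        b[i] = a[i + 1];    /* fused naively, iteration i would read
                               a[i+1] before it is written            */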
  • a loop in a program can be executed in parallel if there are no dependences between its iterations.
  • the first program loop below can be executed in parallel, while the second loop must be executed in sequential order:
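The loops announced above are missing from this text; a hypothetical pair with the stated properties is:

    /* no dependence between iterations: may run in parallel (doall) */
    for (int i = 0; i < 100; i++)
        a[i] = a[i] + 1;

    /* each iteration reads the previous one: must run sequentially  */
    for (int i = 1; i < 100; i++)
        a[i] = a[i - 1] + 1;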
  • Loop permutability is another important property of program optimizations.
  • a set of nested loops is said to be permutable if their order in the loop nest can be interchanged without altering the semantics of the program.
  • loop permutability also means the loops in the permutable set of loops dismiss the same set of dependences. It is also common knowledge that such dependences are forward only when the loops are permutable. This means the multi-dimensional vector of the dependence distances has only non-negative components.
  • the components of these vectors are non-negative for all possible values of i and j. Therefore the loops i and j are permutable and the loop interchange transformation preserves the semantics of the program. If loop interchange is applied, the resulting program is:
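The listings themselves are not preserved here; a hypothetical permutable nest, whose dependence distance vectors (1,0) and (0,1) have only non-negative components, and its interchanged form are:

    for (int i = 1; i < 100; i++)
        for (int j = 1; j < 100; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];

    /* after loop interchange: */
    for (int j = 1; j < 100; j++)
        for (int i = 1; i < 100; i++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];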
  • Loop permutability is important because it allows loop tiling (alternatively named loop blocking).
  • Loop tiling is a transformation that changes the order of the iterations in the program and ensures all the iterations of a tile are executed before any iteration of the next tile.
  • Loop tiling is traditionally performed with respect to tiling hyperplanes.
  • the tiling hyperplanes used are the trivial (i) and (j) hyperplanes.
  • any linearly independent combination of hyperplanes may be used for tiling, provided it does not violate program semantics. For example, (i+j) and (i+2*j) could as well be used and the resulting program would be much more complex.
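A hypothetical 10 x 10 tiling of the interchanged nest above, along the trivial (i) and (j) hyperplanes:

    for (int ii = 1; ii < 100; ii += 10)
        for (int jj = 1; jj < 100; jj += 10)
            for (int i = ii; i < ii + 10 && i < 100; i++)
                for (int j = jj; j < jj + 10 && j < 100; j++)
                    a[i][j] = a[i - 1][j] + a[i][j - 1];
    /* all iterations of one tile execute before any iteration of the
       next tile; legality follows from the permutability of i and j  */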
  • the doall and reduction indicate potential parallelism rather than actual usage of parallelism.
  • doall indicates that a loop may be executed in a data parallel manner
  • reduction indicates that a loop is a reduction, i.e., its order of execution may be permuted.
  • Another important loop transformation is loop skewing. It is common knowledge that loop permutability combined with loop skewing results in the production of parallelism. In the following permutable loops, the inner loop can be executed in parallel after loop skewing:
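The loops announced above are not preserved in this text; a hypothetical skewing of the permutable nest used earlier is (doall is the pseudo-keyword introduced above; max and min are used informally):

    /* skew j by i (t = i + j): both dependence distances now have a
       positive component on t, so for a fixed t the iterations on i
       are independent and can run in parallel (a wavefront)          */
    for (int t = 2; t <= 198; t++)
        doall (int i = max(1, t - 99); i <= min(99, t - 1); i++)
            a[i][t - i] = a[i - 1][t - i] + a[i][t - i - 1];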
  • the skewing transformation helps extract parallelism at the inner level when the loops are permutable. It is also common knowledge that loop tiling and loop skewing can be combined to form parallel tiles that increase the amount of parallelism and decrease the frequency of synchronizations and communications in the program.
  • the memory accesses to the array a are performed in this order a[0][0], a[1][0], ... a[99][0], a[0][1], ... a[99][1], ... a[0][99], ... a[99][99]. If the storage mode of arrays in memory is row major, then successive accesses are not contiguous in memory and may result in very long latencies of accesses.
  • Transformations such as loop permutations and loop skewing result in modified memory access order that can make accesses contiguous. For instance, interchanging loop i and k, assuming row major storage mode:
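A hypothetical kernel consistent with the access orders quoted above (a matrix-matrix product, with loops i and k interchanged):

    /* before: i innermost; successive accesses to a are a[0][0],
       a[1][0], ..., a[99][0], a[0][1], ... (column order, not
       contiguous under row-major storage)                           */
    for (int j = 0; j < 100; j++)
        for (int k = 0; k < 100; k++)
            for (int i = 0; i < 100; i++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];

    /* after interchanging i and k: accesses to a become a[0][0],
       a[0][1], ..., a[0][99], a[1][0], ... (contiguous rows)        */
    for (int j = 0; j < 100; j++)
        for (int i = 0; i < 100; i++)
            for (int k = 0; k < 100; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];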
  • the memory accesses are now a[0][0], a[0][1], ... a[0][99], a[1][0], ... a[1][99], ... a[99][0], ... a[99][99] and the latencies of accesses are generally reduced.
  • All accesses to a, b and c cannot be made contiguous because the global problem is unfeasible. If k is chosen as the innermost loop, accesses to a are contiguous but not accesses to b and c. If j is chosen as the innermost loop, accesses to b and c are contiguous but not accesses to a.
  • the explicit copy from a into a_l may also change the data layout in a_l. It is a further object of this invention to modify the layout such that accesses to a_l are contiguous too during the computation. A permutability of the jj and kk dimensions in the data layout makes the accesses to a_l contiguous in the computation part of the program:
  • the execution model for the target hierarchy of execution units and memories requires explicit copying of the data in a secondary memory. It is a further object of the invention to comply with this requirement by creating explicit copies of the data to the secondary memory.
  • the first step is to assign to each statement in the program an iteration space and an iteration vector.
  • the second step is to identify when two operations may be executed in parallel or when a producer consumer relationship prevents parallelism. This is done by identifying the set of dependences in the program.
  • Affine fusion means not just merging two adjacent loop bodies together into the same loop nests, but also includes loop shifting, loop scaling, loop reversal, loop interchange and loop skewing transformations. This corresponds to general affine scheduling functions that orchestrate the order of operations in the optimized program. In the (α, β, γ) convention this means that we would like to have the ability to modify the linear part of the schedule, α, instead of just β and γ.
  • One embodiment of the invention produces a scheduling function used to assign a partial execution order between the iterations of the operations of the optimized program and to produce the resulting optimized code respecting this partial order.
  • affine fusion gives a superior transformation, as shown above.
  • the fusion-preventing dependencies between the loop nests are broken with a loop reversal rather than loop shifting, and as a result, no prologue and epilogue code is required.
  • the two resulting loop nests are permutable.
  • the tension between fusion and scheduling implies that fusion and scheduling should be solved in a unified manner.
  • a cost w_p which measures the slowdown in execution if the loop is executed sequentially rather than in parallel.
  • for each pair of loops p, q, we estimate u_pq, the cost in performance if the two loops p and q remain unfused.
  • the cost w_p can be interpreted to be the difference between sequential and parallel execution times, and the cost u_pq can be interpreted as the savings due to cache- or communication-based locality.
  • the cost w_p is related to a difference in execution speed between sequential operations of the at least one loop on a single execution unit in the second computing apparatus and parallel operations of the at least one loop on more than one of the at least two execution units in the second computing apparatus.
  • the cost u_pq is related to a difference in execution speed between operations where the pair of loops are executed together on the second computing apparatus, and where the pair of loops are not executed together on the second computing apparatus.
  • Let the Boolean variable Δ_p denote whether the loop p is executed in sequence,
  • and let the variable f_pq denote whether the two loops p and q remain unfused, i.e.,
  • f_pq = 0 means that loops p and q have been fused.
  • the variable Δ_p specifies if the loop is executed in parallel in the optimized program.
  • the variable f_pq specifies if the pair of loops are executed together in the optimized program.
  • the value of the cost w_p is determined by a static evaluation of a model of the execution cost of the instructions in the loop. In another embodiment, the value of the cost w_p is determined through the cost of a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop. In a further embodiment, the value of the cost w_p is determined by an iterative process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop.
  • the value of the cost u_pq is determined by a static evaluation of a model of the execution cost of the instructions in the loop pair. In another embodiment, the value of the cost u_pq is determined through the cost of a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop pair. In a further embodiment, the value of the cost u_pq is determined by an iterative process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop pair.
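These per-loop and per-pair costs feed the global objective that later paragraphs call the linear weighted sum; a plausible form (an assumption for illustration; the precise objective and its constraints are those referenced as (5)-(17) below) is:

    \min \;\; \sum_{p} w_p \, \Delta_p \; + \; \sum_{p,q} u_{pq} \, f_{pq}

where Δ_p = 1 when loop p is executed sequentially and f_pq = 1 when loops p and q remain unfused, subject to the semantics-preserving constraints.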
  • the GDG G = (V, E) is decomposed into strongly connected components (SCCs) and each SCC is considered to be a separate fusible "loop" candidate.
  • G' = (V', E') denotes the SCC-induced subgraph, where V' denotes the SCCs and E' the edges between SCCs.
  • scc(v) denotes the component to which v belongs in the SCC decomposition; a minimal sketch of this decomposition step follows.
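  • As an illustration only (not the embodiments' implementation; the adjacency-matrix representation and the size bound MAXV are placeholder assumptions), the following sketch computes the SCCs of a small dependence graph with Tarjan's algorithm so that each SCC can be treated as a fusible "loop" candidate:

    /* Minimal sketch: SCC decomposition of a GDG G = (V, E). */
    #define MAXV 64
    static int nv;                        /* number of statements (nodes)   */
    static int adj[MAXV][MAXV];           /* adj[v][w] = 1 if edge v -> w   */
    static int idx[MAXV], low[MAXV], onstk[MAXV], scc_id[MAXV];
    static int stk[MAXV], top = -1, counter = 0, nscc = 0;

    static int imin(int a, int b) { return a < b ? a : b; }

    static void strongconnect(int v) {
        idx[v] = low[v] = counter++;
        stk[++top] = v; onstk[v] = 1;
        for (int w = 0; w < nv; w++) {
            if (!adj[v][w]) continue;
            if (idx[w] < 0) { strongconnect(w); low[v] = imin(low[v], low[w]); }
            else if (onstk[w]) low[v] = imin(low[v], idx[w]);
        }
        if (low[v] == idx[v]) {           /* v is the root of an SCC        */
            int w;
            do { w = stk[top--]; onstk[w] = 0; scc_id[w] = nscc; } while (w != v);
            nscc++;                       /* one more fusible "loop"        */
        }
    }

    int scc_decompose(void) {             /* returns the number of SCCs     */
        for (int v = 0; v < nv; v++) idx[v] = -1;
        for (int v = 0; v < nv; v++) if (idx[v] < 0) strongconnect(v);
        return nscc;
    }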
  • Some embodiments additionally specify that a schedule dimension at a given depth must be linearly independent from all schedule dimensions already computed. Such an embodiment computes the linear algebraic kernel of the schedule dimensions found so far.
  • J denotes the linear part of φ_S, the set of schedule dimensions already found, and J⊥ denotes a subspace linearly independent of J; h denotes a candidate schedule dimension.
  • a further embodiment derives a set of linear independence constraints that represent Jh ≠ 0 and does not restrict the search to Jh > 0. Such linear independence constraints may be used to ensure successive schedule dimensions are linearly independent.
  • an embodiment that does not restrict the search to Jh > 0 exhibits an optimization process that can reach any legal multidimensional affine scheduling of the received program, including combinations of loop reversal.
  • the set of conditions preserving semantics is the union of all the constraints of the form
  • in another embodiment, the optimizing search space that encompasses all opportunities in parallelism and locality is the conjunction of all the constraints (5)-(17).
  • the set of affine constraints (12) and (13) is linearized using the affine form of Farkas lemma and is based on at least one strongly connected component of the generalized dependence graph.
  • constraints of the form (12) are used to enforce that dimensions of schedules of loops belonging to the same strongly connected component are permutable.
  • the constraints of the form (13) are used to ensure that dimensions of schedules of loops that are not executed together in the optimized program do not influence each other.
  • the constraints of the form (13) use a large enough constant to ensure that dimensions of schedules of loops that are not executed together in the optimized program do not influence each other.
  • the linear weighted sum can be optimized directly with the use of an integer linear programming mathematical solver such as Cplex.
  • a non-linear optimization function such as a convex function may be optimized with the use of a convex solver such as CSDP. Further embodiments may devise non-continuous optimization functions that may be optimized with a parallel satisfiability solver.
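  • For illustration only, the following sketch enumerates the 0-1 space of parallelism decisions Δ_p and fusion decisions f_pq for a tiny problem and keeps the minimum of the weighted sum; the cost values and the legality predicate are placeholder assumptions standing in for constraints (5)-(17), and a production embodiment would instead hand the same objective to a solver such as CPLEX:

    #include <limits.h>

    enum { NL = 3 };                        /* number of loop candidates        */
    static int w[NL] = {10, 4, 7};          /* slowdown if loop p is sequential */
    static int u[NL][NL] = {{0, 3, 5},      /* u[p][q]: cost if p, q unfused    */
                            {0, 0, 2},
                            {0, 0, 0}};

    /* Placeholder for the semantics-preserving constraints; a real
       embodiment derives these from the generalized dependence graph. */
    static int legal(const int D[NL], int f[NL][NL]) {
        (void)D; (void)f; return 1;
    }

    int best_tradeoff(void) {
        int best = INT_MAX;
        for (int dmask = 0; dmask < (1 << NL); dmask++)
            for (int fmask = 0; fmask < (1 << (NL * (NL - 1) / 2)); fmask++) {
                int D[NL], f[NL][NL] = {{0}}, bit = 0, cost = 0;
                for (int p = 0; p < NL; p++) D[p] = (dmask >> p) & 1;
                for (int p = 0; p < NL; p++)
                    for (int q = p + 1; q < NL; q++, bit++)
                        f[p][q] = (fmask >> bit) & 1;
                if (!legal(D, f)) continue;
                for (int p = 0; p < NL; p++) cost += w[p] * D[p];
                for (int p = 0; p < NL; p++)
                    for (int q = p + 1; q < NL; q++)
                        cost += u[p][q] * f[p][q];
                if (cost < best) best = cost;
            }
        return best;
    }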
  • the affine fusion formulation is a depth-by-depth optimization embodiment.
  • a further embodiment described in FIGS. 19(a), 19(b) and 20 shows a method to derive scheduling functions for a given hardware parallelism and memory hierarchy.
  • a further embodiment described in FIG. 22 shows a method to derive scheduling functions for multiple levels of hardware parallelism and memory hierarchies, more specifically by formulating and optimizing at least one global weighted parametric function for each level of the parallelism and memory hierarchy of the second computing apparatus.
  • the single multidimensional fusion formulation relies on a single multi-dimensional convex affine space. More specifically, an embodiment of such a single multi-dimensional convex affine space assigns variables and relations for loops, loops pairs and dependence edges e at each scheduling dimension k.
  • δ_k(a) denotes the maximal dependence distance for the loop in which statement a resides, in dimension k.
  • L is a loop (SCC) in dimension k
  • scc(a) denotes the strongly connected component index (loop number) in which statement a appears.
  • φ_a^k denotes the schedule of statement a in dimension k; the Boolean variable E_{k,e} is equal to 1 if the schedule at dimension k strictly satisfies e; the Boolean variable p_{k,a} is 0 only if the schedules in dimensions k-1 and k are permutable in the loop in which a resides.
  • if a and b belong to the same loop in dimension k, then
  • the next constraints encode the β component of the schedules.
  • the next set of constraints ensures that all p_{k,a} are identical for all nodes a that belong to the same loop nest.
  • the strong satisfaction variable E_{k,e}, assigned to each schedule dimension k and each edge e of the at least one strongly connected component, is equal to 1 when the schedule difference at dimension k strictly satisfies edge e, and 0 otherwise.
  • the loop permutability Boolean variable p_{k,e} is assigned to each schedule dimension and each edge e of the at least one strongly connected component.
  • the statement permutability Boolean variable p_{k,a} is assigned to each schedule dimension and each statement a of the at least one strongly connected component.
  • constraints of the form (27), (28) and (29) are added to ensure dimensions of schedules of statements linked by a dependence edge in the generalized dependence graph do not influence each other at depth k if the dependence has been strongly satisfied up to depth k-1.
  • constraints of the form (30) and (31) are added to link the strong satisfiability variables to the corresponding loop permutability Boolean variables.
  • constraints of the form (34) to (43) are added to ensure statement permutability Boolean variables are equal for all the statements in the same loop nest in the optimized program.
  • the conjunction of the previous constraints forms a single multi-dimensional convex affine search space of all legal multi-dimensional schedules, which can be traversed exhaustively or searched with a speed-up heuristic for schedules that optimize any global cost function.
  • One example of an embodiment tailored for successive parallelism and locality optimizations is provided for an architecture with coarse grained parallel processors, each of them featuring fine grained parallel execution units such as SIMD vectors.
  • One such architecture is the Intel Pentium E5300.
  • the following example illustrates how an embodiment of the invention computes schedules used to devise multi-level tiling hyperplanes and how a further embodiment of the invention may compute different schedules for different levels of the parallelism and memory hierarchy of the second computing apparatus.
  • the array elements A[i][j][k] are computed by a weighted sum of 7 elements: B[i][j][k], B[i-1][j][k], B[i+1][j][k], B[i][j-1][k], B[i][j+1][k], B[i][j][k-1] and B[i][j][k+1].
  • the array elements B[i][j][k] are computed by a weighted sum of 7 elements of A. The computation is iterated Titer times.
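  • A sketch of the kind of source code this example describes follows (the weights and bounds are placeholder assumptions; only the access pattern matters):

    /* 7-point stencil sketch: S0 computes A from B, S1 computes B from A,
       iterated Titer times; w[0..6] and N are placeholders. */
    #define IDX(i, j, k) (((i) * N + (j)) * N + (k))

    void stencil(int N, int Titer, double *A, double *B) {
        static const double w[7] = {0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1};
        for (int t = 0; t < Titer; t++) {
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    for (int k = 1; k < N - 1; k++)          /* S0 */
                        A[IDX(i, j, k)] =
                            w[0] * B[IDX(i, j, k)]
                          + w[1] * B[IDX(i - 1, j, k)] + w[2] * B[IDX(i + 1, j, k)]
                          + w[3] * B[IDX(i, j - 1, k)] + w[4] * B[IDX(i, j + 1, k)]
                          + w[5] * B[IDX(i, j, k - 1)] + w[6] * B[IDX(i, j, k + 1)];
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    for (int k = 1; k < N - 1; k++)          /* S1 */
                        B[IDX(i, j, k)] =
                            w[0] * A[IDX(i, j, k)]
                          + w[1] * A[IDX(i - 1, j, k)] + w[2] * A[IDX(i + 1, j, k)]
                          + w[3] * A[IDX(i, j - 1, k)] + w[4] * A[IDX(i, j + 1, k)]
                          + w[5] * A[IDX(i, j, k - 1)] + w[6] * A[IDX(i, j, k + 1)];
        }
    }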
  • embodiments may produce the following optimized code in which permutable loops are marked as such.
  • loops have been fused at the innermost level on loop l and the locality is optimized.
  • Loop tiling by tiling factors (16, 8, 8, 1) may be applied to further improve locality and the program would have the following form, where the inner loops m, n, o are permutable.
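  • As a structural sketch only (our illustration: the fused statement instances, shifts and boundary guards computed by the optimizer are abstracted away), the tiled form with tile loops i, j, k, l and permutable intra-tile loops m, n, o may look like:

    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    /* Shape of the fused, tiled nest; not by itself a semantics-preserving
       rewrite of the stencil (the real schedule includes the shifts/skews
       chosen by the optimizer). */
    void tiled_form(int N, int Titer) {
        for (int i = 1; i < N - 1; i += 16)                /* tile loop, factor 16 */
        for (int j = 1; j < N - 1; j += 8)                 /* tile loop, factor 8  */
        for (int k = 1; k < N - 1; k += 8)                 /* tile loop, factor 8  */
        for (int l = 0; l < Titer; l++)                    /* factor 1             */
            for (int m = i; m < MIN(i + 16, N - 1); m++)   /* permutable           */
            for (int n = j; n < MIN(j + 8, N - 1); n++)    /* permutable           */
            for (int o = k; o < MIN(k + 8, N - 1); o++) {  /* permutable           */
                /* fused S0 and S1 instances for iteration (l, m, n, o) */
            }
    }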
  • the loops are fused on all loops i,j,k,l,m,n and o.
  • the program does not take advantage of fine grained parallelism on each processor along the loops m, n and o.
  • Embodiments allow the optimization of another selective tradeoff to express maximal innermost parallelism at the expense of fusion.
  • the selective tradeoff gives a much more important cost to parallelism than locality, and embodiments may find a different schedule for the intra-tile loops that results in a program that may display the following pattern:
  • the innermost doall dimensions may further be exploited to produce vector like instructions while the outermost permutable loops may be skewed to produce multiple dimensions of coarse grained parallelism.
  • the schedules that produce the innermost doall dimensions may be further used to produce another level of multi-level tiling hyperplanes.
  • the resulting code may have the following structure:
  • affine fusion, i.e., fusion combined with other affine transformations, gives a superior transformation, as shown below.
  • the fusion-preventing dependencies between the loop nests are broken with a loop reversal rather than loop shifting, and as a result, no prologue or epilogue code is required.
  • the two resulting loop nests are permutable. In some embodiments, tiling and extraction of one degree of parallelism out of the resulting loop nests is performed.
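  • A minimal sketch of this reversal-enabled fusion on a hypothetical producer/consumer pair (our example, not a listing from the embodiments):

    /* Before: S2 at iteration i reads A[N-1-i]; direct fusion is illegal,
       and loop shifting cannot repair it (the dependence distance varies). */
    void before(int N, double *A, double *B) {
        for (int i = 0; i < N; i++) A[i] = 2.0 * i;              /* S1 */
        for (int i = 0; i < N; i++) B[i] = A[N - 1 - i] + 1.0;   /* S2 */
    }

    /* After reversing S2 (i -> N-1-i): each iteration consumes the value
       produced in the same iteration, so the nests fuse with no prologue
       or epilogue code. */
    void after_reversal(int N, double *A, double *B) {
        for (int i = 0; i < N; i++) {
            A[i] = 2.0 * i;                    /* S1              */
            B[N - 1 - i] = A[i] + 1.0;         /* S2, reversed    */
        }
    }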
  • loop fusion is limited so as not to be too greedy, i.e., aggressive fusion that destroys parallelism should be avoided.
  • fusion that can substantially improve locality may sometimes be preferred over an extra degree of parallelism, if we have already obtained sufficient degrees of parallelism to exploit the hardware resources. For example, given code of the following shape:
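  • A hypothetical stand-in with the relevant character (both nests doall on i and j; arrays and bodies are our assumptions) illustrates the point: fusing sacrifices no needed parallelism and converts producer/consumer reuse of C into cache reuse:

    void unfused(int P, int Q, const double *A, const double *B,
                 double *C, double *D) {
        for (int i = 0; i < P; i++)            /* doall */
            for (int j = 0; j < Q; j++)        /* doall */
                C[i * Q + j] = A[i * Q + j] + B[i * Q + j];
        for (int i = 0; i < P; i++)            /* doall */
            for (int j = 0; j < Q; j++)        /* doall */
                D[i * Q + j] = 2.0 * C[i * Q + j];
    }

    void fused(int P, int Q, const double *A, const double *B,
               double *C, double *D) {
        /* still doall on i (and j): enough parallelism remains, and each
           C element is consumed right after it is produced */
        for (int i = 0; i < P; i++)
            for (int j = 0; j < Q; j++) {
                C[i * Q + j] = A[i * Q + j] + B[i * Q + j];
                D[i * Q + j] = 2.0 * C[i * Q + j];
            }
    }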
  • The flow of provided method 1100 for source code optimization is illustrated in FIG. 13.
  • Flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 1120 where a selective tradeoff of parallelism and locality is created for execution of the code on the second computing apparatus 1010(b).
  • Flow then continues to block 1130 where a scheduling function is produced which optimizes the selective tradeoff.
  • the received program code contains at least one arbitrary loop nest.
  • the custom first computing apparatus 1010(a) contains memory 1050, a storage medium 1060 and at least one processor with a multi-stage execution unit.
  • A provided method 1150 for source code optimization is illustrated in FIG. 14.
  • flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 1120 where the code is optimized in terms of both locality and parallelism for execution on a second computing apparatus 1010(b).
  • the optimization block 1120 additionally includes additional functional blocks.
  • flow begins with block 1160 where an unassigned loop is identified. Flow then continues on two paths.
  • a first cost function is assigned in block 1180. This first cost function is related to a difference in execution speed between parallel and sequential operations of the statements within the loop on second computing apparatus 1010(b).
  • a decision variable is assigned to the loop under consideration, this decision variable indicating whether the loop is to be executed in parallel in the optimized program.
  • the cost is determined through static evaluation of a model of the execution cost of the instructions in the loop under consideration.
  • the cost is determined through a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop under consideration.
  • the cost is determined by an iterative refining process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop under consideration.
  • decision block 1220 it is determined if there are additional unassigned loops.
  • executed together means fused in the sense of the code examples (0032)-(0037). Specifically, "executed together" means that loops that are consecutive in the original program become interleaved in the optimized program. In particular, loops that are not "executed together" in the sense of loop fusion can be executed together on the same processor in the more general sense.
  • flow continues from block 1160 to block 1170 where an unassigned loop pair is identified. Flow then continues to block 1175 where a second cost function is assigned for locality optimization.
  • This second cost function is related to a difference in execution speed between operations where the loops in the pair of loops are executed together on the second computing apparatus, and where the loops in the pair of loops are not executed together on the second computing apparatus.
  • the second cost is determined through static evaluation of a model of the execution cost of the instructions in the at least one loop pair.
  • the second cost is determined through a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair.
  • the cost is determined through an iterative refining process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair. Flow then continues to decision block 1200 where it is determined if additional unassigned loop pairs exist. If additional unassigned loop pairs exist, flow continues back to block 1170 and the process iterates until no additional unassigned loop pairs are found. When decision block 1200 determines no additional loop pairs are present, flow continues to decision block 1220.
  • decision block 1220 If in decision block 1220 it is determined that additional unassigned loops exist, flow continues back to block 1160 and the process iterates until no additional unassigned loops may be identified. Flow then continues to block 1230 where a selective tradeoff is created for locality and parallelism during the execution on second computing apparatus 1010(b). Flow then continues to block 1130 where a scheduling function is produced that optimizes the selective tradeoff. Flow then continues to block 1140 where optimized code is produced.
  • The flow of a further provided embodiment of a method 1240 for source code optimization is illustrated in FIG. 15.
  • flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 1120 where the code is optimized in terms of both locality and parallelism for execution on a second computing apparatus
  • the scheduling function block 1130 additionally includes additional functional blocks.
  • the search space characterizes all parallelism and locality opportunities that meet the conditions of semantic correctness.
  • The flow of a further provided method is illustrated in FIG. 16.
  • This embodiment illustrates alternate embodiments of the flow within blocks 1130 and 1270 in previous embodiments.
  • flow begins in block 1250 where the conditions for semantic correctness of the program are determined.
  • Flow then continues to block 1260 where a search space is derived that meets the conditions for semantic correctness.
  • the search space characterizes all parallelism and locality opportunities that meet the conditions of semantic correctness.
  • block 1270 includes additional functionality.
  • Block 1270 as illustrated contains three independent optimization paths that may be present in any given embodiment.
  • flow begins at block 1300(a) where an element is selected from the search space.
  • flow continues from block 1260 to block 1300(b) where an element is selected from the search space.
  • Flow continues to block 1310(b) where a potential scheduling function is derived for the element.
  • Flow then continues to block 1320(b) where the performance of the potential scheduling function is evaluated.
  • Flow then continues to block 1340 where the search space is refined using the performance of evaluated schedules.
  • in decision block 1330(b) it is determined if additional elements exist in the search space. If additional elements are present, flow continues back to block 1300(b) and the process iterates until no other elements exist in the search space. When no additional elements exist in the search space, flow then continues to block 1370 where the element with the best evaluated performance is selected.
  • flow continues from block 1260 to block 1350 where the tradeoff is directly optimized in the search space with a mathematical problem solver. Flow then continues to block 1360 where an element is selected that is a result of the direct optimization. Flow then continues to block 1320(c) where the performance of the selected element is evaluated. Flow then continues to block 1370 where the element with the best evaluated performance is selected. As illustrated, some embodiments may utilize more than one of these paths in arriving at an optimal solution. From selection block 1370, flow then continues to block 1280 where the scheduling function is derived from the optimized tradeoff. Flow then continues to block 1140 where optimized code is produced.
  • The flow of a further provided embodiment of a method 1380 for optimization of source code on a first custom computing apparatus 1010(a) for execution on a second computing apparatus 1010(b) is illustrated in FIG. 17.
  • flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 1400 where the source code is optimized in terms of both locality and parallelism for execution on a second computing apparatus 1010(b).
  • block 1400 contains additional functional blocks.
  • Flow continues from block 1110 to block 1250 where the conditions for semantic correctness are determined from the received code.
  • Flow then continues to block 1390 where these conditions are represented as a generalized dependence graph. Flow then continues to two paths.
  • flow continues to block 1260 where a search space is derived that meets the conditions for semantic correctness.
  • the search space characterizes all parallelism and locality opportunities that meet the conditions of semantic correctness.
  • Flow then continues to block 1410 where a weighted parametric tradeoff is derived and optimized on the elements of the search space.
  • flow begins with block 1160 where an unassigned loop is identified. Flow then continues on two additional paths.
  • flow continues to block 1180 where a first cost function is assigned. This first cost function is related to a difference in execution speed between parallel and sequential operations of the statements within the identified loop on second computing apparatus 1010(b).
  • a decision variable is assigned to the loop under consideration, this decision variable indicating whether the loop is to be executed in parallel in the optimized program.
  • the cost is determined through static evaluation of a model of the execution cost of the instructions in the loop under consideration.
  • the cost is determined through a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop under consideration.
  • the cost is determined by an iterative refining process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop under consideration.
  • decision block 1220 it is determined if there are additional unassigned loops.
  • an unassigned loop is identified.
  • flow continues to block 1170 where an unassigned loop pair is identified.
  • a second cost function is assigned for locality optimization. This second cost function is related to a difference in execution speed between operations where the loops of the pair of loops are executed together on the second computing apparatus, and where the loops of the pair of loops are not executed together on the second computing apparatus.
  • a decision variable is assigned for locality. This second decision variable specifies whether the loops of the loop pair under consideration are to be executed together in the optimized program.
  • the second cost is determined through static evaluation of a model of the execution cost of the instructions in the at least one loop pair.
  • the second cost is determined through a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair.
  • the cost is determined through an iterative refining process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair. Flow then continues to decision block 1200 where it is determined if additional unassigned loop pairs exist. If additional unassigned loop pairs exist, flow continues back to block 1170 and the process iterates until no additional unassigned loop pairs are found. When decision block 1200 determines no additional loop pairs are present, flow continues to decision block 1220.
  • decision block 1220 If in decision block 1220 it is determined that additional unassigned loops exist, flow continues back to block 1160 and the process iterates until no additional unassigned loops may be identified. Flow then continues to block 1230 where a selective trade-off is created for locality and parallelism during the execution on second computing apparatus 1010(b).
  • The operational flow of a further provided method 1430 for source code optimization is illustrated in FIG. 18.
  • flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 1480 where a level of parallelism and memory hierarchy in the second computing apparatus is selected.
  • Flow then continues to block 1490 where a selective tradeoff for parallelism and locality for execution of that level of hierarchy is created.
  • Flow then continues to block 1450 where tiling hyperplanes are produced based on the scheduling function.
  • flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 1250 where the conditions for semantic correctness are determined for the program.
  • Flow then continues to block 1390 where these conditions are represented as a generalized dependence graph.
  • schedule dimensions may have been found through the methods disclosed in other embodiments.
  • Flow continues to block 1520 where the generalized dependence graph is decomposed into at least one strongly connected component.
  • Flow then continues to block 1530 where a strongly connected component is selected. Flow then continues to a number of independent paths.
  • a set of loop independence constraints is derived and used to ensure that dimensions of schedules of loops that are not executed together do not influence each other. In one embodiment, this set of constraints includes a large enough constant to cancel the effect of constraints on statements that are not executed together in the optimized program.
  • The operational flow of a further provided method 1610 for source code optimization is illustrated in FIG. 20. As with other embodiments, this embodiment may be used in conjunction with other provided methods.
  • flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 1780 which contains additional functionality.
  • Flow continues to block 1250 where the conditions for semantic correctness are determined for the program.
  • Flow then continues to block 1390 where these conditions are represented as a generalized dependence graph.
  • decision block 1620 it is determined if there are additional dimensions to schedule. If there are no additional dimensions, flow continues to block 1760 where a scheduling function is derived and to block 1140 where an optimized program is produced for second computing apparatus 1010(b).
  • flow continues to block 1630 where the generalized dependence graph is decomposed into at least one strongly connected component.
  • Flow continues to block 1640 where a strongly connected component is selected.
  • Flow then continues to block 1650 where affine constraints are derived using the affine form of Farkas lemma, linear independence constraints permutability constraints, and independence constraints are derived as previously discussed.
  • Flow then continues to block 1660 where these constraints are added to the search space.
  • in decision block 1670 it is determined if additional strongly connected components exist. If others exist, flow continues back to 1640 and the process iterates until there are no remaining strongly connected components.
  • When decision block 1670 indicates that there are no remaining strongly connected components, flow continues to block 1730 where a weighted parametric tradeoff function is optimized on the search space. Flow then continues to decision block 1690 where it is determined if new independent permutable schedule dimensions exist. If they exist, flow continues to block 1700 where an existing scheduling dimension is selected. Flow continues to block 1720 where additional constraints are added to the search space for independence and linear independence. From block 1720, flow continues to block 1730 where a weighted parametric tradeoff function is optimized on the search space. Flow then continues back to decision block 1690 and this part of the process iterates until no new independent permutable schedule dimensions are found.
  • The flow of a further provided embodiment of a method 1760 for optimization of source code on a first custom computing apparatus 1010(a) for execution on a second computing apparatus 1010(b) is illustrated in FIGs. 21(a) and 21(b).
  • flow begins in block 1110 where source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • On a first path, flow continues to block 1120 where a selective tradeoff of parallelism and locality for execution of the program on second computing apparatus 1010(b) is created.
  • Flow continues to block 1250 where the conditions for semantic correctness are determined.
  • in block 1770 a single multi-dimensional convex space of all legal schedules is derived. Additional information on block 1770 is provided in FIG. 21(b).
  • flow then continues on three alternate paths.
  • flow continues to block 1790(a) where an element from the search space is selected.
  • Flow then continues to block 1800(a) where a scheduling function is derived for the selected element.
  • Flow then continues to block 1810(a) where the scheduling function is evaluated for its performance on the optimized program.
  • In decision block 1820(a), if it is determined that there are additional elements in the search space, flow continues back to block 1790(a) where another element is selected. The process iterates until no additional elements remain in the search space.
  • An exemplary embodiment of block 1770 is illustrated in FIG. 21(b).
  • flow from block 1250 continues to block 1390 where the conditions for semantic correctness are represented as a generalized dependence graph.
  • Flow continues on two parallel paths. On the first path an edge E is selected from the dependence graph in block 1900.
  • Flow then continues to block 1910 where a strong satisfaction variable is assigned to edge E at dimension K.
  • Block 1910 receives the current dimension K from block 2010.
  • Flow continues from block 1390 to block 1970 where a node N is selected.
  • Flow continues to block 1980 where a statement permutability variable is assigned to node N at dimension K.
  • Block 1980 receives dimension K from block 2010.
  • In decision block 1990, if there are remaining nodes in the dependence graph, flow continues back to block 1970 where another node N is selected. The process iterates until no additional nodes exist in the graph.
  • Block 1950 receives input from blocks 1920 and 1980 and assigns constraints to link edge permutability variable and statement permutability variable at dimension K.
  • The flow of another provided method 2070 for program code optimization is illustrated in FIG. 22.
  • flow begins in block 1110 where program source code is received in memory 1050 on a custom first computing apparatus 1010(a).
  • Flow continues to block 2080 where a level of parallelism and memory hierarchy is selected from the second computing apparatus 1010(b).
  • Flow then continues to block 1780 which is illustrated in FIG. 20 and discussed in detail above.
  • Flow then continues to decision block 2020. If the performance of the scheduling function is not satisfactory for the current level, flow continues to block 2030 where a partial evaluation of the code produced for the current level of parallelism and memory hierarchy is performed and used to iteratively refine the schedule.
  • Flow continues back to block 1780 and iterates until the performance of the schedule is satisfactory for the level.
  • a memory reference to an array A in the program is denoted by F.
  • F is an (r)×(d) matrix with r rows and d columns.
  • any memory reference accessing array A is optimized for contiguity along dimension k of the array.
  • F_s is the (r-1)×d submatrix composed of the (r-1) rows of F excluding row k.
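  • As a concrete illustration (a hypothetical reference, not taken from the embodiments): for a reference A[i][j+k] in a loop nest with iterators (i, j, k), the access function F has r = 2 rows (one per array dimension) and d = 3 columns (one per iterator),

        F = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \end{pmatrix},

    and optimizing for contiguity along array dimension k = 1 considers the 1×3 submatrix F_s obtained by removing the second row.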
  • FIG. 23 A further embodiment of a provided method 3000 is illustrated in FIG. 23.
  • flow begins at block 3010 where a program source code is received in a memory on a first computing apparatus.
  • a scheduling function is produced that optimizes the tradeoff.
  • Flow continues to block 3040 where the scheduling function is used to assign a partial order to the statements of the source code and an optimized program is produced.
  • flow may continue to block 3050 where conditional synchronizations between a plurality of execution units are inserted into the code.
  • enumerate, for each memory array, the possible set of values 0 ≤ k < r and optimize along the best contiguity dimension.
  • one skilled in integer programming constraint building will appreciate that such an enumeration can be achieved by standard techniques involving the introduction of disjunctive constraints.
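  • A sketch of this enumeration follows; weight() is a placeholder for the cost model described above, and in an integer programming setting the same choice would be expressed with disjunctive constraints rather than an explicit loop:

    /* Pick the array dimension k in [0, r) whose contiguity yields the
       best estimated benefit; illustrative only. */
    int best_contiguity_dim(int r, double (*weight)(int k)) {
        int best_k = 0;
        double best_w = weight(0);
        for (int k = 1; k < r; k++) {
            double wk = weight(k);
            if (wk > best_w) { best_w = wk; best_k = k; }
        }
        return best_k;
    }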
  • a further embodiment applies multiple optimizations using successive parallelism, locality and contiguity optimizations.
  • each tradeoff in the plurality of tradeoffs is optimized using different costs corresponding to the costs of parallelism, locality and contiguity for the particular level of the hardware hierarchy.
  • An example is provided for architectures with a hierarchy of coarse grained parallel execution units, each of them featuring fine grained parallel execution units.
  • One such architecture is the Intel Pentium E5300.
  • another example of such architecture is the NVIDIA Geforce GTX 280 Graphics Processing Unit (GPU)
  • a further example of such architecture is any custom configuration of traditional x86 processors with GPUs attached to them.
  • the following example illustrates how an embodiment of the invention computes schedules used to devise multi-level tiling hyperplanes and how a further embodiment of the invention computes different schedules for different levels of the parallelism and memory hierarchy of the second computing apparatus.
  • The flow of additional source code optimization embodiments is illustrated in FIG. 24.
  • flow begins at block 3010 where source code is received in memory on a first computing apparatus.
  • block 3020 where a selective tradeoff between parallelism, locality and contiguity of memory references is created for execution on a second computing apparatus.
  • block 3020 is expanded to illustrate various embodiments.
  • flow continues to block 3270 where an unassigned loop is identified in the code.
  • Flow then continues along three paths.
  • flow continues to block 3320 where a first cost function is assigned.
  • the first cost function is related to the cost of parallel execution on the second computing apparatus.
  • block 3330 where a decision variable for parallelism is assigned, then to conditional block 3340.
  • a memory reference is identified in block 3280 and flow continues to block 3290 where a third cost is assigned.
  • the third cost is related to the cost of contiguous storage and access of the memory reference.
  • Flow then continues to block 3300 where a decision variable is assigned to this cost then on to decision block 3310 where it is determined if additional memory references are present within the loop under consideration. If additional memory references exist, flow continues back to block 3280 and the process iterates until decision block 3310 determines that no additional memory references are present. Flow then continues to decision block 3340.
  • Flow begins at block 3360 where an unassigned loop pair is identified.
  • Flow then continues to block 3370 where a second cost function is assigned. In some embodiments, this second cost function is related to locality.
  • Flow then continues to block 3380 where a locality decision variable is assigned to the loop pair and on to decision block 3390 where a determination is made on the existence of additional unassigned loop pairs. If additional unassigned loop pairs are present, flow continues back to block 3360 and the process iterates until no additional loop pairs are found.
  • Flow then continues to block 3340 where it is determined if additional unassigned loops are present in the code. If so, flow continues back to block 3270 and the process iterates until no unassigned loops are present.
  • Flow then continues to block 3350 where the selective tradeoffs are formulated for the execution of the code on the second computing apparatus.
  • Flow then continues to block 3030 where a scheduling function is produced which optimizes the selective tradeoff and to block 3060 where an optimized program is produced
  • Various provided embodiments of method 3260 are illustrated in FIGs. 26a and 26b.
  • flow begins in block 3210 where program source code is received in a memory on a first computing apparatus.
  • the hierarchical architecture of a second computing apparatus is provided to the first computing apparatus, or in some embodiments derived by the first computing apparatus.
  • Flow continues to block 3410 where the next level of computing hierarchy of the second computing apparatus is identified.
  • block 3460 scheduling coefficients and constraints are fixed for the prior level of hierarchy under consideration.
  • Flow then continues to decision block 3500 where it is determined if the architecture level under consideration contains execution units.
  • the array elements A[i][j][k] are computed by a weighted sum of 7 elements: B[i][j][k], B[i-1][j][k], B[i+1][j][k], B[i][j-1][k], B[i][j+1][k], B[i][j][k-1] and B[i][j][k+1].
  • the array elements B[i][j][k] are computed by a weighted sum of 7 elements of A. The computation is iterated Titer times.
  • embodiments may produce the following optimized code in which permutable loops are marked as such.
  • the loops have been fused at the innermost level on loop l and the coarse-grained parallelism and locality are optimized by a first scheduling function that orchestrates the operations in the program as coarse-grained tasks of parallel execution.
  • the schedule for statement S1 corresponds to one possible outcome of provided embodiments that generate the code above, after the first optimization. Loop tiling by tiling factors (16, 8, 8, 1) may be applied to further improve locality, and the program would have the following form, where the inner loops m, n, o are permutable.
  • the loops are fused on all loops i,j,k,l,m,n and o.
  • the program does not take advantage of fine grained parallelism and contiguity on each fine grained parallel execution unit along the loops m, n and o.
  • the innermost doall dimensions traverse memory with contiguous accesses. It is a further object to allow these successive optimizations.
  • in a first step, we may give a very low priority to contiguity constraints.
  • we may achieve this goal by setting all the costs of contiguity references to zero.
  • tiling may be performed to create coarse grained tasks of execution. Performing tiling to create tasks of execution is a well-known process to an engineer skilled in the art.
  • schedules that produce the innermost doall dimensions may be further used to produce another level of multi-level tiling hyperplanes.
  • the engineer skilled in the field will appreciate that multi-level tiling is useful to perform transformations such as register tiling.
  • the resulting code may have the following structure:
  • Machine models can be composed hierarchically and the smallest machine model granularity for which embodiments optimize a program is denoted as a morph.
  • the machine model for a single GPU has one single morph, and only represents the GPU level of the machine model.
  • the GPU-specific attributes for a GeForce 9800GX2 are:
  • the GeForce 9800GX2 has 32K of shared memory. However, allocating the full 32K for shared memory arrays will not leave enough room for other GPU resources. Other embodiments may restrict the shared memory size to only 16K to affect the tiling and the local memory compaction phases of the mapper.
  • Another provided embodiment of an optimization method is illustrated in FIGs. 25a and 25b.
  • flow begins at block 3010 where source code is received in a memory on a first computing apparatus.
  • Flow then continues to block 3020 where a first selective tradeoff of parallelism, locality, and contiguity of memory references for execution of the code on a second computing apparatus is created.
  • Flow then continues to block 3080 where a first scheduling function is produced that optimizes the first selective tradeoff.
  • Flow then continues to block 3100 where tiling is performed.
  • Flow then continues to block 3110 where scheduling coefficients and constraints are fixed for the previous levels in the hierarchy.
  • a second scheduling function is produced that optimizes the second tradeoff and flow continues to block 3140 where the second scheduling function is used to assign a partial order of statements in the code and produce a fine grained parallel optimized program.
  • explicit copies of memory accessed are inserted between primary and secondary memories for non-contiguous accesses.
  • a third scheduling function is used to assign a partial order to the statements of the source code and to produce a fine grained parallel optimized subprogram and copies.
  • Flow then continues to block 3210 where costs are evaluated and a number of virtual execution threads along parallel dimensions is selected.
  • the referenced memory is promoted to private memory, conditional synchronizations between the virtual execution threads are inserted in block 3240, and an unroll-and-jam transformation is then performed in block 3250.
  • Massively multi-threaded hardware is defined as hardware that exhibits more virtual threads of execution than physical threads of execution.
  • the hardware (or sometimes the runtime layer) manages the physical scheduling of virtual threads to the actual physical processing units during execution.
  • Modern General Purpose Graphics Processing Units (GPGPUs) are massively parallel multiprocessors capable of high floating point operations performance and large memory bandwidth.
  • GPGPU programming is eased by the introduction of higher level data parallel languages such as CUDA
  • maximizing the performance of an application still requires the precise balancing of many different types of constraints in the GPU architecture, including the utilization of SIMD and coarse-grained data parallelism, and the management of the memory hierarchy.
  • the typical architecture of modern GPUs consists of an array of SIMD multiprocessors that share a global address space, subdivided into multiple types of memories.
  • the execution model of such an architecture is the following: each GPU multiprocessor executes a set of parallel threads (typically 32) as a single "warp" in a SIMD manner. Multithreading between a large number of cores is used to hide the latency of memory transfers and thread synchronization.
  • a CUDA kernel generated by embodiments and executed on the GPU may exhibit the following standard (pseudo code) form: a set of perfectly nested parallel doall-loops divided in a set of outer synchronization-free threads blocks (blocks for short), then further subdivided into a set of inner parallel threads.
  • the tripcounts for the block (B1, B2, B3) and thread dimensions (T1, T2, T3) are limited by architectural resources. Threads belonging to the same block can synchronize with each other and access their own subset of shared memory. Each thread also contains its own private local memory and registers, which can be used to hold thread-private data.
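  • A hedged skeleton of that standard form (the kernel name, sizes and body are placeholder assumptions, not output of the embodiments):

    /* Outer synchronization-free block grid (B1, B2), inner parallel
       thread grid (T1, T2); the body stands in for the mapped statement. */
    __global__ void kernel0(float *A, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < n && j < n)
            A[i * n + j] = 0.0f;               /* placeholder statement S(i,j) */
    }

    void launch(float *dA, int n) {
        dim3 threads(32, 16);                      /* T1 * T2 = 512 threads */
        dim3 blocks((n + 31) / 32, (n + 15) / 16); /* (B1, B2) block grid   */
        kernel0<<<blocks, threads>>>(dA, n);
    }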
  • the standard target is one core of the GeForce 9800GX2. While we disclose embodiments on a very simple kernel, mapping it onto a massively multi-threaded architecture is already a non-trivial task, and the mapping process will demonstrate many of the benefits of various embodiments.
  • One embodiment solves two main problems in CUDA mapping: (i) assigning the available set of parallelism to blocks and threads, and (ii) managing the memory hierarchies.
  • the problems are currently handled by various embodiments using the following sub-phases:
  • tiling aka blocking
  • the kernel formation task can be further decomposed into a set of sub-tasks:
  • Block and thread placement - determine the loop dimensions that can be mapped onto the block and thread grids of the GPU.
  • the strategy consists of transforming coarse-grained parallelism into the thread block parallelism in a CUDA kernel, and fine-grained parallelism into SIMD parallelism within a thread block.
  • the available amount of parallelism is easy to expose, and the resulting loop nests are as follows:
  • tiling is applied as the next step. Tiling is common knowledge for an engineer skilled in the art.
  • the tiling algorithm chooses a tile size that satisfies the following criteria:
  • the footprint of the tile does not exceed the size of the shared memory.
  • the tile size balances the amount of computation and communication between tiles.
  • the first constraint ensures that all the memory storage within one tile after tiling fits within the local memory of the GPU.
  • a tile size of 32×32×32 is chosen, and the resulting loop nest is:
  • the device memory corresponds to the original primary memory and the shared memory to the memory locations in the secondary memory.
  • the result is:
  • This step in general involves a sequence of "orthogonal" loop and data transformations, including, loop fusion, fission, interchange, stripmining, and data permutation.
  • the first step of this process is block and thread placement, i.e., determining the set of loop dimensions to be used for block and thread dimensions.
  • This first step is related to the contiguity properties of the optimized program.
  • Modern GPUs implement memory coalescing, whereby aligned stride-1 array accesses assigned to adjacent threads are merged into a single memory transaction. By taking advantage of this hardware feature, programs can drastically improve their memory transfer rate. However, memory coalescing interacts with data layout and thread placement in a non-trivial way, and so the two optimizations must be determined together.
  • some embodiments may interchange loops i and j, and designate the j loop as the thread dimension.
  • the resulting transformation is shown below:
  • d is an array dimension of A, indexed from 0;
  • j is a potential thread loop dimension (i.e., it must be a parallel intra-tile loop dimension).
  • w is the weight, which measures how much benefit there is if the given reference is coalesced.
  • a coalescing tuple (A, d, j, w) for a reference A[f(x)] means that if data dimension d of A is made the rightmost data dimension and if j is made the outermost thread dimension, we gain a performance benefit of w in the program.
  • the weight w of non-contiguous access is computed via the following estimate:
  • a reference to device memory is slower to execute than one to shared memory, with the slowdown factor related to the relative speed of the load/store from device and shared memory. In further embodiments, this value is determined by the runtime cost of code representative of loads and stores to and from the device and shared memories.
  • the tuples produced for all the references are as follows: for A[i,j], the triples are [(A, 0, i, PQR), (A, 1, j, PQR)]; for A[i,k], the triples are [(A, 0, i, PQR), (A, 0, k, PQR)]; and for A[i+j,32*i+j], the triples are [] (i.e., no memory coalescing is possible)
  • unified parallelism, locality and contiguity optimization finds the following loop structure with contiguity along dimension i:
  • the triples for the statement S can be computed by merging the triples for its references. In this case we have: [(A, 0, i, 2PQR), (A, 1, j, PQR), (A, 0, k, PQR)]. By exhaustive inspection, a further embodiment finds it is most beneficial to choose loop i as the outermost thread dimension. The resulting code is:
  • additional costs for the synchronizations between threads are introduced to account for nonparallel loop dimensions. This is because using an inner doall loop as a thread dimension can increase the amount of synchronization that we require in the final CUDA kernel. For example, consider the following loop nest with two parallel and one interleaved sequential loop dimensions:
  • some embodiments may choose loop k as the thread dimension.
  • a syncthreads call is inserted in the output code to preserve the semantics of the original program as follows:
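  • A sketch of the pattern (a hypothetical kernel of our own; the cross-thread dependence forces barriers in each iteration of the sequential loop j):

    __global__ void seq_under_threads(float *X, int steps) {
        int k = threadIdx.x;                 /* inner doall used as thread dim */
        for (int j = 0; j < steps; j++) {    /* sequential loop inside tile    */
            float v = X[(k + 1) % blockDim.x];
            __syncthreads();                 /* all reads of iteration j done  */
            X[k] = v + 1.0f;
            __syncthreads();                 /* all writes of iteration j done */
        }
    }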
  • Thread synchronization code is required if a loop dimension is nested under a sequential loop within a tile.
  • the total penalty of the synchronization is proportional to the trip count of the sequential loops, which is an estimate of the minimal amount of thread synchronization calls that the program has to execute per block.
  • Statements S1 and S2 have the i-loop in common.
  • the condition (2) states that if this embodiment chooses the i-loop for a thread dimension of S1, then it also has to use it for the thread dimension of S2. On the other hand, if the embodiment chooses the j-loop for the thread dimension of S1, then it has the freedom to use the k- or l-loop for the thread dimension of S2.
  • the previous method is used, in some embodiments, to determine the threads and block dimensions. In further embodiments, we find one dimension at a time, starting from the first thread dimension. During the selection of the first thread dimension, memory coalescing optimization is considered. When choosing other thread and block dimensions (where memory coalescing is no longer a concern), some embodiments may use the following heuristics instead:
  • the first thread dimension has a trip count of 32. Since we are only allowed a maximum of 512 threads on the 9800GX2, the second thread dimension is limited to 16. The second selected thread dimension has a trip count of 32. To maintain the limit of 512 threads, we stripmine the loop nest by 16 and use the stripmined loop as the second thread dimension. After this, we have exhausted the number of threads. We then proceed to select the block dimensions, which are loops i and j. Both block dimensions have trip counts of 128.
  • Loop dimensions that are assigned to a block or thread are made implicit.
  • Outer-loop dimensions that surround a block dimension are executed on the host processor, i.e., executed on the CPU outside of the CUDA kernel.
  • Loop dimensions that are below block dimensions can be sunken into the CUDA kernel and executed sequentially (doing so may require additional synchronizations to be inserted).
  • the loop dimensions i and j are used as the block dimensions. Since there are no loop dimensions above i in this example, the entire loop nests may be executed in the CUDA kernel, and the host-side code contains only a kernel launch.
  • the reduction loop dimension k can be sunken into the CUDA kernel; doing so requires the introduction of syncthreads () calls to sequentialize the execution within this loop.
  • the resulting transformed loop nests are as follows (we use th for threadIdx and bl for blockIdx to reduce clutter in the pseudo code):
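  • For a matrix-multiply-like kernel, a hedged sketch of the resulting shape (written with the full threadIdx/blockIdx names rather than the th/bl shorthand, and assuming a 16×16 thread block, a grid of (N/16)×(N/16) blocks, and N a multiple of 16): block dimensions come from tile loops i and j, the reduction loop k is sunk into the kernel, and syncthreads calls sequentialize the shared-memory stages:

    __global__ void k0(const float *A, const float *B, float *C, int N) {
        __shared__ float As[16][16], Bs[16][16];
        int i = blockIdx.y * 16 + threadIdx.y;      /* bl/th on i     */
        int j = blockIdx.x * 16 + threadIdx.x;      /* bl/th on j     */
        float s = 0.0f;
        for (int kk = 0; kk < N; kk += 16) {        /* sunken k loop  */
            As[threadIdx.y][threadIdx.x] = A[i * N + (kk + threadIdx.x)];
            Bs[threadIdx.y][threadIdx.x] = B[(kk + threadIdx.y) * N + j];
            __syncthreads();                        /* copies complete   */
            for (int k = 0; k < 16; k++)
                s += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                        /* before next stage */
        }
        C[i * N + j] = s;
    }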
  • Step 1: Check if all processor dimensions are present in each of the access functions. If NOT, mark as NOT PRIVATIZABLE. Step 2: Check if the coefficients of processor dimensions in all the access functions are the same.
  • Step 3: Check if the non-processor coefficients are a multiple of the processor grid sizes. If NOT, mark as NOT PRIVATIZABLE.
  • various embodiments may fix schedule coefficients to enforce invariance of parallelism, locality and contiguity of memory references across the plurality of selective tradeoffs as described above.
  • Embodiments then may compute dependences between communication and computation statements using techniques well-known by engineers knowledgeable in the field.
  • a selective tradeoff is further derived and optimized which comprises communication and computation statements. Costs are determined as described above.
  • embodiments may produce the following pseudo-program before the third weighted parametric function is optimized. In this pseudo-program, pr stands for private memory, CJ
  • Optimizing the third weighted function may completely modify the schedule of the computations and the interleaving of communications.
  • communications between the primary memory for B and the non-primary (secondary or third) memory are interleaved with computations:
  • Step 1: Compute dependences for each pair of references (R1, R2) accessing the same array A in the list. If array A is not marked PRIVATIZED: if R1 and R2 access the same location of A at depth d by different execution units, add a synchronization at depth at least d between the statements (S, T) referencing (R1, R2); attach the type of dependence (loop independent OR loop carried); attach the minimum dependence distance (min dist).
  • Embodiments allow the generation of multiple program versions with different synchronization properties depending on the associated cost model.
  • Unroll-and-jam is common knowledge for an engineer knowledgeable in the art. In one embodiment it is implemented as a simple stripmining transformation of a doall loop followed by an interchange to sink the loop at the innermost level. A late unrolling duplicates the code and results in the following code:
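  • A minimal sketch of the transformation on a generic doall-i / sequential-j nest of our own (unroll factor 2; assumes N is even):

    void before_uj(int N, int M, double *C, const double *A) {
        for (int i = 0; i < N; i++)          /* doall */
            for (int j = 0; j < M; j++)
                C[i] += A[i * M + j];
    }

    /* stripmine(i, 2) + interchange + late unroll: two running sums
       are kept live across the j loop */
    void after_uj(int N, int M, double *C, const double *A) {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < M; j++) {
                C[i]     += A[i * M + j];
                C[i + 1] += A[(i + 1) * M + j];   /* jammed copy */
            }
    }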
  • unroll-and-jam on the outermost block dimensions has the effect to further improve the reuse within the innermost kernels by keeping more than 2 running sums in the local array CJ.
  • unrolling the outer dimensions increases the sizes of AJ and/or BJ, so there is a need to limit the unrolling factor to avoid exceeding secondary memory limits.
  • the CUDA kernel output in some embodiments is very similar to the pseudocode above. Two additional and trivial adjustments are made to the output code:
  • #pragma unroll directives are inserted into the CUDA kernel to further enable unrolling by the NVIDIA nvcc compiler.

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)
  • Stored Programmes (AREA)

Abstract

Methods, apparatus and computer software product for source code optimization are provided. In an illustrative embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
PCT/US2010/031524 2009-04-17 2010-04-16 System, methods and apparatus for program optimization for multi-threaded processor architectures WO2010121228A2 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US17026109P 2009-04-17 2009-04-17
US61/170,261 2009-04-17
US12/561,152 US8572590B2 (en) 2008-09-17 2009-09-16 Methods and apparatus for joint parallelism and locality optimization in source code compilation
US12/561,152 2009-09-16

Publications (2)

Publication Number Publication Date
WO2010121228A2 true WO2010121228A2 (fr) 2010-10-21
WO2010121228A3 WO2010121228A3 (fr) 2011-01-20

Family

ID=42983182

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2010/031524 WO2010121228A2 (fr) 2009-04-17 2010-04-16 System, methods and apparatus for program optimization for multi-threaded processor architectures

Country Status (1)

Country Link
WO (1) WO2010121228A2 (fr)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990791B2 (en) 2011-07-29 2015-03-24 International Business Machines Corporation Intraprocedural privatization for shared array references within partitioned global address space (PGAS) languages
US9235527B2 (en) 2012-07-20 2016-01-12 The Johns Hopkins University Multiple-cache parallel reduction and applications
US9489405B2 (en) 2011-06-25 2016-11-08 International Business Machines Corporation Geometric array data structure
US9626736B2 (en) 2015-06-18 2017-04-18 International Business Machines Corporation Memory-aware matrix factorization
IT201700082213A1 * 2017-07-19 2019-01-19 Univ Degli Studi Di Siena Process for the automatic generation of parallel computing code

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6772415B1 (en) * 2000-01-31 2004-08-03 Interuniversitair Microelektronica Centrum (Imec) Vzw Loop optimization with mapping code on an architecture
US20070074195A1 (en) * 2005-09-23 2007-03-29 Shih-Wei Liao Data transformations for streaming applications on multiprocessors

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6772415B1 (en) * 2000-01-31 2004-08-03 Interuniversitair Microelektronica Centrum (Imec) Vzw Loop optimization with mapping code on an architecture
US20070074195A1 (en) * 2005-09-23 2007-03-29 Shih-Wei Liao Data transformations for streaming applications on multiprocessors

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
UDAY BONDHUGULA ET AL.: 'Toward Effective Automatic Parallelization for Multicore Systems.' Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), 14 April 2008, Miami, Florida, USA *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9489405B2 (en) 2011-06-25 2016-11-08 International Business Machines Corporation Geometric array data structure
US9535937B2 (en) 2011-06-25 2017-01-03 International Business Machines Corporation Geometric array data structure
US8990791B2 (en) 2011-07-29 2015-03-24 International Business Machines Corporation Intraprocedural privatization for shared array references within partitioned global address space (PGAS) languages
US9235527B2 (en) 2012-07-20 2016-01-12 The Johns Hopkins University Multiple-cache parallel reduction and applications
US9626736B2 (en) 2015-06-18 2017-04-18 International Business Machines Corporation Memory-aware matrix factorization
IT201700082213A1 * 2017-07-19 2019-01-19 Univ Degli Studi Di Siena Process for the automatic generation of parallel computing code
WO2019016656A1 * 2017-07-19 2019-01-24 Università Degli Studi Di Siena Method for automatic generation of parallel code

Also Published As

Publication number Publication date
WO2010121228A3 (fr) 2011-01-20

Similar Documents

Publication Publication Date Title
US8930926B2 (en) System, methods and apparatus for program optimization for multi-threaded processor architectures
US8572590B2 (en) Methods and apparatus for joint parallelism and locality optimization in source code compilation
Che et al. A performance study of general-purpose applications on graphics processors using CUDA
US8661422B2 (en) Methods and apparatus for local memory compaction
US11500621B2 (en) Methods and apparatus for data transfer optimization
Bauer et al. Singe: Leveraging warp specialization for high performance on gpus
Cattaneo et al. On how to accelerate iterative stencil loops: a scalable streaming-based approach
US11989536B1 (en) Methods and apparatus for automatic communication optimizations in a compiler based on a polyhedral representation
Orchard et al. Ypnos: declarative, parallel structured grid programming
Xiang et al. Heteroflow: An accelerator programming model with decoupled data placement for software-defined fpgas
Papakonstantinou et al. Efficient compilation of CUDA kernels for high-performance computing on FPGAs
Sabne et al. Scaling large-data computations on multi-GPU accelerators
Lepley et al. A novel compilation approach for image processing graphs on a many-core platform with explicitly managed memory
US9489180B1 (en) Methods and apparatus for joint scheduling and layout optimization to enable multi-level vectorization
Strout et al. Generalizing run-time tiling with the loop chain abstraction
WO2010121228A2 (fr) 2010-10-21 System, methods and apparatus for program optimization for multi-threaded processor architectures
Mehrotra et al. High performance Fortran: history, status and future
Parsa et al. Nested-loops tiling for parallelization and locality optimization
Lethin et al. R-stream 3.0 compiler
Lee High-Level Language Compilers for Heterogeneous Accelerators
Zou Towards automatic compilation for energy efficient iterative stencil computations
Davis Polyhedral+ Dataflow Graphs
Shah et al. Efficient Execution of Irregular Dataflow Graphs: Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra
Wei et al. Compilation System
Stratton Performance portability of parallel kernels on shared-memory systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10765317

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10765317

Country of ref document: EP

Kind code of ref document: A2