WO2015099562A1 - Methods and apparatus for data-parallel execution of operations on segmented arrays - Google Patents
- Publication number
- WO2015099562A1 (PCT/RU2013/001166)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- elements
- array
- segment
- chunk
- results
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/453—Data distribution
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
Definitions
- the invention relates to methods for executing operations on elements of a segmented array in a data-parallel fashion.
- the methods may be implemented in a computer system.
- Data parallelism is simpler than task parallelism from the programmer's point of view and can lead to good scalability, but it requires the system to do a decent job of balancing the processing of data chunks across CPU cores.
- Irregular data structures such as nested arrays are often represented as segmented arrays.
- A segmented array consists of a flat array of values and some information about how these values are partitioned into sub-arrays (called segments). Deeper nesting levels can be represented in this way with nested segment information. Segmented arrays were first defined in the context of compiling collection-oriented languages onto massively parallel computers, see for example, Guy Blelloch and Gary W. Sabot, "Compiling Collection-Oriented Languages onto Massively Parallel Computers", Journal of Parallel and Distributed Computing, 1990.
- US 2010/0076941 A1 suggests a scan algorithm implementation for parallel processors.
- the document suggests performing scans on GPUs or other parallel processors.
- Data is represented in a manner that optimizes mapping into the architecture of a GPU.
- Mechanisms structure and operate on data in a way to minimize memory bank conflicts and reduce latency of memory accesses.
- the method described in US 2010/0076941 A1 requires copying the data into specifically created data structures, which requires additional work and memory proportional to the size of the data and is superfluous in case of an implementation in multi-core systems.
- DPH (Data Parallel Haskell) solves the problem of balancing irregular data by splitting the values of a segmented array into chunks for independent processing, based on the size of the values only and not on the segment information. How to split the values of a segmented array into chunks for independent processing is, however, not described by Chakravarty et al. DPH is integrated into the full-fledged functional programming language Haskell. This imposes rather strict constraints on the implementation of DPH and in many cases leads to inefficient use of hardware.
- One object is to suggest a method for executing data-parallel operations on irregular data structures, and in particular so-called segmented arrays.
- the method should allow for an equal distribution of the data to be processed. Further advantageously, the method should allow reducing the memory required for storing intermediate data structures.
- a first aspect of the invention relates to achieving data parallelism in the execution of an operation on a segmented array.
- the segmented array may comprise regular data structures (e.g. (flat) arrays) or irregular data structures (e.g. nested arrays (arrays of arrays of different lengths)).
- the segmented array comprises an array of elements containing all elements of the different segments of the segmented array on which the operation is performed.
- the segmented array further comprises information indicating to which segment the respective elements of the array of elements belong.
- the size of the information indicating to which segment the respective elements of the array of elements belong is proportional to the number of segments; it may for example be an array of segment offsets, i.e. an array indicating, for each segment, the index of its first element in the array of elements.
- the elements in the array of elements of the segmented array are distributed into different chunks and each chunk of elements is assigned to an individual task for execution of the operation on the elements of the respective chunk.
- an intermediate data structure may be required for temporarily storing intermediate results (partial results) of the operation on different parts of a segment.
- the intermediate results are used in a subsequent synchronization to obtain the final result of the operation performed on the elements of a segment. This is for example the case when elements of a single segment are distributed across two or more chunks.
- the size of the intermediate data structure is proportional to the number of chunks, and thus not dependent on the size of the data in the segmented array.
- a second aspect of the invention in line with the first aspect, provides a method in which a computer may perform the following.
- the computer receives a command to execute an operation on an argument represented by a segmented array.
- the segmented array can be considered as comprising two arrays:
- the first array is an array of elements comprising the elements of plural segments, and the second array is an array of segment indicators indicative of which one or more elements belong to which one of the segments.
- the size of the data in the second array is independent of the number of elements in the segmented array and is only proportional to the number of its segments.
- the computer distributes the elements of the segmented array into plural chunks of elements for execution in respective threads such that each chunk comprises a number of elements of one or more of the segments. Then the computer executes the operation on the elements of each chunk of elements in respective threads.
- At least some of said respective threads are executed by the computer in parallel.
- the distribution of elements of the segmented array is performed independently of which segments the elements belong to. It is advantageous if the elements of each segment are arranged consecutively in the array of elements.
- each chunk of elements has approximately the same number of elements.
- the elements of the array of elements may be distributed evenly to the chunks. "Approximately" may for example mean that, in case the number of elements of the array of elements is not evenly divisible by the number of chunks, some chunks may have one element more than other chunks.
- the distribution of the elements of the segmented array into the plural chunks of elements results from dividing the array of elements. For example, this may be achieved by cutting the array of elements into a given number of pieces (chunks). As noted above, this division may not consider to which segments the individual elements of the segmented array belong, i.e. elements of a single segment may be distributed across two or more chunks, depending on the number of elements in the segment and the size of the chunks.
- the respective segment indicators may for example indicate the respective boundaries of the segments in the array of elements by means of offset indices.
- the indicators indicate for example the size of the segments, which also allows for deriving the boundaries of the segments within the array of elements.
- the execution of the operation on the elements of each chunk of elements in respective threads may depend on the type of the command.
- the computer may distinguish the following types of operations: element-wise operations, operations which are commutative and have an identity element, and operations which are either not commutative or have no identity element.
- the execution of the operation on the elements of each respective chunk of elements in a respective thread by the computer may be realized as follows.
- the computer sequentially executes the element-wise operation on each element of the respective chunk of elements; and writes the result of the element-wise operation to an array of results.
- the execution of the operation on the elements of each chunk of elements in a respective thread by the computer may be realized as follows.
- the computer identifies the elements of the respective chunk of elements belonging to the same segment, based on the segment indicators of the segmented array.
- the computer executes the operation on the elements of the respective chunk of elements belonging to the respective segment to obtain a partial result, executes the operation on the partial result and on another partial result or identity element corresponding to the respective segment in an array of results to obtain a final result, and writes the final result of the operation executed on the elements of the respective segment to a corresponding element in an array of results.
- the execution of the operation on the elements of each respective chunk of elements in a respective thread may further comprise the computer converting the operation into an atomic operation, in case the operation is not atomic.
- the atomic operation is executed on the partial result and on said other partial result or identity element corresponding to the respective segment in the array of results.
- the execution of the operation on the elements of each chunk of elements in a respective thread by the computer may be realized as follows.
- the computer identifies, based on the segment indicators of the segmented array, the elements of the chunk of elements belonging to the same segment. Furthermore, the computer identifies for each segment of which one or more elements are comprised in the respective chunk of elements and based on the segment indicators of the segmented array, whether the elements of the segment have been divided across multiple chunks or not, and if so, whether the one or more elements of the respective segment comprised in the respective chunk of elements correspond to the first one or more elements of the segment or not.
- For each of the segments of which one or more elements are comprised in the respective chunk of elements, the computer performs the following.
- in case the respective chunk of elements comprises the first elements of the respective segment but not all of its elements, the computer executes the operation on the first elements of the respective segment to obtain a partial result of the operation for the respective segment, and writes the partial result to a corresponding element in the array of results.
- in case the respective chunk of elements comprises elements of the respective segment other than its first elements, the computer executes the operation on these elements to obtain a partial result of the operation for the respective segment, and writes the partial result to a corresponding element in an auxiliary array of results.
- upon each thread having finished execution, the computer executes the operation on the partial results of the operation comprised in the array of results and the auxiliary array of results, and writes the final result of the operation executed on the respective segment into the array of results.
- the computer has a multi-core processor.
- the number of chunks of elements may be equal to or an integer-multiple of the number of cores of a multi-core processor of the computer. At least some of the threads are executed on respective cores of the processor in parallel.
- a third aspect of the invention which is still in line with the first aspect, relates to a computer.
- the computer comprises a receiving unit configured to receive a command to execute an operation on an argument represented by a segmented array.
- the segmented array has an array of elements comprising the elements of plural segments, and an array of segment indicators indicative of which one or more elements belong to which one of the segments.
- the computer may also have a distribution unit configured to distribute the elements of the segmented array into plural chunks of elements for execution in respective threads so that each chunk comprises a number of elements of one or more of the segments.
- the computer further comprises an execution unit configured to execute the operation on the elements of each chunk of elements in respective threads.
- the computer of a first implementation of the third aspect further comprises a memory unit for storing the segmented array and results of executing the operation on the elements of each chunk of elements.
- the execution unit comprises one or more processors adapted to execute the operation on the elements of each chunk of elements in respective threads.
- the execution unit comprises a multi-core processor and the distribution unit is configured to distribute the plural chunks of elements to the respective cores of the processors for execution in respective threads.
- the computer may be configured to execute at least some of said respective threads in parallel.
- the distribution unit is adapted to distribute the elements of the segmented array into plural chunks of elements so that each chunk of elements has approximately the same number of elements.
- the execution unit is configured to determine whether the operation is an element-wise operation, whether the operation is commutative and has an identity element, or whether the operation is not commutative and/or has no identity element, and to execute the operation based on the determined type of operation.
- the execution unit is configured to sequentially execute the element-wise operation on each element of the respective chunk of elements, and to write the result of the element-wise operation to an array of results in a memory unit, in case the operation is an element-wise operation.
- the execution unit is configured to identify, based on the segment indicators of the segmented array, the elements of the respective chunk of elements belonging to the same segment; to execute, for each of the segments of which elements are comprised in the respective chunk of elements, the operation on the elements of the respective chunk of elements belonging to the respective segment to obtain a partial result for said segment, execute the operation on the partial result and on another partial result or identity element corresponding to the respective segment in an array of results to obtain a final result, and to write the final result of the operation executed on the elements of the respective segment to a corresponding element in an array of results, in case the operation is commutative and has an identity element.
- the computer may also comprise a conversion unit configured to convert the operation into an atomic operation, in case the operation is not atomic and the execution unit is configured to execute the atomic operations on the partial result and on said other partial result or identity element corresponding to the respective segment in the array of results.
- the execution unit is configured to identify, based on the segment indicators of the segmented array, the elements of the chunk of elements belonging to the same segment; and to identify for each segment of which one or more elements are comprised in the respective chunk of elements and based on the segment indicators of the segmented array, whether the elements of the segment have been divided across multiple chunks or not, and if so, whether the one or more elements of the respective segment comprised in the respective chunk of elements correspond to the first one or more elements of the segment or not. For each of the segments of which one or more elements are comprised in the respective chunk of elements:
- the execution unit is configured to execute the operation on the elements of the respective segment, and to write the result of the operation to a corresponding element in an array of results, if all elements of the respective segment are comprised in the chunk of elements;
- the execution unit is configured to execute the operation on the first elements of the respective segment to obtain a partial result of the operation for the respective segment, and to write the partial result to a corresponding element in the array of results, if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements comprises the first elements of the respective segment;
- the execution unit is configured to execute the operation on the elements of the respective segment to obtain a partial result of the operation for the respective segment, and to write the partial result to a corresponding element in an auxiliary array of results, if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements does not comprise the first elements of the respective segment; and wherein the execution unit is configured to execute, upon each thread having finished execution and for each segment of which elements have been distributed into different chunks of elements, the operation on the partial results of the operation comprised in the array of results and the auxiliary array of results, and to write the final result of the operation executed on the respective segment into the result array.
- a fourth aspect of the invention, which is still in line with the first aspect, provides a computer readable medium storing instructions that, when executed by a processor of a computer, cause the computer to execute an operation on an argument represented by a segmented array, by receiving a command to execute an operation on an argument represented by a segmented array, wherein the segmented array has an array of elements comprising the elements of plural segments, and an array of segment indicators indicative of which one or more elements belong to which one of the segments; distributing the elements of the segmented array into plural chunks of elements for execution in respective threads, wherein each chunk comprises a number of elements of one or more of the segments; and executing the operation on the elements of each chunk of elements in respective threads.
- the computer readable medium of an implementation of the fourth aspect stores instructions that, when executed by the processor of a computer, cause the computer to perform the steps of one of the different methods for executing an operation on an argument represented by a segmented array according to the second aspect or one of its exemplary implementations described herein.
- Fig. 1 shows an exemplary flow chart for executing an operation on an argument represented as a segmented array according to an exemplary embodiment of the invention
- Fig. 2 shows an exemplary flow chart for executing an element-wise operation on an argument represented as a segmented array according to an exemplary embodiment of the invention
- Fig. 3 exemplifies the mapping of segments to an array of results by segment-dependent operations according to an embodiment of the invention
- Fig. 4 exemplifies the mapping of segments to an array of results by segment-dependent operations according to an embodiment of the invention, where the segment-dependent operations are commutative and have an identity element,
- Fig. 5 exemplifies the mapping of segments to an array of results by segment-dependent operations according to an embodiment of the invention, where the segment-dependent operations are non-commutative and/or have no identity element,
- Fig. 6 exemplifies the synchronization of partial results, when executing an operation being non-commutative and/or having no identity element on elements of a segmented array according to an embodiment of the invention
- Fig. 7 illustrates a flow chart for executing an operation on a segmented array according to an embodiment of the invention, where the operation can be any element-wise operation or segment-dependent operation
- Fig. 8 illustrates an exemplary flow chart for executing a commutative operation having an identity element on a segmented array according to an embodiment of the invention
- Fig. 9 illustrates an exemplary flow chart for executing an operation, which is non-commutative and/or has no identity element, on a segmented array according to an embodiment of the invention
- Fig. 10 shows a representation of a sparse matrix in compressed sparse row (CSR) format
- Fig. 11 shows an exemplary implementation of inventive methods in a computer system according to an exemplary embodiment of the invention
- Fig. 12 shows results of a performance comparison of an SMVM algorithm implemented in C++ using OpenMP library against a compiler implementation using the exemplary SMVM algorithm discussed above and realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9,
- Fig. 13 shows a performance comparison of a compiler implementation for the Green-Marl programming language realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9 (named HiGraph) against a standard compiler for the Green-Marl programming language
- Fig. 14 shows an exemplary computer system according to an embodiment of the invention.
- one aspect of the invention relates to achieving data parallelism in the execution of an operation on a segmented array.
- An operation may also be referred to as a primitive.
- the invention may be for example realized within a compiler of a computer system, but the invention is not limited to this embodiment.
- the compiler may for example be used to compile a collection-oriented programming language or a domain-specific programming language, achieving data parallelism in the execution of the compiled assembler or object code.
- the individual features of the different embodiments of the aspects discussed herein may individually or in arbitrary combination be subject matter of another invention.
- a command may comprise a sequence of operations, e.g. vector operations.
- At least one of the arguments is a segmented array.
- Segmented arrays are represented as pairs of segment information and the elements (or values) of the respective segments.
- the elements/values of the segments are all provided in an array of elements.
- the segment information is information indicating which of the elements/values belong to which segment of the segmented array.
- the size of the segment information is proportional to the number of segments in the segmented array and could be for example realized by an array of segment offsets.
- the elements (or values) of an individual segment are provided as a group of consecutive elements (i.e. arranged consecutively within the array of elements/values).
- the segment information is an array of segment offsets identifying the boundaries of the segments by indicating the offset/index of the first element of a respective segment.
- the first element in the array of elements has the index 0 (zero); for example, for the array of elements [1, 1, 2, 3, 5, 8] representing the nested array {[1, 1, 2], [3, 5], [8]}, the indices where the individual segments in the array of elements start are the indices 0, 3 and 5.
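- A minimal sketch of this representation in C++ (the type and field names are illustrative and not taken from the patent text) is the following:

#include <vector>

// A segmented array: all values in one flat array, plus one offset per
// segment giving the index of the segment's first element.
struct SegmentedArray {
    std::vector<int> elements;   // e.g. {1, 1, 2, 3, 5, 8}
    std::vector<int> offsets;    // e.g. {0, 3, 5} for segments [1,1,2], [3,5], [8]
};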
- Fig. 1 shows an exemplary flow chart for executing an operation on an argument represented as a segmented array according to an exemplary embodiment of the invention.
- a command is received by the computer.
- the command includes a segmented array as an argument and the operation to be performed on the segmented array.
- the command may include additional arguments or parameters for execution of the operation, e.g. another operand and/or a buffer or pointer to a memory location to store the result of the operation.
- the array of elements is partitioned into multiple chunks ("chunks of elements").
- the number of chunks obtained by partitioning is c, where c ∈ ℕ, c > 1.
- this may be for example realized by simply distributing 102 the elements of the array of elements into the given number c of individual chunks.
- the partitioning may be realized by simply dividing the array of elements into c chunks. The division may ensure that all chunks have approximately an equal number of elements (i.e. either ⌊n/c⌋ or ⌊n/c⌋ + 1 elements, where n is the number of elements in the array of elements and ⌊n/c⌋ denotes the largest integer not exceeding n/c).
- the partitioning of the array of elements may not consider how the values are structured in segments, i.e. the partitioning does not take into account the information of the array of segment offsets, but considers only the array of elements.
- the number c of chunks may be for example configured depending on the number of processors or processor cores of the computer to execute the threads in which the chunks are processed.
- the number c of chunks may be equal to or an integer multiple of the number of processors or processor cores of the computer, so that the operation can be performed on the elements of the respective chunks in respective threads (or processes) executed on a respective one of the processors or processor cores.
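- For illustration, the chunk boundaries may be computed along the following lines (a sketch under the nearly-equal division described above; the function name is not from the patent text):

// Start index of chunk k when n elements are split into c chunks; the first
// n % c chunks receive one extra element, so every chunk has either
// floor(n/c) or floor(n/c) + 1 elements.
int chunk_start (int n, int c, int k)
{
    int base = n / c;                       // minimum chunk size
    int extra = n % c;                      // chunks with one extra element
    return k * base + (k < extra ? k : extra);
}
// Chunk k covers the indices [chunk_start(n, c, k), chunk_start(n, c, k + 1)).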
- Each of the chunks is processed independently of the other chunks, i.e. the operation is performed 103 on the elements in each chunk individually, without requiring additional information from other chunks or the like.
- the chunks are processed independently (e.g. in parallel) in different threads/processes.
- Inside a chunk the operation may be applied sequentially to the elements of the chunk.
- For each segment the computed result is stored directly to the corresponding location in the array of results, in case the operation is an element-wise operation. Otherwise, there may be an additional execution of the operation on partial results, for example, in case elements of a segment get split into multiple chunks.
- this additional execution of the operation on the intermediate results is also referred to as a synchronization of the intermediate results of the processing of each chunk.
- Such a synchronization step may be required so as to obtain the correct final result of the operation for each of the segments. This will be outlined below in further detail.
- the proposed process of executing an operation on a segmented array may have the following advantages. Since the distribution of the elements of the array of elements into chunks allows for a substantially equal number of elements to be processed in each chunk/thread, each thread can be given an essentially equal amount of work (balancing). The additional overhead for auxiliary data structures depends on the number of chunks/threads, and not on the size of the data itself.
- the exact execution of an operation on a segmented array depends on the type of the operation.
- the execution of an operation on a segmented array according to an exemplary embodiment is described for two different basic types of operations, (i) element-wise operations (being segment-independent) and (ii) segment-dependent operations.
- the segment-dependent operations may be further subdivided into two different types of operations, namely (ii.a) commutative operations having an identity element, and (ii.b) operations being not commutative and/or not having an identity element.
- Element-wise operations are characterized as operations which are performed on each element of a segment, respectively the segmented array (argument), individually. They do not depend on the belonging of a respective element to a given segment, i.e. they can be performed on the array of elements without considering information on the segments as for example given in the array of segment offsets.
- results of element-wise operations may (but not necessarily) have the same structure as the segmented array, i.e. the array of results may also be represented as a segmented array with the same array of segment offsets as the argument of the operation.
- Fig. 2 shows a flow chart of performing an element-wise operation on elements of a segmented array according to an exemplary embodiment of the invention.
- the computer receives 201 a command which includes at least one segmented array having an array of elements and an array of segment offsets, and the element-wise operation to be performed on the array of elements of the segmented array.
- a nested array {[1, 1, 2], [3, 5], [8]} would be represented by an array of elements [1, 1, 2, 3, 5, 8] and an array of segment offsets [0, 3, 5].
- As the operation is an element-wise operation, the computer can ignore the information in the array of segment offsets.
- the computer divides 202 the elements of the segmented array (substantially) equally to a given number of chunks. Assuming for exemplary purposes the array of elements [1, 1, 2, 3, 5, 8] and two chunks to be formed: a first chunk [1, 1, 2] and a second chunk [3, 5, 8]. Each of the chunks corresponds to a thread.
- the computer executes 203 the element-wise operation on the elements of each chunk in a respective thread.
- the operation may be a multiply operation that multiplies each element of the segmented array with a given operand.
- the operand is multiplied with each element of the respective chunk.
- the operation may be performed on the elements of the chunk sequentially, one element after the other.
- the results of applying the operation to the respective elements are stored 204 in an array of results.
- the structure of the array of results would be identical to the array of elements and also the segment offsets would still indicate the segment boundaries within the array of results.
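- A possible sketch of such an element-wise multiply with one thread per chunk (thread handling simplified; all names are illustrative and not taken from the patent text):

#include <thread>
#include <vector>

// Multiplies every element by the operand; each chunk is processed
// sequentially inside its own thread, and the array of results keeps the
// structure (and segment offsets) of the input.
void multiply_elementwise (const std::vector<int>& elements, int operand,
                           std::vector<int>& results, int c)
{
    results.resize (elements.size ());
    int n = (int) elements.size ();
    std::vector<std::thread> threads;
    for (int k = 0; k < c; k++) {
        int lo = k * n / c, hi = (k + 1) * n / c;       // chunk boundaries
        threads.emplace_back ([&elements, &results, operand, lo, hi] {
            for (int i = lo; i < hi; i++)
                results[i] = elements[i] * operand;     // element-wise step
        });
    }
    for (std::thread& t : threads)
        t.join ();                                      // wait for all chunks
}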
- a sparse matrix-vector multiplication (SMVM) may for example be expressed as reduce_seg (Add, zip_with (Multiply, m_values, gather (vector, m_indices))).
- Non-zero matrix elements are represented by two segmented arrays: a nested array of the column numbers of the non-zero elements in each matrix row (m_indices); and a nested array of the values of the non-zero matrix elements in each matrix row (m_values).
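- For example (with values chosen purely for illustration), a 3×3 matrix with rows (5 0 2), (0 0 3) and (0 7 0) would be represented by m_values = {[5, 2], [3], [7]} and m_indices = {[0, 2], [2], [1]}, i.e. by the arrays of elements [5, 2, 3, 7] and [0, 2, 2, 1] sharing the segment offsets [0, 2, 3].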
- the element-wise operation gather builds a nested array of values taken from the vector by indices of non-zero elements of the matrix (in each row).
- the gather operation may be implemented as a data parallel operation in different threads, as discussed in connection with Fig. 2 above.
- the element-wise operation zip_with may also be implemented in a data-parallel fashion across multiple threads as discussed in connection with Fig. 2 above.
- an intelligent runtime system would defer the execution of this operation up to the point of parallel execution of reduce_seg and fuse the multiplication with the accessing of the values for the Add monoid. This would allow avoiding the construction of an intermediate array for the products.
- the segment-dependent operation reduce_seg is based on the segment information and can't be parallelized as easily (see below).
- Fundamentally different from the element-wise operations upon segmented arrays are operations that restructure the data of the entire segmented array or its individual segments, e.g. combine some values (of a given segment) in one way or another, and thus depend on the segment information. These operations are referred to as segment-dependent operations herein. Exemplary and non-limiting representatives of such operations are the reduce, reduce_seg and scan_seg operations discussed below.
- Segment-dependent operations map each segment to some result, as exemplified in Fig. 3.
- the 1st to 6th segments of the segmented array may have one or more elements each, which are mapped to some result per segment, which is stored in the array of results.
- the result of a segment-dependent operation could be a single combined result for all segments (e.g. the reduce operation), a single result for each segment (e.g. the reduce_seg operation, see Fig. 3), or an array of results for each segment (e.g. the scan_seg operation).
- the array of elements of the segmented array is first partitioned in a given number of chunks.
- the partitioning ignores information on the segment boundaries (as provided in the array of segment offsets) and ideally partitions all values of the array of elements equally into chunks. This is exemplified in Fig. 4, where the elements of the 1st to 6th segments of the segmented array are distributed equally (three elements each) into 6 chunks. Note that the number of chunks and segments is equal here, but the numbers may of course also be different from each other.
- Each thread then processes a respective one of the chunks completely independently from other chunks and other threads.
- In order to allow independent processing of chunks, two problems are to be solved.
- a thread should have a way to know which segments intersect with the chunk. More precisely, for the execution of the operation on the elements of a chunk within a thread/process, the thread/process should know the elements of which segments are comprised in the respective chunk. This may be necessary in order to know where to store the result computed for a segment.
- knowing the segments intersecting with a given chunk may also be necessary, if the operation directly depends on information on the segment boundaries (for example, an operation could map each value to the number of a corresponding segment). How the computer may find segments intersecting with a given chunk will be described herein below in further detail.
- Two classes of segment-dependent operations are distinguished in the following: operations which are commutative and have an identity element, and operations which are not commutative and/or do not have an identity element.
- Fig. 8 shows a flow chart of the operation of a computer system according to an exemplary embodiment, in case the operation to be performed on the elements of the segmented array is commutative and has an identity element, i.e. the corresponding op function forms a commutative monoid.
- Fig. 4 exemplifies the "mapping" of the - in some instances partial - results to corresponding locations in the array of results, in accordance with the flow chart of Fig. 8.
- the array of results is initialized 801 with the identity element of the operation to be performed on the elements of the segmented array.
- the elements in the segmented array are distributed into distinct chunks, e.g. by splitting 802 them. This distribution may not take into account to which respective segment the respective value of the segmented array belongs to, i.e. the segment information may be ignored.
- Each chunk is executed in a respective thread.
- the computer system checks 803 for each chunk (as indicated by the multiple boxes of step 803) whether segments intersect across the boundaries of the respective chunk, and identifies 803 the respective segments contained in the respective chunk. In other words, for each given chunk, the computer system determines the set S of segments intersecting with the chunk.
- the elements of the 1st to 6th segments are distributed to a total of six chunks.
- the 1st chunk only contains some but not all elements of the 1st segment, so that the set S for the 1st chunk only contains the 1st segment.
- the 2nd chunk contains the last element of the 1st segment and the two elements of the 2nd segment, so that the set S for the 2nd chunk contains the 1st segment and the 2nd segment.
- One possible implementation of a process to identify which elements of which segments are contained in the respective chunk will be outlined below in further detail.
- the next steps 804 to 809 are performed for each segment within a given chunk and for each chunk.
- the computer system applies 805 the operation sequentially to the elements in the segment. Once the result of the operation is calculated for the elements of the segment within the respective chunk, the result is written 806 to the corresponding location in the array of results. Steps 805 and 806 are performed for all complete segments of the set S of the respective chunk.
- In case a segment of the set S is split across multiple chunks, the execution of the operation is performed according to steps 807 to 810.
- If the operation is not an atomic operation, the operation is converted 807 into an atomic operation.
- the use of atomic operations may be advantageous to avoid race conditions in writing the results.
- the operation is applied 808 to each of the elements of the respective segment within the given chunk to compute a partial result on the elements of the segment comprised in the respective chunk (note that the order of steps 807 and 808 could be also interchanged).
- the partial result is combined with another partial result of the execution of the operation on other elements of the same segment (in another thread) or the identity element (from initialization 801) stored in a corresponding place in the array of results.
- the synchronization is realized in the atomic step 809 by applying an atomic (version of the) operation to the partial result obtained in the respective thread and a previously calculated partial result for the same segment or the identity element stored in the array of results, thereby obtaining the "final result" of the operation, and then storing the "final result" in the array of results.
- "Final result” here denotes the two combined partial results, respectively, the combination of the partial result in the given thread and the identity element.
- the result of the operation for a given segment stored in the array of results is final once all elements of the segment have been processed in the chunks (threads) across which the segment got split.
- atomic counterparts can be constructed as functions using CPU instructions for atomic memory access, such as compare-and-swap (CAS).
- the exemplary sample C/C++ code below exemplarily illustrates how an atomic counterpart for a given non-atomic integer function op can be implemented using a compiler intrinsic for CAS instruction (note that this implementation doesn't address the ABA problem):
// Reconstructed sketch: the original listing is not fully reproduced in this
// text; __sync_bool_compare_and_swap is a GCC-style intrinsic for the CAS
// instruction, used here for illustration.
template <typename V>
V atomic_op (V* p, V arg, V (*op) (V, V))
{
    V v1;   // value loaded from *p
    V v2;   // result of applying op
    do {
        v1 = *p;                // load the current value
        v2 = op (v1, arg);      // apply the non-atomic function op
    } while (!__sync_bool_compare_and_swap (p, v1, v2));   // retry if *p changed
    return v2;
}
- the function atomic_op loads a value from the memory location pointed to by argument p, and then applies the function op to the pair of the loaded value and the input argument arg. Finally, the function stores the result back in the memory location pointed to by pointer p in an atomic way.
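- For step 809, a thread could then, for example, combine its partial result for a segment s into the array of results with a call like atomic_op (&results[s], partial, add), where add is the non-atomic Add function (the names are illustrative, not from the original listing).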
- the number of atomic operations (which are typically less efficient than their non-atomic counterparts, e.g. non-atomic store operations, in terms of execution time) doesn't depend on the size of the data and is proportional to the number of chunks.
- Fig. 9 shows a flow chart of a procedure in a computer system according to an exemplary embodiment, in case the operation to be performed on the elements of the segmented array is non-commutative and/or has no identity element.
- the execution of the operation on the elements of the respective segments contained in the respective chunk is - to a large extent - similar to the processing described in connection with Fig. 8, as indicated by the use of the same reference numerals in Fig. 9.
- no initialization is needed in the procedure of Fig. 9 due to the last synchronization step 904 (as will become more apparent from the following).
- the partial results for a split segment are computed in the different threads, and then the operation is performed again on the partial results for the segment to obtain the final result.
- an auxiliary data structure to temporarily store the partial results is provided (one of the partial results may be also stored in the array of results, as in the specific implementation of Fig. 9, but this is optional).
- the size of the auxiliary array is proportional to the number of chunks (which may depend on the hardware features such as the number of CPU cores) and doesn't depend on the size of data.
- the order in processing the partial results is decisive for the final result of the operation, so that the synchronization step (see step 904) needs to know which partial result refers to which elements of the split segment.
- step 901 checks whether, within a given thread (i.e. for a chunk), a partial result has been computed on the first elements of the split segment, or not. If this is the case (yes), the partial result of the operation is stored 902 in the array of results at a corresponding location for the segment. If this is not the case (no), the partial result of the operation is stored 903 in an auxiliary array of results. Once all partial results for a split segment are computed in the respective threads (chunks), the operation is computed 904 again on the partial results within the array of results and the auxiliary array of results, in the order prescribed by the operation (for example, left-to-right for forward reduction and right-to-left for backward reduction). This step 904 is also referred to as a synchronization step herein.
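- The per-chunk storage discipline of steps 902 and 903 may be sketched as follows (a simplified illustration; all names, the int value type and the helper structure are assumptions, and the identification of the intersecting segments, step 803, is omitted):

#include <algorithm>
#include <vector>

struct Partial { bool valid; int value; };   // one slot per chunk

// Processes the part of segment seg (covering element indices
// [seg_begin, seg_end)) that falls into chunk k ([chunk_begin, chunk_end)).
void process_segment_part (int k, const std::vector<int>& elements,
                           int seg, int seg_begin, int seg_end,
                           int chunk_begin, int chunk_end,
                           int (*op) (int, int),
                           std::vector<int>& results,
                           std::vector<Partial>& aux)
{
    int lo = std::max (seg_begin, chunk_begin);
    int hi = std::min (seg_end, chunk_end);
    int partial = elements[lo];
    for (int i = lo + 1; i < hi; i++)
        partial = op (partial, elements[i]);   // sequential within the chunk
    if (seg_begin >= chunk_begin) {
        results[seg] = partial;     // chunk holds the first elements (step 902)
    } else {
        aux[k].valid = true;        // later parts go to the auxiliary
        aux[k].value = partial;     // array of results (step 903)
    }
}
// In the synchronization step 904, results[seg] is then combined with the
// valid aux entries of the chunks covering seg, in left-to-right order.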
- the segmented array has six segments and is split in six chunks (similar to Fig. 3 and Fig. 4).
- the exemplary segmented array illustrated in Fig. 5 can be represented as {[2, 3, 1, 1], [4, 2], [1, 2, 1], [2, 4], [1, 2, 5, 3, 1], [3, 6]} or alternatively, as an array of elements [2, 3, 1, 1, 4, 2, 1, 2, 1, 2, 4, 1, 2, 5, 3, 1, 3, 6] and corresponding segment boundaries [0, 4, 6, 9, 11, 16] given by the indices of the elements in the array of elements.
- the indices of the elements within the respective segments are shown in Fig. 5 (starting with 0).
- a binary operation which chooses the one of two indices i and j for which the value is minimum (selecting the "left one" i if the values are identical).
- the binary operation may be for example implemented as follows:
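// Reconstructed sketch (the original listing is not reproduced in this text);
// the array name and the signature are illustrative. Returns the index whose
// value is smaller, preferring the left index i if the values are identical.
int ChooseMinIdx (const double* values, int i, int j)
{
    return (values[j] < values[i]) ? j : i;
}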
- When using atomic operations as in the procedure of Fig. 8, the result can be incorrect for such an operation.
- the result of the atomic operation to find the (leftmost) index of the smallest value in the segment would depend on whether the elements of the 1st chunk or the 2nd chunk are processed and written to the results first.
- Instead of using atomic operations, an auxiliary array of results stores 903 the partial results for each of the split segments, except the partial result for the first elements of the segment (step 902 in Fig. 9; see also Fig. 5).
- the final result of the operation on the elements of the segment is computed in step 904 by applying the same operation (e.g. ChooseMinIdx) again, this time to the partial results in the array of results and the auxiliary array of results, and again in left-to-right order of the chunks into which the elements of the segment were split.
- This synchronization step is exemplarily illustrated in Fig. 6.
- As the elements of the 1st segment in Fig. 5 are split across the 1st and 2nd chunk, there is one partial result in the array of results (2, 1) - denoting at index 2 of the 1st segment a value of 1 - and one partial result in the auxiliary array of results (3, 1) - denoting at index 3 of the 1st segment a value of 1.
- Applying the ChooseMinIdx operation to the partial results [1, 1] with indices 2 and 3, the final result is the index 2 (the leftmost index of the minimum value 1).
- In steps 803 of Figs. 8 and 9, the segments intersecting with a given chunk need to be identified.
- This identification may be realized by a binary search which may be executed by the computer system.
- To find the segment containing the left boundary of a given chunk, a binary search of the boundary index in the array of segment offsets is used. Segment offsets are associated with each segmented array. In essence, the binary search finds the maximal segment offset which is not larger than the index of the left boundary of the chunk.
- An exemplary sample C++ code for a function to do a binary search of a place for the index in the array of offsets is shown below:
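// Reconstructed sketch (the original listing is not reproduced in this text).
// Finds the place for `index` in offsets[lb..rb): returns the largest
// position whose offset is not larger than `index`.
int find_place (const int* offsets, int lb, int rb, int index)
{
    while (rb - lb > 1) {
        int mid = lb + (rb - lb) / 2;
        if (offsets[mid] <= index)
            lb = mid;       // offsets[mid] <= index: keep searching to the right
        else
            rb = mid;       // offsets[mid] too large: search to the left
    }
    return lb;
}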
- the binary search finds the place for an index in an array of offsets between a left boundary lb and a right boundary rb. Once some segment intersecting with a given chunk is known, it is easy to determine whether its right boundary also belongs to the chunk (for example, one can compare the offset of the next segment with the right boundary of the chunk). Thus, processing the segments left to right, it is known when the current segment is the last in the chunk.
- a binary search has logarithmic complexity in the size of the array.
- overheads for finding segments are proportional to the logarithm of the number of segments for each chunk.
- Since the number of chunks may be greater than the logarithm of the number of segments, one can consider the latter to be a constant and the overheads to be proportional to the number of chunks.
- Although the binary search is extra workload in executing the operation on the segmented array, this extra work can be performed in parallel, e.g. on different CPU cores.
- the above binary search will be exemplified in the following assuming a segmented array [[1, 1, 2], [3, 5], [8]].
- the segmented array thus has offsets [0, 3, 5].
- the left boundary for the 1st chunk is index 0, so it is immediately known that the first segment for this chunk is segment 0.
- the next segment starts at offset 3, which is greater than the end of the 1st chunk (index 1), so that the 1st chunk intersects with segment 0 only.
- the left boundary for the 2nd chunk is 2. Finding the location for index 2 within the offsets [0, 3, 5] with binary search yields the index 0, so that the 2nd chunk starts in segment 0. Since the leftmost index 3 of the next segment 1 is identical to the right boundary 3 of the 2nd chunk, and the next segment 2 only starts at index 5, segment 1 starts in the 2nd chunk and ends after it, so the 2nd chunk intersects with segments 0 and 1.
- the left boundary for the 3rd chunk is 4. Finding the location for the index 4 in the offsets [0, 3, 5] with binary search yields the position after index 3, so that the 3rd chunk starts in segment 1. The next segment 2 starts and ends in the 3rd chunk, so that the 3rd chunk intersects with segments 1 and 2.
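- With the find_place sketch given above (names illustrative), this example reads: find_place (offsets, 0, 3, 0) yields 0 for the 1st chunk, find_place (offsets, 0, 3, 2) yields 0 for the 2nd chunk, and find_place (offsets, 0, 3, 4) yields 1 for the 3rd chunk, where offsets = {0, 3, 5}.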
- Fig. 7 illustrates a flow chart for executing an operation on a segmented array according to an embodiment, where the operation can be any element-wise operation or segment-dependent operation.
- Upon receiving 701 a command to perform a given operation on a segmented array, the computer system will next determine 702 the type of operation to be executed. Depending on this determination, different procedures for executing the operation are executed.
- If the operation is an element-wise operation, steps 202, 203 and 204 are performed by the computer system. If the operation is a commutative operation and has an identity element, the procedure outlined above in connection with Fig. 8 is performed by the computer system. Otherwise, i.e. if the operation is non-commutative and/or has no identity element, the procedure outlined above in connection with Fig. 9 is performed by the computer system.
- the methods for executing operations on the elements of a segmented array may be for example implemented in a compiler running in a computer system.
- the operations executed on the elements of the segmented array may be considered primitives within the "interface" provided by the compiler towards the application programmer.
- Fig. 11 shows an exemplary implementation of inventive methods in a computer system.
- the compiler 1102 receives the source code 1101 and identifies the operations (primitives) that are to be executed, as well as the arguments of the respective operations.
- Each of the operations that are to be executed on a segmented array is parallelized for execution in plural threads by compiler 1102 using for example the methods as shown in one of Figs. 2, 7, 8 and 9.
- the different chunks obtained by this data parallelization are provided to a scheduler 1103.
- the scheduler 1103 provides the chunks to the hardware platform 1104 where the operation is performed on the elements of each chunk in a separate thread.
- the hardware platform 1104 may for example include one or more CPUs which may each have one or more cores on which the different threads can be executed.
- the scheduler 1103 may be a static scheduler that tries to equally distribute the chunks to the different CPUs or their respective cores.
- the data-parallel processing of operations working with irregular data structures can be implemented efficiently in multi-core processor environments.
- the processing of operations can be divided into different threads (one thread per chunk) and thus the workload can be easily divided between CPU cores based on the upper level of a data structure to be processed.
- the scheduler 1103 may also be a dynamic scheduler that uses a dynamic scheduling algorithm for distributing the chunks to the different CPUs or their respective cores for their execution.
- Using a dynamic scheduler in conjunction with data parallelization, as described above, may allow for a smaller number of scheduling units (low overheads), while inside each unit the utilization of the computing power of the CPU cores can be high due to data parallelization.
- Fig. 12 shows the results of performance comparison of an SMVM algorithm implemented in C++ using OpenMP library against a compiler implementation using the exemplary SMVM algorithm discussed above and realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9.
- the data irregularity parameter controls the distribution of the lengths of the matrix rows: close to 0, all the lengths are approximately the same; the larger the value of the parameter, the longer the longest rows become and the more rows are very short (the number of non-zero elements is constant across the experiments).
- the graph in Fig. 12 has been obtained in a double-CPU Intel Xeon X5650 configuration using 24 threads. Other thread numbers give approximately the same picture.
- the same SMVM algorithm was also compared with a Data Parallel Haskell (DPH) system.
- the SMVM implemented in DPH is about 30 times slower than the exemplary SMVM implementation proposed above.
- the main advantage of the proposed SMVM implementation over the DPH system is the efficient management of the resulting data: in case the SMVM operations are commutative, the results are written directly to the pre-allocated results array using atomic operations on chunk boundaries. In case of DPH this isn't possible due to the functional nature of the language.
- Fig. 13 shows a performance comparison of a compiler implementation for the Green-Marl programming language realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9 (named HiGraph) against a standard compiler for the Green-Marl programming language.
- Green-Marl is a domain-specific language (DSL) aiming for easy and efficient graph processing (see http://ppl.stanford.edu/main/green_marl.html). The comparison is done using a double-CPU Intel Xeon X5650 configuration.
- a speedup of approx. 10 times can be achieved in comparison to the use of a standard Green-Marl compiler provided by the Stanford University.
- An exemplary computer system is shown in Fig. 14.
- Such a computer system is for example a computing device and may comprise different functional blocks, such as for example a processor 1403, which may have one or - advantageously - multiple cores.
- the computing device may also have multiple processors, which may have one or more cores each.
- the processor may also be considered an execution unit 1404.
- the computing device may also have a memory 1401 for (temporarily) storing data related to the processing of the operations.
- the memory could store the segmented array to which the operation is applied, the array of results, as well as the optional intermediate data structures, e.g. the auxiliary array of results.
- the memory could also store the results of the identification of the segments intersecting with a given chunk and information indicating whether a given segment starts in a given chunk. Distinct portions of the memory may be assigned to the individual threads in which the operation is executed on the elements of a given chunk.
- the computing device may also comprise a data bus (not shown) to communicate data between the processor(s) (or its/their cores) and the memory.
- the computing device may also comprise a compiler 1405, which has a receiving unit 1406 for receiving for example the command and its arguments, or the source code from memory 1401.
- the compiler 1405 may also have a distribution unit 1407 which distributes the elements of the segmented array into chunks for execution in different threads by the processor 1403.
- the computing system may also have a scheduler 1408 which is responsible for scheduling the execution of the operation on the elements within each chunk in a respective thread.
- the computing device may be implemented in a variety of devices, including a server, personal computer, laptop computer, netbook computer, tablet computer, a mobile phone, a smart phone, etc.
- a (functional or tangible) block may correspond to one or more method steps or a feature of a method step.
- aspects described in the context of a corresponding block or item or feature of a corresponding apparatus may also correspond to individual method steps of a corresponding method.
- the methods described herein may also be executed by (or using) a hardware apparatus, like processor(s), microprocessor(s), a programmable computer or an electronic circuit. One or more of the most important method steps may be executed by such an apparatus.
- the apparatus may comprise a memory or storage medium, which may be communicatively coupled to one or more hardware elements/circuitry of the apparatus.
- a digital storage medium for example a floppy disk, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals or instructions stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed.
- a data carrier may be provided which has electronically readable control signals or instructions, which are capable of cooperating with a programmable computer system, such that the method described herein is performed.
Abstract
The invention relates to methods for executing operations on elements of a segmented array in a data-parallel fashion. The methods may be implemented in a computer system. According to the invention a computer performs the following: receiving a command to execute an operation on an argument represented by a segmented array, wherein the segmented array has an array of elements comprising the elements of plural segments, and an array of segment indicators indicative of which one or more elements belong to which one of the segments; distributing the elements of the segmented array into plural chunks of elements for execution in respective threads, wherein each chunk comprises a number of elements of one or more of the segments; and executing the operation on the elements of each chunk of elements in respective threads.
Description
METHODS AND APPARATUS FOR DATA-PARALLEL EXECUTION OF OPERATIONS ON SEGMENTED ARRAYS
FIELD OF INVENTION
The invention relates to methods for executing operations on elements of a segmented array in a data-parallel fashion. The methods may be implemented in a computer system.
TECHNICAL BACKGROUND
Multiple CPU cores have become ubiquitous in computing devices (servers, laptops, smartphones, etc.). These devices are referred to as multi-core and are differentiated from special-purpose parallel architectures such as the wide-vector machines of the past or GPUs. In order for an application to use all the computing power of a multi-core device, it should divide its work into independent pieces that can run in parallel on all the CPU cores. The pieces of work should be balanced, i.e. at each moment every CPU core should have some work to do, otherwise its computing power would be wasted. The task of dividing and balancing the work between CPU cores lies with the programmer and the programming system (s)he uses. This task is referred to as parallel programming.
There are two general approaches to programming parallelism: task parallelism and data parallelism. Herein below, we refer to data parallelism. Data parallelism is described in terms of dividing the application data into chunks and processing each chunk independently. A typical case can be written as a loop like

for (int i = 0; i < data_chunks.size(); i++)
    process(data_chunks[i]);

where each loop cycle is independent.
Data parallelism is simpler than task parallelism from the programmer's point of view and can lead to good scalability, but it requires the system to do a decent job of balancing the processing of data chunks on CPU cores.
For regular data structures like flat arrays balancing isn't a problem: an array can be divided into equal chunks which are distributed evenly between the cores. But for irregular data structures like nested arrays (i.e. arrays of arrays of different lengths) this simple solution is often not acceptable: if the work is split between cores based on the upper level of the data structure, then the amount of actual data in each chunk can be very different. Thus the CPU cores will be assigned different amounts of work and computing power will be wasted.
Irregular data structures such as nested arrays are often represented as segmented arrays. A segmented array consists of a flat array of values and some information on how these values are partitioned into sub-arrays (called segments). More nesting levels can be represented in this way with nested segment information. Segmented arrays were first defined in the context of compiling collection-oriented languages onto massively parallel computers, see for example, Guy Blelloch and Gary W. Sabot, "Compiling Collection-Oriented Languages onto Massively Parallel Computers", Journal of Parallel and Distributed Computing, 1990.
US 8,321,492 B1 and US 8,243,083 B1 show methods that aim at implementation in massively-parallel architectures such as GPUs. The described methods rely heavily on a representation for segmented arrays that, for each value of the array, allows knowing immediately which segment it belongs to. Therefore the methods require that, after dividing the work between threads, each thread has all the information needed for computation of an operation without accessing any other memory. In particular, for each value of a segmented array a thread should know which segment this value belongs to, which requires additional data structures proportional to the size of the data. For multi-core systems having these large additional data structures is superfluous and inefficient, as significant additional data (proportional to the amount of data to be processed) is added, and often the "bottleneck" in the processing of massive amounts of data is the data transfer time from memory.
US 2010/0076941 Al suggests a scan algorithm implementation for parallel processors. The document suggests performing scans on GPUs or other parallel processors. Data is represented in a manner that optimizes mapping into the architecture of a GPU. Mechanisms structure and operate on data in a way to minimize memory bank conflicts and reduce latency of memory accesses. The method described in US 2010/0076941 Al requires copying of the data into specifically created data structures which requires additional work and memory proportional to the
size of the data and is superfluous in case of implementation in multi-core systems.
The Data Parallel Haskell (DPH) system as discussed in Chakravarty et al., "Data Parallel Haskell: a status report", DAMP 2007: Workshop on Declarative Aspects of Multicore Programming, ACM Press, 2007, focuses on irregular data parallelism. DPH solves the problem of balancing irregular data by splitting the values of a segmented array into chunks for independent processing based on the size of the values only and not on the segment information. How to split the values of a segmented array into chunks for independent processing is however not described by Chakravarty et al. DPH is integrated into the full-fledged functional programming language Haskell. This imposes rather strict constraints on the implementation of DPH and in many cases leads to inefficient use of hardware.
One potential problem with the DPH solution is the functional programming approach (pure functions, no mutable state), which leads to excessive creation of intermediate data structures. In general, overheads related to intermediate data structures are proportional to the size of the data, which is often unacceptable in practice.
SUMMARY
One object is to suggest a method for executing data-parallel operations on irregular data structures, and in particular so-called segmented arrays. Advantageously, the method should allow for an equal distribution of data to be processed. Further advantageously, the method should allow reducing memory required for storing intermediate data structures.
A first aspect of the invention relates to achieving data parallelism in the execution of an operation on a segmented array. The segmented array may comprise regular data structures (e.g. (flat) arrays) or irregular data structures (e.g. nested arrays (arrays of arrays of different lengths)). The segmented array comprises an array of elements containing all elements of the different segments of the segmented array on which the operation is performed. The segmented array further comprises information indicating to which segment the respective elements of the array of elements belong. In the invention, the size of the information indicating to which segment the respective elements of the array of elements belong is proportional to the number of segments, and may be for example an array of segment offsets, i.e. an array the elements of which are the indices of the start of the respective segments in the array of elements.
Irrespective of the type of the data structures (regular or irregular), the elements in the array of elements of the segmented array are distributed into different chunks and each chunk of elements is assigned to an individual task for execution of the operation on the elements of the respective chunk.
Moreover, for execution of the operation on the elements of each individual chunk, no additional information or data structures must be passed to the thread for execution of the operation. Further, only in case the operation to be executed on the elements is non-commutative and/or has no identity element, an intermediate data structure may be required for temporarily storing intermediate results (partial results) of the operation on different parts of a segment. The intermediate results are used in a subsequent synchronization to obtain the final result of the operation performed on the elements of a segment. This is for example the case when elements of a single segment are distributed across two or more chunks. Advantageously, when having to synchronize intermediate results for a given segment, the size of the intermediate data structure is proportional to the number of chunks, and thus not dependent on the size of the data in the segmented array.
A second aspect of the invention, in line with the first aspect, provides a method in which a computer may perform the following. The computer receives a command to execute an operation on an argument represented by a segmented array. The segmented array can be considered as comprising two arrays: the first array is an array of elements comprising the elements of plural segments, and the second array is an array of segment indicators indicative of which one or more elements belong to which one of the segments. The size of the data in the second array is independent of the size of the data in the first array and is only proportional to the number of segments. The computer distributes the elements of the segmented array into plural chunks of elements for execution in respective threads such that each chunk comprises a number of elements of one or more of the segments. Then the computer executes the operation on the elements of each chunk of elements in respective threads.
In a first implementation of the second aspect, at least some of said respective threads are executed by the computer in parallel.
Furthermore, in another implementation of the second aspect or its first advantageous implementation, the distribution of the elements of the segmented array is performed independently of the belonging of the elements to the respective segments. It is advantageous if the elements of each segment are arranged consecutively in the array of elements.
In a second implementation of the second aspect, each chunk of elements has approximately the same number of elements. In this example, the elements of the array of elements may be distributed evenly to the chunks. "Approximately" may for example mean that, in case the number of elements of the array of elements divided by the number of chunks is not a natural number, some chunks may have one element more than other chunks.
In a third implementation of the second aspect or any of its first or second implementation, the distribution of the elements of the segmented array into the plural chunks of elements is a result from dividing the array of elements. For example, this may be reached by cutting the array of elements in a given number of pieces (chunks). As noted above, this division may not consider to which segments the individual elements of the segmented array belong, i.e. elements of a single segment may be distributed across two or more chunks, depending on the number of elements in the segment and the size of the chunks.
In a more specific implementation of the third implementation above, the respective segment indicators may for example indicate the respective boundaries of the segments in the array of elements by means of offset indices. However, it is also possible that the indicators indicate for example the size of the segments, which also allows for deriving the boundaries of the segments within the array of elements.
In a fourth implementation of the second aspect or any of the first through third implementation thereof, the execution of the operation on the elements of each chunk of elements in respective threads may depend on the type of the command. As will become more apparent, the computer may distinguish the following types of operations: element-wise operations, operations which are commutative and have an identity element, and operations which are either not commutative or have no identity element.
In a fifth implementation of the second aspect or any of the first through fourth implementation thereof, in case the operation is an element-wise operation, the
execution of the operation on the elements of each respective chunk of elements in a respective thread by the computer may be realized as follows. The computer sequentially executes the element-wise operation on each element of the respective chunk of elements; and writes the result of the element-wise operation to an array of results.
In a sixth implementation of the second aspect or any of the first through fifth implementation thereof, in case the operation is commutative and has an identity element, the execution of the operation on the elements of each chunk of elements in a respective thread by the computer may be realized as follows. The computer identifies the elements of the respective chunk of elements belonging to the same segment, based on the segment indicators of the segmented array. Moreover, for each of the segments of which elements are comprised in the respective chunk of elements, the computer executes the operation on the elements of the respective chunk of elements belonging to the respective segment to obtain a partial result, executes the operation on the partial result and on another partial result or identity element corresponding to the respective segment in an array of results to obtain a final result, and writes the final result of the operation executed on the elements of the respective segment to a corresponding element in an array of results.
In a more specific example of the sixth implementation, the execution of the operation on the elements of each respective chunk of elements in a respective thread, in case the operation is commutative and has an identity element, may further comprise the computer converting the operation into an atomic operation, in case the operation is not atomic. The atomic operation is executed on the partial result and on said other partial result or identity element corresponding to the respective segment in the array of results.
In a seventh implementation of the second aspect or any of the first through sixth implementation thereof, in case the operation is not commutative or has no identity element, the execution of the operation on the elements of each chunk of elements in a respective thread by the computer may be realized as follows. The computer identifies, based on the segment indicators of the segmented array, the elements of the chunk of elements belonging to the same segment. Furthermore, the computer identifies for each segment of which one or more elements are comprised in
the respective chunk of elements and based on the segment indicators of the segmented array, whether the elements of the segment have been divided across multiple chunks or not, and if so, whether the one or more elements of the respective segment comprised in the respective chunk of elements correspond to the first one or more elements of the segment or not.
For each of the segments of which one or more elements are comprised in the respective chunk of elements the computer performs the following.
If all elements of the respective segment are comprised in the chunk of elements, the computer executes the operation on the elements of the respective segment, and writes the result of the operation to a corresponding element in an array of results.
If not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements comprises the first elements of the respective segment, the computer executes the operation on the first elements of the respective segment to obtain a partial result of the operation for the respective segment, and writes the partial result to a corresponding element in the array of results.
If not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements does not comprise the first elements of the respective segment, the computer executes the operation on the elements of the respective segment to obtain a partial result of the operation for the respective segment, and writes the partial result to a corresponding element in an auxiliary array of results. Upon each thread having finished execution, and for each segment of which elements have been distributed into different chunks of elements, the computer executes the operation on the partial results of the operation comprised in the array of results and the auxiliary array of results, and writes the final result of the operation executed on the respective segment into the array of results (result array).
Note that in the first two of the three cases, no additional intermediate data structure (auxiliary array of results) is required to obtain the final results of the operation. Only in the last of the three cases, a synchronization of intermediate results in the auxiliary array of results and the array of results is performed once each thread has finished execution.
In an eighth implementation of the second aspect or any of the first through
seventh implementation thereof, the computer has a multi-core processor. For example, the number of chunks of elements may be equal to or an integer-multiple of the number of cores of a multi-core processor of the computer. At least some of the threads are executed on respective cores of the processor in parallel.
A third aspect of the invention, which is still in line with the first aspect, relates to a computer. The computer comprises a receiving unit configured to receive a command to execute an operation on an argument represented by a segmented array. The segmented array has an array of elements comprising the elements of plural segments, and an array of segment indicators indicative of which one or more elements belong to which one of the segments. The computer may also have a distribution unit configured to distribute the elements of the segmented array into plural chunks of elements for execution in respective threads so that each chunk comprises a number of elements of one or more of the segments. The computer further comprises an execution unit configured to execute the operation on the elements of each chunk of elements in respective threads.
The computer of a first implementation of the third aspect further comprises a memory unit for storing the segmented array and results of executing the operation on the elements of each chunk of elements.
In a second implementation of the third aspect or its first implementation, the execution unit comprises one or more processors adapted to execute the operation on the elements of each chunk of elements in respective threads. In a third implementation of the third aspect or its first or second implementation, the execution unit comprises a multi-core processor and the distribution unit is configured to distribute the plural chunks of elements to the respective cores of the processors for execution in respective threads. The computer may be configured to execute at least some of said respective threads in parallel.
In a fourth implementation of the third aspect or any of its first through third implementation, the distribution unit is adapted to distribute the elements of the segmented array into plural chunks of elements so that each chunk of elements has approximately the same number of elements.
According to a fifth implementation of the third aspect or any of its first through fourth implementation, the execution unit is configured to determine whether the operation is an element-wise operation, whether the operation is commutative and has an identity element, or whether the operation is not commutative or has no identity element, and to execute the operation based on the determined type of operation.
According to a sixth implementation of the third aspect or any of its first through fifth implementation, the execution unit is configured to sequentially execute the element-wise operation on each element of the respective chunk of elements, and to write the result of the element-wise operation to an array of results in a memory unit, in case the operation is an element-wise operation.
Furthermore, in a seventh implementation of the third aspect or any of its first through sixth implementation, the execution unit is configured to identify, based on the segment indicators of the segmented array, the elements of the respective chunk of elements belonging to the same segment; to execute, for each of the segments of which elements are comprised in the respective chunk of elements, the operation on the elements of the respective chunk of elements belonging to the respective segment to obtain a partial result for said segment, execute the operation on the partial result and on another partial result or identity element corresponding to the respective segment in an array of results to obtain a final result, and to write the final result of the operation executed on the elements of the respective segment to a corresponding element in an array of results, in case the operation is commutative and has an identity element. Optionally, the computer may also comprise a conversion unit configured to convert the operation into an atomic operation, in case the operation is not atomic and the execution unit is configured to execute the atomic operations on the partial result and on said other partial result or identity element corresponding to the respective segment in the array of results.
In an eighth implementation of the third aspect or any of its first through seventh implementation, in case the operation is not commutative or has no identity element, the execution unit is configured to identify, based on the segment indicators of the segmented array, the elements of the chunk of elements belonging to the same segment; and to identify for each segment of which one or more elements are comprised in the respective chunk of elements and based on the segment indicators of the segmented array, whether the elements of the segment have been divided across multiple chunks or not, and if so, whether the one or more elements of the respective
segment comprised in the respective chunk of elements correspond to the first one or more elements of the segment or not. For each of the segments of which one or more elements are comprised in the respective chunk of elements:
- the execution unit is configured to execute the operation on the elements of the respective segment, and to write the result of the operation to a corresponding element in an array of results, if all elements of the respective segment are comprised in the chunk of elements;
- the execution unit is configured to execute the operation on the first elements of the respective segment to obtain a partial result of the operation for the respective segment, and to write the partial result to a corresponding element in the array of results, if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements comprises the first elements of the respective segment; and
- the execution unit is configured to execute the operation on the elements of the respective segment to obtain a partial result of the operation for the respective segment, and to write the partial result to a corresponding element in an auxiliary array of results, if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements does not comprise the first elements of the respective segment; and wherein the execution unit is configured to execute, upon each thread having finished execution and for each segment of which elements have been distributed into different chunks of elements, the operation on the partial results of the operation comprised in the array of results and the auxiliary array of results, and to write the final result of the operation executed on the respective segment into the result array.
A fourth aspect of the invention, which is still in line with the first aspect, provides a computer readable medium storing instructions that, when executed by a processor of a computer, cause the computer to execute an operation on an argument represented by a segmented array, by receiving a command to execute an operation on an argument represented by a segmented array, wherein the segmented array has an array of elements comprising the elements of plural segments, and an array of segment indicators indicative of which one or more elements belong to which one of the segments; distributing the elements of the segmented array into plural chunks of elements for execution in respective threads, wherein each chunk comprises a number of elements of one or more of the segments; and executing the operation on the elements of each chunk of elements in respective threads.
The computer readable medium of an implementation of the fourth aspect stores instructions that, when executed by the processor of a computer, cause the computer to perform the steps of one of the different methods for executing an operation on an argument represented by a segmented array according to the second aspect or one of its exemplary implementations described herein.
BRIEF DESCRIPTION OF FIGURES
In the following, embodiments of the invention are described in more detail with reference to the attached figures and drawings. Similar or corresponding details in the figures are marked with the same reference numerals.
Fig. 1 shows an exemplary flow chart for executing an operation on an argument represented as a segmented array according to an exemplary embodiment of the invention,
Fig. 2 shows an exemplary flow chart for executing an element-wise operation on an argument represented as a segmented array according to an exemplary embodiment of the invention,
Fig. 3 exemplifies the mapping of segments to an array of results by segment-dependent operations according to an embodiment of the invention,
Fig. 4 exemplifies the mapping of segments to an array of results by segment-dependent operations according to an embodiment of the invention, where the segment-dependent operations are commutative and have an identity element,
Fig. 5 exemplifies the mapping of segments to an array of results by segment-dependent operations according to an embodiment of the invention, where the segment-dependent operations are non-commutative and/or have no identity element,
Fig. 6 exemplifies the synchronization of partial results when executing an operation being non-commutative and/or having no identity element on elements of a segmented array according to an embodiment of the invention,
Fig. 7 illustrates a flow chart for executing an operation on a segmented array according to an embodiment of the invention, where the operation can be any element-wise operation or segment-dependent operation,
Fig. 8 illustrates an exemplary flow chart for executing a commutative operation having an identity element on a segmented array according to an embodiment of the invention,
Fig. 9 illustrates an exemplary flow chart for executing an operation, which is non-commutative and/or has no identity element, on a segmented array according to an embodiment of the invention,
Fig. 10 shows a representation of a sparse matrix in compressed sparse row (CSR) format,
Fig. 11 shows an exemplary implementation of inventive methods in a computer system according to an exemplary embodiment of the invention,
Fig. 12 shows results of a performance comparison of an SMVM algorithm implemented in C++ using OpenMP library against a compiler implementation using the exemplary SMVM algorithm discussed above and realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9,
Fig. 13 shows a performance comparison of a compiler implementation for the
Green-Marl programming language realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9 (named HiGraph) against a standard compiler for the Green-Marl programming language, and
Fig. 14 shows an exemplary computer system according to an embodiment of the invention.
DETAILED DESCRIPTION
The following paragraphs will describe various implementations and embodiments of the different aspects. As already noted above, one aspect of the invention relates to achieving data parallelism in the execution of an operation on a segmented array. An operation may also be referred to as a primitive. The invention may for example be realized within a compiler of a computer system, but the invention is not limited to this embodiment. In such a compiler embodiment, the compiler may for example be used to compile a collection-oriented programming language or domain-specific programming language, achieving data parallelism in the execution of the compiled assembler or object code. It should be further noted that the individual features of the different embodiments of the aspects discussed herein may individually or in arbitrary combination be subject matter of another invention.
Generally, it is assumed in the following that a sequence of operations, e.g. vector operations, is to be performed on one or more arguments. At least one of the arguments is a segmented array. Segmented arrays are represented as pairs of segment information and the elements (or values) of the respective segments. The elements/values of the segments are all provided in an array of elements. The segment information is information indicating which of the element/values belong to which segment of the segmented array. As noted previously, the size of the segment information is proportional to the number of segments in the segmented array and could be for example realized by an array of segment offsets.
In one exemplary implementation, which is also considered in the following for exemplary purposes only, the elements (or values) of an individual segment are provided as a group of consecutive elements (i.e. consecutively arranged within the array of elements/values). Furthermore, the segment information is an array of segment offsets identifying the boundaries of the segments by indicating the offset/index of the first element of a respective segment.
For example, consider a nested array {[1, 1, 2], [3, 5], [8]} comprising three arrays (or vectors) [1, 1, 2], [3, 5], and [8] having three, two and one element/value, respectively. The representation of this irregular data structure as a segmented array according to the above exemplary implementation would be as follows:
1. Array of elements: [1, 1, 2, 3, 5, 8]
2. Array of segment offsets: [0, 3, 5]
Note that the first element in the array of elements has the index 0 (zero), and accordingly the indices where the individual segments in the array of elements start (see bold characters) are the indices 0, 3 and 5.
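For illustration only, such a representation could be sketched in C++ as follows; the struct and member names (SegmentedArray, elements, segment_offsets) are assumptions made for this sketch and are not prescribed by the described method:

#include <vector>

// Illustrative sketch of a segmented array: a flat array of elements plus
// an array of segment offsets whose size is proportional to the number of
// segments, not to the number of elements.
struct SegmentedArray {
    std::vector<int> elements;         // e.g. [1, 1, 2, 3, 5, 8]
    std::vector<int> segment_offsets;  // e.g. [0, 3, 5]

    // Index of the first element of segment s in the array of elements.
    int segment_begin(int s) const { return segment_offsets[s]; }

    // Index one past the last element of segment s.
    int segment_end(int s) const {
        return (s + 1 < (int)segment_offsets.size())
                   ? segment_offsets[s + 1]
                   : (int)elements.size();
    }
};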
Fig. 1 shows an exemplary flow chart for executing an operation on an argument represented as a segmented array according to an exemplary embodiment of the invention. In a first step 101 a command is received by the computer. The command includes a segmented array as an argument and the operation to be performed on the segmented array. Furthermore, the command may include additional arguments or parameters for execution of the operation, e.g. another operand and/or a buffer or pointer to a memory location to store the result of the operation.
Furthermore, the array of elements is partitioned into multiple chunks ("chunks of elements"). Without loss of generality it can be assumed for exemplary purposes that the array of elements has E elements and that the number of chunks obtained by partitioning is c, where c ∈ ℕ, c > 1. As shown in Fig. 1, this may for example be realized by simply distributing 102 the elements of the array of elements into the given number c of individual chunks. In one embodiment, the partitioning may be realized by simply dividing the array of elements into c chunks. The division may ensure that all chunks have approximately an equal number of elements (i.e. either ⌊E/c⌋ or ⌊E/c⌋ + 1 elements, where ⌊E/c⌋ is the next lower integer number of E/c). Furthermore, the partitioning of the array of elements may not consider how the values are structured in segments, i.e. the partitioning does not take into account the information in the array of segment offsets, but considers only the array of elements.
The number c of chunks may for example be configured depending on the number of processors or processor cores of the computer to execute the threads in which the chunks are processed. For example, the number c of chunks may be equal to or an integer multiple of the number of processors or processor cores of the computer, so that the operation can be performed on the elements of the respective chunks in respective threads (or processes) executed on a respective one of the processors or processor cores.
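A minimal sketch of such a partitioning is given below, assuming for illustration that chunk boundaries are represented as [begin, end) index pairs into the array of elements; the helper name make_chunks is hypothetical:

#include <utility>
#include <vector>

// Split E elements into c chunks of nearly equal size: each chunk gets
// floor(E/c) elements, and the first (E % c) chunks get one element more.
std::vector<std::pair<int, int>> make_chunks(int E, int c) {
    std::vector<std::pair<int, int>> chunks;
    int base = E / c;
    int extra = E % c;
    int begin = 0;
    for (int k = 0; k < c; ++k) {
        int end = begin + base + (k < extra ? 1 : 0);
        chunks.push_back({begin, end});  // chunk k covers elements [begin, end)
        begin = end;
    }
    return chunks;
}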
Each of the chunks is processed independently of the other chunks, i.e. the operation is performed 103 on the elements in each chunk individually, without requiring additional information from other chunks or the like. The chunks are processed independently (e.g. in parallel) in different threads/processes. Inside a chunk the operation may be applied sequentially to the elements of the chunk. For each segment the computed result is stored directly to the corresponding location in the array of results, in case the operation is an element-wise operation. Otherwise, there may be an additional execution of the operation on partial results, for example, in case elements of a segment get split into multiple chunks. In case the operation is not commutative or does not include an identity element, this additional execution of the operation on the intermediate results is also referred to as a synchronization of the intermediate results of the processing of each chunk. Such a synchronization step may be required so as to obtain the correct final result of the operation for each of the segments. This will be outlined below in further detail.
The proposed process of executing an operation on a segmented array may have the following advantages. Since the distribution of the elements of the array of elements into chunks allows for having a substantially equal number of elements to be processed in each chunk/thread, each thread can be given an essentially equal amount of work (balancing). The additional overhead for auxiliary data structures depends on the number of chunks/threads, and not on the size of the data itself. Furthermore, depending on the type of operation, no auxiliary data structures may be required. In addition, no synchronization points are required during the main workload, due to individually processing the elements of each chunk. Altogether this allows for an efficient implementation of data-parallel algorithms working with irregular data structures in a (multi-core or multi-processor) computer system.
As noted above, the exact execution of an operation on a segmented array depends on the type of the operation. In the following, the execution of an operation on a segmented array according to an exemplary embodiment is described for two different basic types of operations, (i) element-wise operations (being segment independent) and (ii) segment-dependent operations. As will become further apparent, the segment-dependent operations may be further subdivided into two different types of operations, namely (ii.a) commutative operations having an identity element, and (ii.b) operations being not commutative and/or not having an identity element.
Element-wise Operations
Element-wise operations are characterized as operations which are performed on each element of a segment, respectively the segmented array (argument), individually. They do not depend on the belonging of a respective element to a given segment, i.e. they can be performed on the array of elements without considering information on the segments as for example given in the array of segment offsets.
Furthermore, the results of element-wise operations may (but not necessarily) have the same structure as the segmented array, i.e. the array of results may also be represented as a segmented array with the same array of segment offsets as the argument of the operation.
Exemplary and non-limiting representatives of this class of operations are listed below (an illustrative example follows the list):
- map(op, array[, operand])
Apply a function (or monoid) op to each element of the array. The operation may use an operand, which is however optional. An example of such an operation would be a multiplication of each element of the array with a given operand.
- zip_with(op, array1, array2)
Apply a function (or monoid) op to corresponding elements of array1 and array2, i.e. op(array1[i], array2[i]), for all i.
- gather(array, indices)
Produce a segmented array of values gathered from the flat input array by the indices (i.e. [array[indices[0]], array[indices[1]], ...]). In this case, the segment structure of the resulting array is the same as the segment structure of the indices.
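For illustration only, on the running example (array of elements [1, 1, 2, 3, 5, 8], segment offsets [0, 3, 5]) these operations might behave as follows; the Negate and Add monoids are assumed purely for the sake of the example:

// map(Negate, array)                  -> {[-1, -1, -2], [-3, -5], [-8]}
// zip_with(Add, array, array)         -> {[2, 2, 4], [6, 10], [16]}
// gather([10, 20, 30], {[0, 2], [1]}) -> {[10, 30], [20]}
//   (the result has the segment structure of the indices argument)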
As noted above, one property of this class of operations is that the segment information isn't needed for computing the result of the operations. Therefore, this class of operations facilitates an essentially equal distribution of workload between threads. Fig. 2 shows a flow chart of performing an element-wise operation on elements of a segmented array according to an exemplary embodiment of the invention.
The computer receives 201 a command which includes at least one segmented array having an array of elements and an array of segment offsets, and the element-wise operation to be performed on the array of elements of the segmented array. Returning to the example mentioned above, a nested array {[1, 1, 2], [3, 5], [8]} would be represented by an array of elements [1, 1, 2, 3, 5, 8] and an array of segment offsets [0, 3, 5]. As the operation is an element-wise operation, the computer can ignore the information in the array of segment offsets.
The computer divides 202 the elements of the segmented array (substantially) equally into a given number of chunks. Assume for exemplary purposes the array of elements [1, 1, 2, 3, 5, 8] and two chunks to be formed: a first chunk [1, 1, 2] and a second chunk [3, 5, 8]. Each of the chunks corresponds to a thread.
Furthermore, the computer executes 203 the element-wise operation on the elements of each chunk in a respective thread. For example, if the operation is a multiply operation that multiplies each element of the segmented array with a given operand, then in each thread the operand is multiplied with each element of the respective chunk. In the respective thread, the operation may be performed on the elements of the chunk sequentially.
The results of applying the operation to the respective elements are stored 204 in an array of results. In the example of a multiply operation that multiplies each element of the segmented array with a given operand, the structure of the array of results would be identical to the array of elements, and the segment offsets would still indicate the segment boundaries within the array of results. Since the operation is element-wise, no race conditions can occur in writing the result to the given position of the array of results. For example, when multiplying each element of the segmented array having an array of elements [1, 1, 2, 3, 5, 8] and an array of segment offsets [0, 3, 5] by 3, the result is a segmented array with an array of elements [3, 3, 6, 9, 15, 24] and an array of segment offsets [0, 3, 5].
Note that for element-wise operations no intermediate data structures, such as an auxiliary array of results, are required.
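A minimal sketch of this element-wise scheme, assuming standard C++ threads and the illustrative make_chunks helper from above (neither of which is prescribed by the method), could look as follows:

#include <thread>
#include <vector>

// Element-wise map: each thread applies op to its own chunk and writes the
// results to the corresponding positions of the array of results; no
// synchronization is needed because the writes never overlap.
template <typename Op>
std::vector<int> map_chunked(const std::vector<int>& elements, Op op, int c) {
    std::vector<int> results(elements.size());
    std::vector<std::thread> threads;
    for (const auto& ch : make_chunks((int)elements.size(), c)) {
        int begin = ch.first, end = ch.second;
        threads.emplace_back([&results, &elements, &op, begin, end] {
            for (int i = begin; i < end; ++i)
                results[i] = op(elements[i]);
        });
    }
    for (auto& t : threads) t.join();
    return results;
}

// Usage: multiplying [1, 1, 2, 3, 5, 8] by 3 in two threads yields
// [3, 3, 6, 9, 15, 24], as in the example above:
//   map_chunked(elems, [](int x) { return 3 * x; }, 2);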
In the following, an example of the use of element-wise operations in a sparse matrix by dense vector multiplication (SMVM) function is discussed for exemplary purposes. Note that the following example uses element-wise operations (gather, zip_with) and a commutative operation (reduce_seg), which is a segment-dependent operation (see below). The sparse matrix is represented in compressed sparse row (CSR) format, i.e. the matrix is an array of sparse vectors, and a sparse vector is an array of pairs (index of non-zero value, the value), as exemplarily shown in Fig. 10.
An exemplary multiplication algorithm represented in terms of operations upon segmented arrays could look as follows in a C++-like language:

array<double> smvm(array<array<int>> m_indices,
                   array<array<double>> m_values,
                   array<double> vector) {
    array<array<double>> products =
        zip_with(Multiply, m_values, gather(vector, m_indices));
    return reduce_seg(Add, products);
}
Non-zero matrix elements are represented by two segmented arrays: a nested array of the column numbers of non-zero elements in each matrix row (m_indices), and a nested array of the values of non-zero matrix elements in each matrix row (m_values).
The element-wise operation gather builds a nested array of values taken from the vector by indices of non-zero elements of the matrix (in each row). The gather operation may be implemented as a data parallel operation in different threads, as discussed in connection with Fig. 2 above.
The element-wise operation zip_with may also be implemented in a data-parallel fashion across multiple threads as discussed in connection with Fig. 2 above. Moreover, an intelligent runtime system would defer the execution of this operation up to the point of parallel execution of reduce_seg and fuse the multiplication with the accessing of the values for the Add monoid. This would allow avoiding the construction of an intermediate array for products. To the contrary, the segment-dependent operation reduce_seg is based on the segment information and can't be parallelized as easily (see below).
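For illustration only, the data flow of the smvm function can be traced on a small hypothetical 2x3 matrix; the concrete values below are assumptions made for this example:

// Matrix:                   [ 1 0 2 ]
//                           [ 0 0 3 ]
// m_indices = {[0, 2], [2]}          // column numbers of non-zeros per row
// m_values  = {[1.0, 2.0], [3.0]}    // non-zero values per row
// vector    = [1.0, 2.0, 3.0]
//
// gather(vector, m_indices)           -> {[1.0, 3.0], [3.0]}
// zip_with(Multiply, m_values, above) -> {[1.0, 6.0], [9.0]}
// reduce_seg(Add, products)           -> [7.0, 9.0]  (= matrix * vector)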
Segment-Dependent Operations
Fundamentally different from the element-wise operations upon segmented arrays are operations that restructure the data of the entire segmented array or its individual segments, e.g. combine some values (of a given segment) in one way or another, and thus depend on the segment information. These operations are referred to as segment-dependent operations herein. Exemplary and non-limiting representatives of such operations are listed below (an illustrative example follows the list):
- reduce(op, array)
Reduce the whole segmented array using function (or monoid) op and produce a single value.
- reduce_seg(op, array)
Reduce each segment of the segmented array using function (or monoid) op and produce a flat array of results, one for each segment.
- scan_seg(op, array)
Produce a segmented array of arrays of partial reductions using function (or monoid) op inside each segment of the array.
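For illustration only, applied with an Add monoid to the running example {[1, 1, 2], [3, 5], [8]}, these operations would produce:

// reduce(Add, array)     -> 16                       (one value for the whole array)
// reduce_seg(Add, array) -> [4, 8, 8]                (one value per segment)
// scan_seg(Add, array)   -> {[1, 2, 4], [3, 8], [8]} (partial sums inside each segment)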
Segment-dependent operations map each segment to some result, as exemplified in Fig. 3: the 1st to 6th segments of the segmented array may have one or more elements each, which are mapped to some result per segment that is stored in the array of results. The result of a segment-dependent operation could be a single combined result for all segments (e.g. the reduce operation), a single result for each segment (e.g. the reduce_seg operation, see Fig. 3), or an array of results for each segment (e.g. the scan_seg operation).
In embodiments of the invention, irrespective of whether the operation is an element-wise operation or a segment-dependent operation, the array of elements of the segmented array is first partitioned into a given number of chunks. In an exemplary implementation, the partitioning ignores information on the segment boundaries (as provided in the array of segment offsets) and ideally partitions all values of the array of elements equally into chunks. This is exemplified in Fig. 4, where the elements of the 1st to 6th segments of the segmented array are distributed equally (three elements each) into 6 chunks. Note that the number of chunks and segments is equal here, but the numbers may of course also differ from each other. Each thread then processes a respective one of the chunks completely independently from other chunks and other threads.
In case of a segment-dependent operation, two problems are to be solved in order to allow independent processing of chunks. Firstly, for a given chunk a thread should have a way to know which segments intersect with the chunk. More precisely, for the execution of the operation on the elements of a chunk within a thread/process, the thread/process should know the elements of which segments are comprised in the respective chunk. This may be necessary in order to know where to store the result computed for a segment. In addition, knowing the segments intersecting with a given chunk may also be necessary if the operation directly depends on information on the segment boundaries (for example, an operation could map each value to the number of the corresponding segment). How the computer may find segments intersecting with a given chunk will be described herein below in further detail.
Secondly, for segments divided between several chunks (see for example the 1st and 5th segment in Figs. 5 and 6), special measures may be taken to combine the results obtained in different threads. For each chunk containing elements of a given segment (i.e. the elements are split across multiple chunks), an intermediate result is computed in a respective thread and needs to be combined with other partial result(s) for the segment that are computed in one or more other threads. The way to treat this situation differs depending on whether the segment-dependent operation to be implemented is commutative and has an identity element.
As regards segment-dependent operations, two classes are distinguished in the following: operations which are commutative and have an identity element, and operations which are not commutative and/or do not have an identity element.
Commutative Operations with Identity Element
Fig. 8 shows a flow chart of the operation of a computer system according to an exemplary embodiment, in case the operation to be performed on the elements of the segmented array is commutative and has an identity element, i.e. the corresponding op function forms a commutative monoid. Fig. 4 exemplifies the "mapping" of the - in some instances partial - results to corresponding locations in the array of results, in accordance with the flow chart of Fig. 8.
In a first step, the array of results is initialized 801 with the identity element of the operation to be performed on the elements of the segmented array. Next, similar to the procedure described in connection with Figs. 1 and 2 (see steps 102, 202), the elements in the segmented array are distributed into distinct chunks, e.g. by splitting 802 them. This distribution may not take into account to which segment the respective value of the segmented array belongs, i.e. the segment information may be ignored. Each chunk is executed in a respective thread.
Next, the computer system checks 803 for each chunk (as indicated by the multiple boxes of step 803) whether it has segments intersecting with the boundary of the respective chunk, and identifies 803 the respective segments contained in the respective chunk. In other words, for each given chunk the computer system determines the set S of segments intersecting with the chunk. In the example of Fig. 4, the elements of the 1st to 6th segment are distributed to a total of six chunks. The 1st chunk only contains some but not all elements of the 1st segment, so that the set S for the 1st chunk only contains the 1st segment. The 2nd chunk contains the last element of the 1st segment and the two elements of the 2nd segment, so that the set S for the 2nd chunk contains the 1st segment and the 2nd segment. One possible implementation of a process to identify which elements of which segments are contained in the respective chunk will be outlined below in further detail.
The next steps 804 to 809 are performed for each segment within a given chunk and for each chunk. In case the respective chunk comprises all elements of a segment (complete segment), the computer system applies 805 the operation sequentially to the elements in the segment. Once the result of the operation is calculated for the elements of the segment within the respective chunk, the result is written to the corresponding location in the array of results. Steps 805 and 806 are performed for all complete segments of set S of the respective chunk.
In case the elements of the segment are split across multiple chunks, i.e. the respective chunk does not contain all elements of a segment (split segment), the execution of the operation is performed according to steps 807 to 810. In case the operation is not an atomic operation, then the operation is converted 807 into an atomic operation. The use of atomic operations may be advantageous to avoid race conditions in writing the results.
The operation is applied 808 to each of the elements of the respective segment within the given chunk to compute a partial result on the elements of the segment comprised in the respective chunk (note that the order of steps 807 and 808 could be also interchanged). Upon having finished the execution of the operation on the elements of a segment comprised in the respective chunk, the partial result is combined with another partial result of the execution of the operation on other elements of the same segment (in another thread) or the identity element (from initialization 801) stored in a corresponding place in the array of results. This is achieved by the atomic step 809 by applying an atomic (version of the) operation to the partial result obtained in the respective thread and a previously calculated partial result for the same segment or the identity element stored in the array of results thereby obtaining the "final result" of the operation and then storing the "final result" in the array of results. Note that the entire step is atomic. "Final result" here denotes the two combined partial results, respectively, the combination of the partial result in the given thread and the identity element. Apparently, the result of the operation for a given segment stored in the array of results is final once all elements of the segment have been processed in the chunks (threads) across which the segment got split.
For operations that don't have dedicated atomic CPU instructions, atomic counterparts can be constructed as functions using CPU instructions for atomic memory access such as compare-and-swap (CAS).
The exemplary sample C/C++ code below illustrates how an atomic counterpart for a given non-atomic integer function op can be implemented using a compiler intrinsic for the CAS instruction (note that this implementation doesn't address the ABA problem):
void atomic_op(int* p, int arg) {
    int v = *p;                 // read the current value
    int v1 = op(v, arg);        // compute a tentative result
    // Retry until the CAS succeeds, i.e. *p still held v when v1 was swapped in.
    while ((v1 = __sync_val_compare_and_swap(p, v, v1)) != v) {
        v = v1;                 // CAS returned the fresh value of *p
        v1 = op(v, arg);        // recompute with the fresh value
    }
}
Being analogous to an atomic operation, the function atomic_op loads a value from the memory location pointed to by argument p, and then applies the function op to the pair of the loaded value and the input argument arg. Finally, the function stores the result back in the memory location pointed to by pointer p in an atomic way. Notably, in this example implementation, the number of atomic operations (which are typically less efficient than their non-atomic counterparts, e.g. non-atomic store operations, in terms of execution time) doesn't depend on the size of the data and is proportional to the number of chunks.
Furthermore, no intermediate data structures, such as an auxiliary array of results, are required due to the commutative nature of the operation.
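A condensed sketch of the per-chunk processing of Fig. 8 for an integer Add monoid (identity 0) is given below. It reuses the illustrative SegmentedArray and chunk layout from the sketches above and uses std::atomic as a stand-in for the atomic counterpart of the operation; none of these choices is prescribed by the method:

#include <algorithm>
#include <atomic>
#include <vector>

// Process one chunk [begin, end) for a commutative reduce_seg with Add;
// results must be pre-initialized with the identity element (0), step 801.
void reduce_seg_add_chunk(const SegmentedArray& a, int begin, int end,
                          std::vector<std::atomic<int>>& results) {
    int nseg = (int)a.segment_offsets.size();
    // Find the first segment intersecting the chunk (step 803; in practice
    // a binary search in the offsets would be used, see below).
    int s = 0;
    while (s + 1 < nseg && a.segment_offsets[s + 1] <= begin) ++s;
    for (; s < nseg && a.segment_begin(s) < end; ++s) {
        int lo = std::max(begin, a.segment_begin(s));
        int hi = std::min(end, a.segment_end(s));
        int partial = 0;  // identity element of Add
        for (int i = lo; i < hi; ++i) partial += a.elements[i];
        if (lo == a.segment_begin(s) && hi == a.segment_end(s))
            results[s].store(partial);      // complete segment: plain store
        else
            results[s].fetch_add(partial);  // split segment: atomic combine
    }
}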
Non-Commutative Operations and Operations without Identity Element
Fig. 9 shows a flow chart of a procedure in a computer system according to an exemplary embodiment, in case the operation to be performed on the elements of the segmented array is non-commutative and/or has no identity element. In this case, the execution of the operation on the elements of the respective segments contained in the respective chunk is - to a large extent - similar to the processing described in connection with Fig. 8, as indicated by the use of the same reference numerals in Fig. 9. Notably, no initialization is needed in the procedure of Fig. 9 due to the last synchronization step 904 (as will become more apparent from the following).
A further difference to the procedure in Fig. 8 is the operation of the computer system upon having calculated the partial results of the operations performed on the elements of a split segment within different threads. Unlike in Fig. 8, the intermediate result for a given split segment in each thread cannot be combined with the data in the array of results (see step 809 in Fig. 8) due to the non-commutative nature of the operation and/or the lack of an identity element. Hence, in the procedure of Fig. 9, all threads that are processing elements of a given segment need to finish their calculation before the final result can be obtained by combining the partial results for the split segment. For this reason, also no atomic operations are necessary, as no race conditions in accessing a segment's partial result (or the identity element) in the array of results by different threads occur.
The partial results for a split segment are computed in the different threads, and then the operation is performed again on the partial results for the segment to obtain the final result. As the partial results may thus not be directly written into the array of results, an auxiliary data structure to temporarily store the partial results is provided (one of the partial results may also be stored in the array of results, as in the specific implementation of Fig. 9, but this is optional). Note that the size of the auxiliary array is proportional to the number of chunks (which may depend on hardware features such as the number of CPU cores) and doesn't depend on the size of the data. In case the operation is not commutative, the order in processing the partial results is decisive for the final result of the operation, so that the synchronization step (see step 904) needs to know which partial result refers to which elements of the split segment.
In the exemplary implementation of Fig. 9, step 901 checks whether, within a given thread (i.e. for a chunk), a partial result has been computed on the first elements of the split segment, or not. If this is the case (yes), then the partial result of the operation is stored 902 in the array of results at the corresponding location for the segment. If this is not the case (no), the partial result of the operation is stored 903 in an auxiliary array of results. Once all partial results for a split segment have been computed in the respective threads (chunks), the operation is computed 904 again on the partial results within the array of results and the auxiliary array of results in the order prescribed by the operation (for example, left-to-right for forward reduction and right-to-left for backward reduction). This step 904 is also referred to as a synchronization step herein.
In the following, an exemplary non-commutative operation with no identity element and its execution is considered and outlined with respect to Fig. 5. It is assumed for exemplary purposes only that the operation is to find an index of the minimum value in each segment within an unsorted segmented array of integers. In case of several equal minimum values the leftmost value is to be taken.
As exemplarily shown in Fig. 5, the segmented array has six segments and is split in six chunks (similar to Fig. 3 and Fig. 4). The exemplary segmented array illustrated in Fig. 5 can be represented as {[2, 3, 1, 1], [4, 2], [1, 2, 1], [2, 4], [1, 2, 5, 3, 1], [3, 6]} or, alternatively, as an array of elements [2, 3, 1, 1, 4, 2, 1, 2, 1, 2, 4, 1, 2, 5, 3, 1, 3, 6] and corresponding segment boundaries [0, 4, 6, 9, 11, 16] given by the indices of the elements in the array of elements. The indices of the elements within the respective segments are shown in Fig. 5 (starting with 0).
In order to implement this desired functionality, a binary operation is used which chooses the one of two indices i and j for which the value is minimum (selecting the "left" one, i, if the values are identical). The binary operation may for example be implemented as follows:

int ChooseMinIdx(int i, int j) {
    return (values[j] < values[i]) ? j : i;
}
and would be sequentially applied to the elements of a split segment in the chunk from left to right, where the returned value of ChooseMinIdx is used as the left index i for the next element. Simplified, the processing of the elements within a split segment of a given chunk could be something as simple as:

reduce_seg(ChooseMinIdx, array);

(we gloss over the fact that ChooseMinIdx (choose minimum index) takes indices and not values as arguments, in order to crystallize the idea).
If the operation is implemented in a way similar to that described above for the case of commutative operations with identity element (storing partial results using atomic operations), then the result can be incorrect. Considering for example the 1st segment in Fig. 5, which is split across the 1st and 2nd chunk, the result of the atomic operation to find the (leftmost) index of the smallest value in the segment would depend on whether the elements of the 1st chunk or the 2nd chunk are processed and written to the results first. A similar problem exists for the processing of the 5th segment, the elements of which are split across the 4th, 5th and 6th chunk.
This problem is solved by using an auxiliary array of results (as exemplified in Fig. 5 and Fig. 6). The auxiliary array of results stores 903 the partial results for each of the split segments instead of using atomic operations, except the partial result for the first elements of the segment (step 902 in Fig. 9; see also Fig. 5). Once all threads that contain elements of a given segment have finished processing, the final result of the operation on the elements of the segment is computed in step 904 by applying the same operation (e.g. ChooseMinIdx) again, this time to the partial results in the array of results and the auxiliary array of results, and again in left-to-right order of the chunks into which the elements of the segment were split.
This synchronization step is exemplarily illustrated in Fig. 6. As the elements of the 1st segment in Fig. 5 are split across the 1st and 2nd chunk, there is one partial result in the array of results, (2, 1) - denoting a value of 1 at index 2 of the 1st segment - and one partial result in the auxiliary array of results, (3, 1) - denoting a value of 1 at index 3 of the 1st segment. Applying the ChooseMinIdx operation to the partial results [1, 1] with indices 2 and 3, the final result is 2 (the leftmost index of the minimum value 1). Similarly, for the elements of the 5th segment, which are split across the 4th, 5th and 6th chunk, three partial results (0, 2), (1, 2) and (4, 1) exist. Applying ChooseMinIdx to the partial results [2, 2, 1] with indices 0, 1 and 4, the final result is 4, which is the index of the lowest value 1 in the 5th segment.
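The combination of the partial results of the 5th segment in step 904 can be written out by hand as in the following sketch; representing the partial results as (index, value) pairs is an assumption made here purely for readability:

#include <cstddef>
#include <utility>
#include <vector>

// ChooseMinIdx on (index, value) pairs; the leftmost pair wins on ties.
std::pair<int, int> choose_min(std::pair<int, int> a, std::pair<int, int> b) {
    return (b.second < a.second) ? b : a;
}

// Partial results of the 5th segment in chunk order (4th, 5th, 6th chunk).
std::pair<int, int> combine_segment5() {
    std::vector<std::pair<int, int>> partials = {{0, 2}, {1, 2}, {4, 1}};
    std::pair<int, int> result = partials[0];
    for (std::size_t k = 1; k < partials.size(); ++k)
        result = choose_min(result, partials[k]);
    return result;  // (4, 1): the final index is 4
}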
Finding segments intersecting with a given chunk
As noted above in steps 803 of Figs. 8 and 9, the segments intersecting with a given chunk need to be identified. One possibility to find the chunks with which a segment intersects, and whether the first (leftmost) elements of the segment are contained in a given chunk, is the use of a binary search, which may be executed by the computer system. To find the segment containing the left boundary of a given chunk, a binary search for the boundary index in the array of segment offsets is used. Segment offsets are associated with each segmented array. In essence, the binary search finds the maximal segment offset which is not larger than the index of the left boundary of the chunk. Exemplary C++ code for a function performing such a binary search of a place for the index in the array of offsets is shown below:
int binary_search(int index, int const* offsets, int lb, int rb) {
    // Finds the maximal position in offsets[lb..rb] whose value is
    // not larger than index.
    while (lb < rb) {
        int mid = (lb + rb + 1) / 2;
        if (index < offsets[mid]) rb = mid - 1;
        else lb = mid;
    }
    return lb;
}
The binary search finds the place for an index in an array of offsets between a left boundary lb and a right boundary rb. Once a segment intersecting with a given chunk is known, it is easy to determine whether its right boundary also belongs to the chunk (for example, one can compare the offset of the next segment with the right boundary of the chunk). Thus, when processing segments from left to right, it is known when the current segment is the last one in the chunk.
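Based on this, the segments intersecting a given chunk can be enumerated, for example, as in the following sketch (the loop structure and the names are assumptions of this sketch; binary_search is the function shown above):

// Visits every segment intersecting the chunk [chunk_lb, chunk_rb]
// (inclusive element indices); num_segs is the number of offsets.
void for_each_segment_in_chunk(int const* offsets, int num_segs,
                               int chunk_lb, int chunk_rb) {
    int seg = binary_search(chunk_lb, offsets, 0, num_segs - 1);
    for (; seg < num_segs && offsets[seg] <= chunk_rb; ++seg) {
        // The chunk holds the segment's first elements exactly when the
        // segment starts at or after the chunk's left boundary.
        bool starts_here = (offsets[seg] >= chunk_lb);
        (void)starts_here;  // ... process the segment's elements in the chunk ...
    }
}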
Note that a binary search has logarithmic complexity in the size of the array. Thus, the overhead for finding segments is proportional to the logarithm of the number of segments for each chunk. As the number of chunks may be greater than the logarithm of the number of segments, one can consider the latter to be a constant and the overheads to be proportional to the number of chunks. Note that although the binary search is extra workload in executing the operation on the segmented array, this extra work can be performed in parallel, e.g. on different CPU cores.
The above binary search will be exemplified in the following, assuming a segmented array [[1, 1, 2], [3, 5], [8]]. The segmented array thus has offsets [0, 3, 5]. The three segments are numbered 0, 1, and 2, and the elements are partitioned into three chunks of two elements each: 1st chunk = [1, 1], 2nd chunk = [2, 3], 3rd chunk = [5, 8]. The left boundary of the 1st chunk is index 0, so it is immediately known that the first segment for this chunk is segment 0. The next segment starts at offset 3, which is greater than the end of the 1st chunk (index 1), so the 1st chunk intersects with segment 0 only.
The left boundary of the 2nd chunk is index 2. Finding the location for index 2 within the offsets [0, 3, 5] with the binary search yields index 0, so the 2nd chunk starts in segment 0. Since the leftmost index 3 of the next segment 1 is identical to the right boundary 3 of the 2nd chunk, and the next segment 2 only starts at index 5, segment 1 starts in the 2nd chunk and ends after it, so the 2nd chunk intersects with segments 0 and 1.
The left boundary of the 3rd chunk is index 4. Finding the location for index 4 in the offsets [0, 3, 5] with the binary search yields index 1 (offset 3), so the 3rd chunk starts in segment 1. The next segment 2 starts and ends in the 3rd chunk, so the 3rd chunk intersects with segments 1 and 2.
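These three lookups correspond to the following calls of the binary_search function shown above (the wrapper function exists only for illustration):

void example() {
    int offsets[] = {0, 3, 5};
    int s1 = binary_search(0, offsets, 0, 2);  // 0: 1st chunk starts in segment 0
    int s2 = binary_search(2, offsets, 0, 2);  // 0: 2nd chunk starts in segment 0
    int s3 = binary_search(4, offsets, 0, 2);  // 1: 3rd chunk starts in segment 1
    (void)s1; (void)s2; (void)s3;
}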
Note that algorithms other than a binary search may also be used for identifying the segments intersecting a given chunk and for identifying which chunk contains the first elements of a segment.
Although certain applications of the ideas outlined herein may limit the type of operation to be executed in the computer system to one of the two options above (element-wise operations; segment-dependent operations, including its subcases), it is also possible to mix these kinds of operations in a computer system. Fig. 7 illustrates a flow chart for executing an operation on a segmented array according to an embodiment, where the operation can be any element-wise operation or segment-dependent operation. In Fig. 7, upon receiving 701 a command to perform a given operation on a segmented array, the computer system will next determine 702 the type of operation to be executed. Depending on this determination, different procedures for executing the operation are executed. For example, if the operation is an element-wise operation, steps 202, 203 and 204 (which are similar to those in Fig. 2) are performed by the computer system. If the operation is commutative and has an identity element, the procedure outlined above in connection with Fig. 8 is performed by the computer system. Otherwise, i.e. if the operation is non-commutative and/or has no identity element, the procedure outlined above in connection with Fig. 9 is performed by the computer system.
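The dispatch of Fig. 7 may be sketched as follows; the enumeration and the classification of the operation into one of the three cases are assumed helpers, not named in the text:

// Step 702 yields one of three cases; steps 202-204 and the procedures of
// Figs. 8 and 9 would be invoked in the respective branches.
enum class OpKind { ElementWise, CommutativeWithIdentity, Other };

void execute_on_segmented_array(OpKind kind) {
    switch (kind) {
        case OpKind::ElementWise:             /* steps 202, 203, 204 (Fig. 2) */ break;
        case OpKind::CommutativeWithIdentity: /* procedure of Fig. 8 */ break;
        case OpKind::Other:                   /* procedure of Fig. 9 */ break;
    }
}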
The methods for executing operations on the elements of a segmented array, as discussed in the different embodiments herein above, may for example be implemented in a compiler running in a computer system. The operations executed on the elements of the segmented array may be considered primitives within the "interface" provided by the compiler towards the application programmer. Fig. 11 shows an exemplary implementation of the inventive methods in a computer system. The compiler 1102 receives the source code 1101 and identifies the operations (primitives) that are to be executed, as well as the arguments of the respective operations. Each of the operations that are to be executed on a segmented array (optionally using additional arguments for the operation) is parallelized for execution in plural threads by the compiler 1102, using for example the methods shown in one of Figs. 2, 7, 8 and 9.
The different chunks obtained by this data parallelization are provided to a scheduler 1103. The scheduler 1103 provides the chunks to the hardware platform 1104 where the operation is performed on the elements of each chunk in a separate thread. The hardware platform 1104 may for example include one or more CPUs which may each have one or more cores on which the different threads can be executed.
The scheduler 1103 may be a static scheduler that tries to equally distribute the chunks to the different CPUs or their respective cores. The data-parallel processing of operations working with irregular data structures can be implemented efficiently in multi-core processor environments. The processing of operations can be divided into different threads (one thread per chunk) and thus the workload can be easily divided between CPU cores based on the upper level of a data structure to be processed.
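A minimal sketch of such static scheduling with one thread per equally sized chunk is given below; the callable process_chunk and the chunk arithmetic are assumptions of this sketch:

#include <algorithm>
#include <thread>
#include <vector>

// Starts one thread per chunk of (approximately) equal size and waits
// for all of them to finish.
template <typename F>
void run_chunks_statically(F process_chunk, int num_elements, int num_chunks) {
    std::vector<std::thread> threads;
    int chunk_size = (num_elements + num_chunks - 1) / num_chunks;
    for (int c = 0; c < num_chunks; ++c) {
        int lb = c * chunk_size;
        int rb = std::min(num_elements, lb + chunk_size);
        if (lb >= rb) break;
        threads.emplace_back(process_chunk, lb, rb);  // one thread per chunk
    }
    for (auto& t : threads) t.join();
}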
The scheduler 1103 may also be a dynamic scheduler that uses a dynamic scheduling algorithm for distributing the chunks to the different CPUs or their respective cores for their execution. Using a dynamic scheduler in conjunction with data parallelization, as described above, may allow for a smaller number of scheduling units (low overheads), while inside each unit the utilization of the computing power of the CPU cores can be high due to data parallelization.
Fig. 12 shows the results of a performance comparison of an SMVM algorithm implemented in C++ using the OpenMP library against a compiler implementation using the exemplary SMVM algorithm discussed above and realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9. The data irregularity parameter controls the distribution of the lengths of the matrix rows: close to 0, all lengths are approximately the same; the larger the value of the parameter, the longer some rows become and the more rows are very short (the number of non-zero elements is constant across the experiments). The graph in Fig. 12 has been obtained in a double-CPU Intel Xeon X5650 configuration using 24 threads. Other thread numbers give approximately the same picture.
The same SMVM algorithm was also compared with a Data Parallel Haskell (DPH) system. The SMVM implemented in DPH is about 30 times slower than the exemplary SMVM implementation proposed above. The main advantage of the proposed SMVM implementation over the DPH system is the efficient management of the resulting data: in case the SMVM operations are commutative, the results are written directly to the pre-allocated results array using atomic operations on chunk boundaries. In case of DPH this is not possible due to the functional nature of the language.
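For a commutative operation such as addition, the direct write into the pre-allocated results array could look like the following sketch (the use of std::atomic and the relaxed memory order are assumptions made for this illustration):

#include <atomic>
#include <vector>

// Combines a chunk's partial sum for a split segment into the
// pre-allocated results array; the order in which chunks finish
// does not affect the result because addition is commutative.
void accumulate(std::vector<std::atomic<int>>& results, int segment, int partial) {
    results[segment].fetch_add(partial, std::memory_order_relaxed);
}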
Fig. 13 shows a performance comparison of a compiler implementation for the Green-Marl programming language realizing data parallelization according to the examples in Figs. 2, 7, 8 and 9 (named HiGraph) against a standard compiler for the Green-Marl programming language. Green-Marl is a domain-specific language (DSL) aiming for easy and efficient graph processing (see http://ppl.stanford.edu/main/green_marl.html). The comparison was done using a double-CPU Intel Xeon X5650 configuration. As shown in Fig. 13, when using a compiler implementation based on data parallelization as discussed herein, a speedup of approx. 10 times can be achieved in comparison to the use of a standard Green-Marl compiler provided by Stanford University.
As noted above, one embodiment provides a computer system that allows for the execution of operations on segmented arrays according to the embodiments and flow charts discussed herein above. An exemplary computer system is shown in Fig. 14. Such a computer system is for example a computing device and may comprise different functional blocks, such as for example a processor 1403, which may have one or - advantageously - multiple cores. Alternatively, the computing device may also have multiple processors, which may have one or more cores each. The processor may also be considered an execution unit 1404. The computing device may also have a memory 1401 for (temporarily) storing data related to the processing of the operations. For example, the memory could store the segmented array to which the operation is applied, the array of results, as well as the optional intermediate data structures, e.g. the auxiliary array of results, and/or the source code 1402 to be processed by a compiler implementation of the invention. Furthermore, the memory could also store the results of the identification of segments intersecting with a given chunk and information indicating whether a given segment starts in a given chunk. Distinct portions of the memory may be assigned to the individual threads in which the operation is executed on the elements of a given chunk. The computing device may also comprise a data bus (not shown) to communicate data between the processor(s) (or its/their cores) and the memory. Furthermore, the computing device may also comprise a compiler 1405, which has a receiving unit 1406 for receiving, for example, the command and its arguments, or the source code from memory 1401. Furthermore, the compiler 1405 may also have a distribution unit 1407 which distributes the elements of the segmented array into chunks for execution in different threads by the processor 1403. The computing system may also have a scheduler 1408 which is responsible for scheduling the execution of the operation on the elements within each chunk in a respective thread.
The computing device may be implemented in a variety of devices, including a server, personal computer, laptop computer, netbook computer, tablet computer, a mobile phone, a smart phone, etc.
Although some aspects have been described in the context of a method, it is clear that these aspects also represent a description of the corresponding apparatus suitably adapted to perform such method. In such an apparatus a (functional or tangible) block may correspond to one or more method steps or to a feature of a method step. Analogously, aspects described in the context of a corresponding block or item or feature of a corresponding apparatus may also correspond to individual method steps of a corresponding method.
Furthermore, the methods described herein may also be executed by (or using) a hardware apparatus, like processor(s), microprocessor(s), a programmable computer or an electronic circuit. One or more of the most important method steps may be executed by such an apparatus. Where an apparatus has been described herein in terms of functional blocks, it should be further understood that those elements of the apparatus may be fully or partly implemented in hardware elements/circuitry. Individual hardware, like processor(s) or microprocessor(s), etc., may be used to implement the functionality of one or more elements of the apparatus.
In addition, where information or data is to be stored in the process of implementing a method step or a functional element of an apparatus in hardware, the apparatus may comprise a memory or storage medium, which may be communicatively coupled to one or more hardware elements/circuitry of the apparatus.
It is also contemplated to implement the aspects of the invention in hardware or in software or a combination thereof. This may be done using a digital storage medium, for example a floppy disk, a DVD, a Blu-Ray disc, a CD, a ROM, a PROM, an EPROM, an EEPROM or a FLASH memory, having electronically readable control signals or instructions stored thereon, which cooperate (or are capable of cooperating) with a programmable computer system such that the respective method is performed. A data carrier may be provided which has electronically readable control signals or instructions, which are capable of cooperating with a programmable computer system, such that the method described herein is performed.
It is also contemplated to implement the aspects of the invention in the form of a computer program product with a program code, the program code being operative for performing the method when the computer program product runs on a computer. The program code may be stored on a machine readable carrier.
The above description is merely illustrative, and it is understood that modifications and variations of the arrangements and the details described herein will be apparent to others skilled in the art. It is the intent, therefore, to be limited only by the scope of the appended claims and not by the specific details presented by way of the description and explanation above.
Claims
1. A method comprising the following steps performed by a computer:
receiving a command to execute an operation on an argument represented by a segmented array, wherein the segmented array has an array of elements and an array of segment indicators, wherein the array of elements comprises elements of plural segments, and the array of segment indicators is indicative of which one or more elements belong to which one of the segments, wherein the size of the data in the array of segment indicators is proportional to the number of segments in said segmented array;
distributing the elements of the segmented array into plural chunks of elements for execution in respective threads, wherein each chunk comprises a number of elements of one or more of the segments; and
executing the operation on the elements of each chunk of elements in respective threads.
2. The method according to claim 1, wherein at least some of said respective threads are executed by the computer in parallel.
3. The method according to claim 1 or 2, wherein the distribution of elements of the segmented array is performed independent of the belonging of the elements to the respective segments.
4. The method according to one of claims 1 to 3, wherein each chunk of elements has approximately the same number of elements.
5. The method according to one of claims 1 to 4, wherein the elements of the segmented array are distributed into the plural chunks of elements by dividing the array of elements.
6. The method according to one of claims 1 to 5, wherein the elements of each segment are arranged consecutively in the array of elements.
7. The method according to one of claims 1 to 6, wherein the respective segment indicators indicate the respective boundaries of the segments in the array of elements by means of offset indices.
8. The method according to one of claims 1 to 7, wherein in case the operation is an element-wise operation, executing the operation on the elements of each respective chunk of elements in a respective thread comprises:
sequentially executing the element-wise operation on each element of the respective chunk of elements; and
writing the result of the element-wise operation to an array of results.
9. The method according to one of claims 1 to 8, wherein in case the operation is commutative and has an identity element, executing the operation on the elements of each chunk of elements in a respective thread comprises:
identifying, based on the segment indicators of the segmented array, the elements of the respective chunk of elements belonging to the same segment; and for each of the segments of which elements are comprised in the respective chunk of elements, executing the operation on the elements of the respective chunk of elements belonging to the respective segment to obtain a partial result, executing the operation on the partial result and on another partial result or identity element corresponding to the respective segment in an array of results to obtain a final result, and writing the final result of the operation executed on the elements of the respective segment to said array of results.
10. The method according to claim 9, further comprising the step of converting the operation into an atomic operation, in case the operation is not atomic; wherein the atomic operation is executed on the partial result and on said other partial result or identity element corresponding to the respective segment in the array of results.
11. The method according to one of claims 1 to 10, wherein in case the operation is not commutative or has no identity element, executing the operation on the elements of each chunk of elements in a respective thread comprises:
identifying, based on the segment indicators of the segmented array, the elements of the chunk of elements belonging to the same segment;
identifying for each segment of which one or more elements are comprised in the respective chunk of elements and based on the segment indicators of the segmented array, whether the elements of the segment have been divided across multiple chunks or not, and if so, whether the one or more elements of the respective segment comprised in the respective chunk of elements correspond to the first one or more elements of the segment or not;
for each of the segments of which one or more elements are comprised in the
respective chunk of elements:
- if all elements of the respective segment are comprised in the chunk of elements, executing the operation on the elements of the respective segment, and writing the result of the operation to a corresponding element in an array of results,
- if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements comprises the first elements of the respective segment, executing the operation on the first elements of the respective segment to obtain a partial result of the operation for the respective segment, and writing the partial result to a corresponding element in the array of results, and
- if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements does not comprise the first elements of the respective segment, executing the operation on the elements of the respective segment to obtain a partial result of the operation for the respective segment, and writing the partial result to a corresponding element in an auxiliary array of results; and
upon each thread having finished execution and for each segment of which elements have been distributed into different chunks of elements, executing the operation on the partial results of the operation comprised in the array of results and the auxiliary array of results, and writing the final result of the operation executed on the respective segment into the array of results.
12. The method according to one of claims 1 to 11, wherein the computer has a multi-core processor, and at least some of the threads are executed on respective cores of the processor in parallel.
13. The method according to one of claims 1 to 12, wherein the number of chunks of elements is equal to or an integer-multiple of the number of cores of a multi-core processor of the computer.
14. A computer comprising:
a receiving unit configured to receive a command to execute an operation on at least one argument, at least one of which is represented by a segmented array, wherein the segmented array has an array of elements and an array of segment indicators, wherein the array of elements comprises elements of plural segments, and the array of
segment indicators is indicative of which one or more elements belong to which one of the segments, wherein the size of data in the array of segment indicators is proportional to the number of segments in the segmented array;
a distribution unit configured to distribute the elements of the segmented array into plural chunks of elements for execution in respective threads, wherein each chunk comprises a number of elements of one or more of the segments; and
an execution unit configured to execute the operation on the elements of each chunk of elements in respective threads.
15. The computer of claim 14, further comprising a memory unit for storing the segmented array and results of executing the operation on the elements of each chunk of elements.
16. The computer according to claim 14 or 15, wherein the execution unit comprises one or more processors adapted to execute the operation on the elements of each chunk of elements in respective threads.
17. The computer according to claim 14 or 15, wherein the execution unit comprises a multi-core processor and the distribution unit is configured to distribute the plural chunks of elements to the respective cores of the processor for execution in respective threads.
18. The computer according to claim 16 or 17, wherein the computer is configured to execute at least some of said respective threads in parallel.
19. The computer according to one of claims 14 to 18, wherein the distribution unit is adapted to distribute the elements of the segmented array into plural chunks of elements so that each chunk of elements has approximately the same number of elements.
20. The computer according to one of claims 14 to 19, wherein the execution unit is configured to determine whether the operation is an element-wise operation, or whether the operation is commutative and has an identity element, or whether the operation is not commutative or has no identity element, and to execute the operation based on the determined type of operation.
21. The computer according to one of claims 14 to 20, wherein the execution unit is configured to sequentially execute the element-wise operation on each element of the respective chunk of elements, and to write the result of the element-wise
operation to an array of results in a memory unit, in case the operation is an element-wise operation.
22. The computer according to one of claims 14 to 20, wherein the execution unit is configured to identify, based on the segment indicators of the segmented array, the elements of the respective chunk of elements belonging to the same segment; to execute, for each of the segments of which elements are comprised in the respective chunk of elements, the operation on the elements of the respective chunk of elements belonging to the respective segment to obtain a partial result; configured to execute the operation on the partial result and on another partial result or identity element corresponding to the respective segment in an array of results to obtain a final result; and configured to write the final result of the operation executed on the elements of the respective segment to said array of results, in case the operation is commutative and has an identity element.
23. The computer according to claim 22, further comprising a conversion unit configured to convert the operation into an atomic operation, in case the operation is not atomic;
wherein the execution unit is configured to execute the atomic operation on the partial result and on said other partial result or identity element corresponding to the respective segment in the array of results.
24. The computer according to one of claims 14 to 23, wherein, in case the operation is not commutative or has no identity element, the execution unit is configured to identify, based on the segment indicators of the segmented array, the elements of the chunk of elements belonging to the same segment; and to identify for each segment of which one or more elements are comprised in the respective chunk of elements and based on the segment indicators of the segmented array, whether the elements of the segment have been divided across multiple chunks or not, and if so, whether the one or more elements of the respective segment comprised in the respective chunk of elements correspond to the first one or more elements of the segment or not;
for each of the segments of which one or more elements are comprised in the respective chunk of elements:
- the execution unit is configured to execute the operation on the
elements of the respective segment, and to write the result of the operation to a corresponding element in an array of results, if all elements of the respective segment are comprised in the chunk of elements;
the execution unit is configured to execute the operation on the first elements of the respective segment to obtain a partial result of the operation for the respective segment, and to write the partial result to a corresponding element in the array of results, if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements comprises the first elements of the respective segment; and
- the execution unit is configured to execute the operation on the elements of the respective segment to obtain a partial result of the operation for the respective segment, and to write the partial result to a corresponding element in an auxiliary array of results, if not all elements of the respective segment are comprised in the respective chunk of elements and if the respective chunk of elements does not comprise the first elements of the respective segment; and wherein the execution unit is configured to execute, upon each thread having finished execution and for each segment of which elements have been distributed into different chunks of elements, the operation on the partial results of the operation comprised in the array of results and the auxiliary array of results, and to write the final result of the operation executed on the respective segment into the array of results.
25. A computer readable medium storing instructions that, when executed by a processor of a computer, cause the computer to execute an operation on an argument represented by a segmented array, by:
receiving a command to execute an operation on an argument represented by a segmented array, wherein the segmented array has an array of elements and an array of segment indicators, wherein the array of elements comprises elements of plural segments, and the array of segment indicators is indicative of which one or more elements belong to which one of the segments;
distributing the elements of the segmented array into plural chunks of elements for execution in respective threads, wherein each chunk comprises a number of elements of one or more of the segments; and
executing the operation on the elements of each chunk of elements in
respective threads.
26. The computer readable medium of claim 25, further storing instructions that, when executed by the processor of the computer, cause the computer to perform the steps of the method according to one of claims 1 to 13.
Non-Patent Citations

- Chakravarty et al., "Data Parallel Haskell: a status report", DAMP 2007: Workshop on Declarative Aspects of Multicore Programming, ACM Press, 2007.
- Guy Blelloch and Gary W. Sabot, "Compiling Collection-Oriented Languages onto Massively Parallel Computers", Journal of Parallel and Distributed Computing, 1990.
- Pedro J. Martín, Luis F. Ayuso, Roberto Torres and Antonio Gavilanes, "Algorithmic Strategies for Optimizing the Parallel Reduction Primitive in CUDA", High Performance Computing and Simulation (HPCS), 2012 International Conference on, July 2012, pages 511-519.
- Shubhabrata Sengupta, "Efficient Primitives and Algorithms for Many-core Architectures", Internet article, 2011, retrieved from http://www.cs.ucdavis.edu/research/tech-reports/2011/CSE-2011-4.pdf