WO2016177405A1 - Systems and methods for transformation of a dataflow graph for execution on a processing system - Google Patents


Info

Publication number
WO2016177405A1
Authority
WO
WIPO (PCT)
Application number
PCT/EP2015/059826
Other languages
French (fr)
Inventor
Natan Peterfreund
Eyal ROZENBERG
Adnan Agbaria
David MINOR
Ofer Rosenberg
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN201580079578.3A priority Critical patent/CN110149801A/en
Priority to PCT/EP2015/059826 priority patent/WO2016177405A1/en
Publication of WO2016177405A1 publication Critical patent/WO2016177405A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/45Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451Code distribution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/447Target code generation

Definitions

  • the present invention in some embodiments thereof, relates to program execution on a processing system, in particular a heterogeneous system, and, more specifically, but not exclusively, to systems and methods for transformation of a dataflow graph of a computer program for execution on the processing system.
  • Heterogeneous systems include a set of interconnected processors, each of which is based on a different computer architecture and computing model.
  • Examples of such processors include graphics processing units (GPU), which are parallel in nature and based on a single instruction multiple data (SIMD) computing model; multi-threaded central processing units (CPU), each of which is serial in nature; and general purpose field programmable gate arrays (FPGA), which provide various intermediate forms of computing models.
  • Computer programs written in a high-level abstract programming language such as a domain specific language (DSL) may be parsed to an intermediate representation (IR), such as a dataflow graph (DFG).
  • The DFG contains nodes that represent computing operations, which are selected from an operation set defined by the DSL. Edges of the DFG represent data relations between the computing nodes.
  • the same DFG may be executed within the same heterogeneous system at different performance levels, depending on multiple factors, such as which particular processor of the heterogeneous system is executing the DFG. For example, execution of the DFG by the GPU may be much quicker than execution of the same DFG by the CPU. In another example, execution by the CPU may be quicker than execution by the GPU.
  • Other additional performance affecting parameters include: the data format of the data being processed by the DFG, the input received by the executing DFG, and the way in which computing resources are allocated for the different computing nodes of the DFG.
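The structure described above — nodes for computing operations, edges for data relations — can be sketched as a small Python class. All names here (`DFG`, `scan`, `join`, `aggregate`) are illustrative, not taken from the patent.

```python
# Hypothetical minimal dataflow-graph (DFG) structure: nodes are high-level
# operations, edges are data dependencies between them.

class DFGNode:
    def __init__(self, name, op):
        self.name = name      # unique node identifier
        self.op = op          # high-level operation, e.g. "join", "filter"

class DFG:
    def __init__(self):
        self.nodes = {}       # name -> DFGNode
        self.edges = []       # (producer name, consumer name) pairs

    def add_node(self, name, op):
        self.nodes[name] = DFGNode(name, op)

    def add_edge(self, src, dst):
        self.edges.append((src, dst))

    def successors(self, name):
        return [d for (s, d) in self.edges if s == name]

# A query-like program: scan two tables, join them, then aggregate.
g = DFG()
g.add_node("scan_a", "scan")
g.add_node("scan_b", "scan")
g.add_node("join", "join")
g.add_node("agg", "aggregate")
g.add_edge("scan_a", "join")
g.add_edge("scan_b", "join")
g.add_edge("join", "agg")

print(g.successors("join"))   # ['agg']
```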
  • an apparatus adapted to transform nodes of a flow graph for execution on an, in particular distributed, processing system comprises: an interface adapted to receive a dataflow graph that comprises nodes, each node representing a high-level operation; a compiler adapted to: transform at least one high-level operation node to at least one low-level operation node corresponding to the at least one high-level operation to create a transformed dataflow graph, the at least one low-level operation adapted for execution on a processor of a plurality of processors of a runtime environment executing the transformed dataflow graph, the transform performed according to a performance measure calculated for each processor executing the at least one high-level operation using the at least one low-level operation adapted for execution by the respective processor.
  • Processor specific optimizations are performed at the front end, before compilation of the program, to improve execution of the dataflow graph representation of the computer program. Optimizations are performed at the dataflow graph level, which allows for a compiler to further optimize the dataflow graph before compilation and execution.
  • Each processor has its own low-level operation designed for optimal performance, instead of, for example, mapping the high-level operation to a common low-level operation designed for execution on all or multiple processors, which results in lower performance.
  • the graph representation allows for application of standard graph optimization methods.
  • the graph representation provides compatibility with existing system components that parse a computer program written in a high-level language into a dataflow graph representation.
  • the apparatus further comprises a set of low-level operations defined for each processor, each set including low-level operations, each low-level operation adapted for a variation of data for processing by the computer program, wherein the at least one low-level operation is selected from the set corresponding to the processor.
  • Each low-level operation is designed for optimal performance on the corresponding target processor, instead of, for example, the same high-level operation being compiled to one of multiple available target processors, which results in lower performance.
  • Each low-level operation may be optimally designed for different data formats, instead of, for example, the same high-level operation which is designed for a common data format, which results in lower performance.
  • the at least one high-level operation is based on abstract operations defined by a domain specific language (DSL) used to write the computer program, each high-level operation being mappable to a plurality of low-level operations for execution on different processors.
  • Processor specific optimizations are performed for DSL programming languages, for example, for the R programming language for statistical computing, and for the SQL (structured query language) programming language for databases.
  • the high-level operation is mappable to one of different available combinations or subsets of low-level operations (i.e., not necessarily in a 1:1 manner), which provides improved performance when the best combination of low-level operations is selected.
  • the performance measure includes one or more members of the group consisting of: lower computation time, lower computation complexity, lower expenditure of energy, and lower momentary power consumption.
  • the interface is adapted to receive a dataset for processing by the transformed dataflow graph, and the transform is performed according to the performance measure calculated according to processing of the dataset.
  • the apparatus further comprises a preprocessing module adapted to generate a plurality of instances of the dataset, each instance adapted for processing by at least one low-level operation executed by a processor of the plurality of processors.
  • Creation of the multiple instances of the dataset improves performance, as each instance is designed for more efficient execution performance using the corresponding processor and/or low-level operations.
  • the multiple instances are generated in advance of program execution, to further improve performance.
  • the apparatus further comprises a pre-processing module adapted to generate at least one statistical value according to an analysis of the dataset, and wherein the transform is performed according to the at least one statistical value.
  • the processor and/or low-level operations are selected for best performance during processing of the data according to characteristics of the data itself.
  • a seventh possible implementation form of the apparatus according to the fourth, fifth, or sixth implementation forms of the first aspect wherein the transformation is performed according to the performance measure of the processor executing the at least one low-level operation on the dataset in relation to other processors executing the at least one low-level operation on the dataset.
  • the processor and/or the low-level operations are selected for execution to improve performance.
  • the performance measure allows for selection of one processor over another, or for particular low-level operations over other operations.
  • the transform comprises inserting at least one data copy node in the dataflow graph, the copy node defining copying of data between memories of different processors.
  • the data-copy node allows the flow graph to represent the low-level operation describing data communication between different processors, which is used to improve performance when different processors transfer data between each other.
  • the transform comprises inserting at least one data processing node in the dataflow graph, each data processing node defining a member selected from the group consisting of: conversion of data from one format to another format, partition split of data to memories of different processors, and joining of two or more data items from memories of different processors.
  • the data-processing node allows the flow graph to represent the low-level operations describing processing of the data involving communication between different processors, which is used to improve performance when different processors work together on data.
  • the transform is performed according to one or more of the following: transform a dataflow graph node of a high-level operation into a subgraph that includes a plurality of the low-level operations, wherein the subgraph has the same semantic as the node; transform a subgraph of high-level operations of the dataflow graph into a single node representing a single low-level operation, wherein the single node has the same semantic as the subgraph; transform a first subgraph of high-level operations of the dataflow graph into a second subgraph of low-level operations, wherein the first subgraph and second subgraph have the same semantic. Portions of the graph are transformed to improve performance.
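The three transformation shapes listed above (node into subgraph, subgraph into node, subgraph into subgraph) can be illustrated with the first one: the sketch below expands a single high-level node into a semantically equivalent chain of low-level nodes. The graph encoding and all operation names are hypothetical, not from the patent.

```python
# Illustrative node -> subgraph transformation, applied to a graph stored
# as {node name: (operation, [input node names])}.

def expand_node_to_subgraph(graph, node, subgraph_ops):
    """Replace one high-level node with a chain of low-level nodes
    that has the same semantics (node -> subgraph)."""
    op, inputs = graph.pop(node)
    prev = inputs
    for i, low_op in enumerate(subgraph_ops):
        name = f"{node}.{i}"
        graph[name] = (low_op, prev)
        prev = [name]
    # re-point consumers of the original node to the tail of the chain
    tail = prev[0]
    for n, (o, ins) in graph.items():
        graph[n] = (o, [tail if x == node else x for x in ins])
    return graph

g = {
    "scan": ("scan", []),
    "sort": ("sort", ["scan"]),
    "out": ("emit", ["sort"]),
}
# expand a high-level "sort" into a hypothetical split/local-sort/merge pipeline
expand_node_to_subgraph(g, "sort", ["split", "local_sort", "merge"])
print(g["out"])   # ('emit', ['sort.2'])
```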
  • the compiler is further adapted to iteratively transform the transformed dataflow graph, by transforming the transformed at least one low-level operation node into another at least one low-level operation node.
  • Iterative transformations may further improve performance.
  • the runtime environment executes the transformed dataflow graph.
  • the transformed graph itself may be optimized using graph optimization methods, and then executed in a standard manner by the runtime environment, providing compatibility with existing systems.
  • FIG. 1 is a flowchart of a method for transforming nodes of a dataflow graph, in accordance with some embodiments of the present invention
  • FIG. 2 is a block diagram of components of a system including an apparatus for transforming nodes of a dataflow graph, in accordance with some embodiments of the present invention
  • FIG. 3 is a flowchart of some possible transformations performed by the apparatus of FIG. 2 and/or based on the method of FIG. 1, in accordance with some embodiments of the present invention
  • FIG. 4 is an example of a subgraph of a transformed dataflow graph, in accordance with some embodiments of the present invention.
  • FIG. 5 is a block diagram of a system for compilation and execution of a transformed dataflow graph, incorporating the apparatus of FIG. 2, in accordance with some embodiments of the present invention.
  • FIGs. 6A-6D are schematic diagrams depicting generation of dataset instances and/or low-level instructions for transformation, designated for improved performance on different processor architectures, in accordance with some embodiments of the present invention.
  • An aspect of some embodiments of the present invention relates to a compiler that creates a transformed data flow graph by transforming one or more high-level operation nodes of a dataflow graph to one or more low-level operation nodes that are designed for execution on a certain processor of multiple processors of a runtime environment.
  • the runtime environment is a distributed processing system including different processor architectures, for example, a heterogeneous system.
  • Each of the multiple processors is associated with a different set of low-level operations designed for execution on the respective processor.
  • the compiler performs the transformation according to one or more performance measures calculated for each (or a subset of) the multiple processors executing their corresponding low-level operations corresponding to the same high-level operation of the dataflow graph. In this manner, the best performing low-level operations and corresponding processor are designated for the high-level operation node(s), before the transformed dataflow graph is executed.
  • each processor is associated with a predefined set of low-level operations designed for variations of the data designated for processing by the dataflow graph.
  • the low-level operations are designed for improved performance during execution of the data variation by the respective processor.
  • two low-level operations corresponding to joining two datasets together may be available, where each low-level operation is designed to improve performance of the join operation according to characteristics of the datasets, such as when one of the datasets is sorted.
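One way to read the join example above: the same high-level join maps to two low-level variants, and the cheaper one is chosen when a characteristic of the data (here, sortedness) permits it. The function names and dispatch rule are illustrative, not from the patent.

```python
# Two hypothetical low-level implementations of one high-level "join",
# dispatched on a data characteristic.

def merge_join(a, b):
    # cheaper variant, usable only when both inputs are sorted
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def hash_join(a, b):
    # general variant: no ordering assumption
    bs = set(b)
    return [x for x in a if x in bs]

def join(a, b, a_sorted=False, b_sorted=False):
    if a_sorted and b_sorted:
        return merge_join(a, b)
    return hash_join(a, b)

print(join([1, 3, 5, 7], [3, 4, 5], a_sorted=True, b_sorted=True))  # [3, 5]
```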
  • the low-level operation nodes are selected by the compiler from the set according to the designated processor and/or the performance measure.
  • the processor is designated according to availability of suitable low-level operations.
  • the performance measure is calculated (e.g., by the compiler) for different processors available to execute portions of the dataflow graph processing the dataset. Different characteristics of datasets may result in different performance measures.
  • the compiler may select the low-level operations and corresponding processor for execution of the dataset according to the calculated performance measure.
  • a pre-processing module in communication with the compiler generates multiple instances of the dataset, which store the same information in different formats. Each instance is designed for processing using low-level operations by a different processor.
  • the compiler may select the instance to execute on the respective processor according to the performance measure. The low-level operations and corresponding processor may be designated based on the selected instance.
  • the compiler inserts one or more data copy nodes in the transformed data flow graph.
  • the data copy nodes define copying of data between memories associated with different processors.
  • the data copy nodes may be inserted when different processors are designated to execute different parts of the transformed dataflow graph, in order to pass data between the different processors.
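A minimal sketch of that insertion rule, assuming a hypothetical edge list and a node-to-processor placement map: whenever an edge crosses processors, a copy node modeling the memory-to-memory transfer is spliced into the edge.

```python
# Data-copy node insertion on cross-processor edges. Processor names and
# the placement convention (copy runs on the consumer side) are assumptions.

def insert_copy_nodes(edges, placement):
    """edges: list of (src, dst); placement: node -> processor name."""
    new_edges = []
    for src, dst in edges:
        if placement[src] != placement[dst]:
            copy = f"copy:{src}->{dst}"
            placement[copy] = placement[dst]   # copy executes on the consumer side
            new_edges.append((src, copy))
            new_edges.append((copy, dst))
        else:
            new_edges.append((src, dst))
    return new_edges

edges = [("scan", "filter"), ("filter", "agg")]
placement = {"scan": "cpu", "filter": "cpu", "agg": "gpu"}
print(insert_copy_nodes(edges, placement))
# [('scan', 'filter'), ('filter', 'copy:filter->agg'), ('copy:filter->agg', 'agg')]
```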
  • The compiler as described herein may be implemented within an apparatus, as a program module (in hardware and/or software), as a system, as a method, and/or as a computer program product.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a flowchart of a method for transforming nodes of a dataflow for execution in a processing system, optionally a distributed heterogeneous system, in accordance with some embodiments of the present invention.
  • FIG. 2 is a block diagram of components of a system that improves performance execution of high-level operation nodes of a dataflow graph, by transforming the high-level operation nodes into low- level operation nodes designated for execution at certain processor(s), in accordance with some embodiments of the present invention.
  • The method of FIG. 1 may be executed by the apparatus and/or system of FIG. 2.
  • a dataflow graph 202 that includes multiple operation nodes is received by an apparatus 204, optionally via an interface 206 (e.g., a network connection, a hard-drive, an external memory card, a connection to an internal bus, or an abstract interface such as an application programming interface (API)).
  • Each node of the dataflow graph represents a high-level operation, for example, joining of two datasets, searching within a dataset, aggregation of data, and selection of a sub-set of the dataset.
  • the dataflow graph is generated from a computer program, optionally from source code of the computer program, for example, by another compiler module.
  • the flow graph models functions in the source code as nodes. Data flow and/or data relations between the functions (i.e., nodes) are represented as edges between the nodes.
  • the computer program may be a whole computer program, a part of a computer program, and/or a single algorithm.
  • the graph representation allows for utilization of standard graph optimization methods.
  • the graph representation provides compatibility with existing system components that parse a computer program written in a high-level language into a dataflow graph representation.
  • the dataflow graph is designed for execution within a runtime execution environment 214 including multiple processors 216A-C (it is noted that more or fewer processors may exist within the execution environment).
  • Runtime execution environment 214 may be organized as a distributed processing system, optionally a heterogeneous distributed processing system including multiple different types of processors.
  • Processors 216A-C may be dissimilar, optionally operating using different instruction set architectures (ISAs).
  • Processors 216A-C may have different architectural designs, for example, central processing units (CPUs), graphical processing units (GPUs), field programmable gate arrays (FPGA), processors for interfacing with other units, and/or specialized hardware accelerators (e.g., encoders, decoders, and cryptography co-processors).
  • the dataflow graph may represent a distributed computer program.
  • the high-level operations may be based on abstract operations defined by a domain specific language (DSL) 208 used to write the computer program. Each high-level operation may be mapped to multiple different low-level operations for execution on different processors.
  • the DSL may provide a higher level of abstraction of data types, and/or more general use of abstract data types than would be available with other programming languages, such as low-level programming languages, and/or programming languages that were not specifically designed to handle problems in the same domain as the DSL.
  • the DSL may be a pre-existing available DSL, or a custom developed DSL, for example, the R programming language for statistical computing, and the SQL (structured query language) programming language for databases.
  • a dataset 210 is received and/or accessed by apparatus 204 (optionally via interface 206).
  • Dataset 210 is designated for processing by the transformed dataflow graph of the computer program (transformation of the dataflow graph is described, for example, with reference to block 112), for example, a database on which queries defined by the dataflow graph are executed.
  • Dataset 210 may be stored on a local memory and/or remote server in communication with apparatus 204.
  • Each instance is designed for processing by one or more low-level operations executable by a certain processor of the processors of the execution environment.
  • Each instance contains the same data within dataset 210 in a different format; for example, the same data is organized in a different data structure, or the dataset is split into two sub-sets.
  • each instance contains the same data organized differently. For example, data may be pre-sorted by one column in one instance and pre-sorted by another column in another instance, and an additional column is added to designate groups within the dataset.
  • the instances may be generated according to the set of available low-level operations associated with each processor.
  • the instance may be generated to improve performance of execution by the respective processor using the respective low-level operations.
  • the instance may be in a format suitable for execution within a parallel processor, such as a GPU.
  • the instance may be generated to reduce the number of low-level processor instructions required to execute high-level operations; for example, pre-sorting the dataset and/or adding another column designating group numbers may reduce the number of low-level processor instructions required to aggregate the dataset, as compared to the non-pre-sorted dataset.
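The pre-sorting and group-column idea above can be sketched as follows: the same rows stored in a second format, sorted by a key with a group number added, so a later aggregation can run in one pass. The column layout and function name are hypothetical.

```python
# Illustrative dataset-instance generation: sort by key and attach a group
# number shared by all rows with the same key value.

def make_grouped_instance(rows, key):
    """Return rows sorted by `key`, each extended with a 'group' column."""
    rows = sorted(rows, key=lambda r: r[key])
    out, group, prev = [], -1, object()
    for r in rows:
        if r[key] != prev:
            group += 1
            prev = r[key]
        out.append({**r, "group": group})
    return out

raw = [{"city": "B", "v": 2}, {"city": "A", "v": 1}, {"city": "B", "v": 3}]
inst = make_grouped_instance(raw, "city")
print(inst)
# [{'city': 'A', 'v': 1, 'group': 0},
#  {'city': 'B', 'v': 2, 'group': 1},
#  {'city': 'B', 'v': 3, 'group': 1}]
```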
  • the instances may be generated by a pre-processing module 212, which may be a component of apparatus 204 and/or an external component (e.g. residing on a local computer and/or remote server) in association with apparatus 204.
  • pre-processing module 212 analyzes dataset 210 and generates one or more statistical values based on the analysis. Examples of statistical values include: distribution of size of data within the dataset, and organization of the data within the dataset (e.g., sorted or non-sorted). The transformation of the dataflow graph (e.g., as described with reference to block 112) may be performed according to the calculated statistical values.
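A minimal sketch of such pre-computed statistics, assuming a one-column dataset; the metric names are illustrative, and a real module would compute richer distributions.

```python
# A few hypothetical dataset statistics (size, range, sortedness) computed
# once and later consulted when choosing low-level operations.

def dataset_stats(values):
    n = len(values)
    return {
        "count": n,
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / n,
        "sorted": all(values[i] <= values[i + 1] for i in range(n - 1)),
    }

print(dataset_stats([1, 2, 2, 9]))
# {'count': 4, 'min': 1, 'max': 9, 'mean': 3.5, 'sorted': True}
```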
  • blocks 104 and 106 may occur prior to and/or independently of block 102.
  • Dataset 210 may be received prior to and/or independently of receiving the dataflow graph.
  • Dataset 210 may be pre-processed, to generate the instances prior to receiving the dataflow graph, such that the instances are available for use when the dataflow graph is received.
  • one or more performance measures are calculated for each processor executing the high-level operation(s) using the low-level operations designed for execution by the respective processor.
  • the same high-level operation may be mapped to different low-level operations executed by different processors.
  • each processor may have different performance measure values associated with performing the low-level operation that corresponds to the high-level operation.
  • one or more performance measures are calculated for a sub-set of low-level operations designated from a set of operations 218 associated with the respective processor.
  • Each processor is associated with its own set of low-level operations.
  • the same high-level operation may be mapped to different sub-sets of low-level operations selected from the set.
  • the set of low-level operations may include variations of the same operation, for example, for processing different formats of data, for processing different sizes of data, and for processing data with different statistical values (e.g., distribution).
  • Each set of low-level operations is designed to execute a variation of the dataset.
  • the instances of the dataset (e.g., described with reference to block 106) may be generated according to the different sub-sets of low-level operations.
  • each sub-set of low-level operations may have different performance measure values associated with performing the sub-set of low-level operations corresponding to the same high-level operation.
  • the performance measure may be calculated based on each processor executing the low-level operations (or sub-sets thereof) on dataset 210, optionally on each generated instance of dataset 210.
  • the performance measure may be estimated based on past performance of the processor executing a similar dataset (e.g., a dataset with similar calculated statistical values).
  • the calculated performance measure may depend on the available system resources, for example, processors, memory, and bandwidth for data transfer between processors.
  • the performance measure may include one or more of: computation time, computation complexity, expenditure of energy, and momentary power consumption.
  • the performance measure may be absolute, or a relative measure between processors.
  • the processor may be selected according to the lowest value of the calculated performance measure, absolute and/or relative.
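Putting the selection rule above together: a hedged sketch in which a performance measure (here, an invented computation-time table in milliseconds) is evaluated per (processor, low-level operation) candidate and the lowest value wins. All names and numbers are hypothetical.

```python
# Designate the (processor, low-level operation) pair with the lowest
# calculated performance measure.

# hypothetical estimated computation times (ms) for one high-level "join"
candidates = {
    ("cpu", "hash_join"): 120.0,
    ("cpu", "merge_join"): 80.0,
    ("gpu", "hash_join"): 35.0,
}

def designate(measures):
    """Return the (processor, low-level op) key with the lowest measure."""
    return min(measures, key=measures.get)

print(designate(candidates))  # ('gpu', 'hash_join')
```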
  • the processor and/or the low-level operations are designated for execution of the transformed high-level operation node(s) to improve performance according to the selected performance measures.
  • the performance measure allows for selection of one processor over another, or for particular low-level operations over other operations.
  • Set of operations 218 may be stored in association with each respective processor 216A-C, for example, on the memory associated with each processor.
  • Set of operations 218 may be stored external to the processor, accessible and/or in communication with the processor, for example as a component of apparatus 204, on a remote server, and/or on a central local server.
  • Set of operations 218 may be manually defined by a programmer, for example, manually written for each processor based on the architecture of the respective processor.
  • the same join operation may be defined using different low-level instructions on the same processor, to achieve the same result with different performance measures.
  • the certain processor out of the available processors 216A-C of runtime environment 214 is designated.
  • the sub-set of low-level operations is designated from the set of low-level operations defined for the designated processor. It is noted that the processor and associated sub-set of low-level operations may be designated simultaneously, or sequentially.
  • the processor and/or low-level operations are designated according to the calculated performance measures, for example, according to requirements, such as a function of the performance measures, a range, and/or a threshold.
  • designation is performed according to the performance measure of the processor executing its respective low-level operation on the dataset in relation to other processors executing their respective low-level operation on the dataset.
  • the processor and/or low-level operations may be designated to achieve particular desired improvements in performance. For example, in some cases, cost may be the primary factor, while in other cases computation time may be the primary factor.
  • the processor and/or low-level operations are designated for best performance during processing of the data according to characteristics of the data itself.
  • the dataflow graph is transformed by a compiler 220 component of apparatus 204 into a transformed dataflow graph 222 according to the designated processor and/or low-level operations.
  • the transformation is performed simultaneously with the designation of block 110.
  • designation of block 110 and the transformation are performed sequentially.
  • Compiler 220 transforms one or more high-level operation nodes to one or more low-level operation nodes corresponding to the high-level operation of the node, to create transformed dataflow graph 222.
  • the low-level operations are designed for execution on a processor of processors 216A-C of runtime environment 214 executing transformed dataflow graph 222.
  • the transform of each high-level operation node (or group thereof) is performed according to the performance measure calculated for the different possible low-level operation nodes, for example, the low-level operation nodes having the lower calculated performance measure values.
  • the transform is performed according to the performance measure calculated according to the processor processing the dataset using the low-level operations.
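The designation described in the blocks above amounts to a cost-driven selection over (processor, low-level operation) pairs. A minimal sketch, with hypothetical names and a toy cost table (none of which come from this disclosure):

```python
# Illustrative sketch of performance-measure-driven designation: each
# candidate pairs a processor with a low-level operation implementing the
# high-level operation, and the pair with the best (lowest) measure wins.

def designate(high_level_op, processors, op_sets, measure):
    """Return the (processor, low_level_op) pair minimizing the measure."""
    candidates = []
    for proc in processors:
        for low_op in op_sets[proc].get(high_level_op, []):
            candidates.append((measure(proc, low_op), proc, low_op))
    if not candidates:
        raise ValueError("no low-level operation defined for " + high_level_op)
    _, proc, low_op = min(candidates)
    return proc, low_op

# Toy example: a cost model under which the GPU joins faster than the CPU.
op_sets = {
    "cpu": {"join": ["hash_join"]},
    "gpu": {"join": ["scatter_gather_join"]},
}
cost = {("cpu", "hash_join"): 10.0, ("gpu", "scatter_gather_join"): 3.0}
proc, op = designate("join", ["cpu", "gpu"], op_sets, lambda p, o: cost[(p, o)])
```

Here the measure is a simple table lookup; in practice it would be computed from the dataset, its statistical values, and the processor architecture, as described herein.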
  • FIG. 3 is a flowchart of some possible transformations performed by compiler 220, in accordance with some embodiments of the present invention.
  • Portions of the graph are transformed to improve local performance of each transformed portion, and/or global performance of the dataflow graph.
  • local portions of the graph are transformed to improve performance efficiency measure values calculated for the local portions (e.g., Turing complete complexity).
  • global transformations are performed based on greedy rules applied at a local level, which ensure that the local transformations are globally beneficial.
  • the transformation may be performed according to a set of rules, transformation algorithm, graph processing methods, or other methods, for example, graph transformations are evaluated for validity before a subgraph is applied.
  • Graph transformations may be defined by a set of APIs.
  • the dataflow graph (or portion thereof) is transformed by inserting one or more data copy nodes in the dataflow graph.
  • the copy node defines copying of data between memories of different processors.
  • the copy node may be inserted, for example, between two data processing nodes designated for execution by different processors, to allow communication of the processed data between the different processors.
  • the data-copy node allows the flow graph to represent the low-level operation describing data communication between different processors, which is used to improve performance when different processors need to transfer data between each other.
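The copy-node insertion can be sketched as a pass over the graph edges: wherever an edge crosses a processor boundary, a copy node is spliced in. The representation below (an edge list plus a node-to-processor placement map) is an illustrative simplification, not the patent's own data structures:

```python
# Sketch: insert a data-copy node on every edge whose endpoints are
# designated to different processors, so inter-processor data movement
# is represented explicitly in the dataflow graph.

def insert_copy_nodes(edges, placement):
    """edges: list of (src, dst) pairs; placement: node -> processor name."""
    new_edges = []
    copies = []
    for src, dst in edges:
        if placement[src] != placement[dst]:
            copy = f"copy_{placement[src]}_to_{placement[dst]}_{src}_{dst}"
            copies.append(copy)
            new_edges.append((src, copy))   # data flows into the copy node
            new_edges.append((copy, dst))   # and out to the consumer
        else:
            new_edges.append((src, dst))
    return new_edges, copies

edges = [("preprocess", "op_a"), ("op_a", "op_b")]
placement = {"preprocess": "cpu", "op_a": "cpu", "op_b": "gpu"}
new_edges, copies = insert_copy_nodes(edges, placement)
```

Only the CPU-to-GPU edge receives a copy node; the CPU-internal edge is left untouched.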
  • the dataflow graph (or portion thereof) is transformed by inserting one or more data processing nodes in the dataflow graph.
  • the data processing node defines one of: conversion of data from one format to another format, partition split of data to memories of different processors, and joining of two or more data items from memories of different processors.
  • the data processing node may be inserted according to the designated low-level operations and/or designated processors, to prepare data for processing using different low-level operation sets and/or different processors.
  • the data-processing node allows the flow graph to represent the low-level operations describing processing of the data involving communication between different processors, which is used to improve performance when different processors work together.
  • a dataflow graph node of a high-level operation is transformed into a subgraph that includes multiple low-level operations.
  • the replacement subgraph has the same semantic as the original node.
  • Such a transformation may be performed, for example to transform a complex high-level operation into multiple simpler low complexity operations.
  • the set of lower complexity operations may be designated based on performance efficiency over the higher complexity operation.
  • a subgraph of high-level operations of the dataflow graph is transformed into a single node representing a single low-level operation.
  • the single node has the same semantic as the subgraph.
  • Such a transformation may be performed, for example, to simplify multiple operations into a single operation when the single operation has improved performance efficiency over the multiple operations.
  • an original subgraph of high-level operations of the dataflow graph is transformed into another subgraph of low-level operations.
  • the original subgraph and transformed subgraph have the same semantic.
  • the subgraph transformation may be performed, for example, to designate one computing approach over another when the transformed approach has improved performance efficiency over the original approach.
  • an original node of high-level operations of the dataflow graph is transformed into another node of low-level operations.
  • the original node and transformed node have the same semantic.
  • the node transformation may be performed, for example, to replace a high-level operation by a corresponding low-level operation when the transformed node has improved performance efficiency over the original node.
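The common requirement of the four transformations above is semantic equivalence: the replacement node or subgraph must generate the same results as the original for the same input. A toy illustration (the operations are hypothetical stand-ins, not drawn from the operation sets described herein), expanding a single high-level node into a subgraph of two simpler low-level nodes:

```python
# Original high-level node: a single 'mean' operation.
def high_level_mean(xs):
    return sum(xs) / len(xs)

# Replacement subgraph: two lower-complexity low-level nodes
# ('sum' followed by 'div') with the same semantic as the original node.
def low_level_sum(xs):
    return sum(xs)

def low_level_div(total, n):
    return total / n

def transformed_mean(xs):
    return low_level_div(low_level_sum(xs), len(xs))

data = [2.0, 4.0, 6.0]
```

For any input, the original node and the replacement subgraph produce identical results, which is what makes the substitution valid.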
  • FIG. 4 is an example of a subgraph 400 of a transformed dataflow graph, in accordance with some embodiments of the present invention.
  • the original dataflow graph (i.e., before transformations) includes nodes 402, 404, 406, 408, and 410.
  • the original graph depicts a program that pre-processes data at node 402. An operation is performed at node 404.
  • the output of node 404 is processed by either node 406, which applies another operation, or node 408, which applies another operation.
  • the output of nodes 406 and 408 is combined by node 410 which performs another operation.
  • the transformation adds copy nodes 452, 454, 456, 458, and 460 between the original nodes.
  • data Dj provided by node 402 is copied to the memory of a first designated processor.
  • the operation is performed on the copied data Dj.
  • Output Dj is then distributed to two different processors (i.e., second and third) for distributed processing.
  • data Dj is copied to data Di at the memory of the second processor.
  • the original operation is applied to data Di.
  • data Di is copied to data Dq at a fourth processor.
  • data Dj is copied to data Dw at the memory of the third processor.
  • the original operation is applied to data Dw.
  • data Dw is copied to data Dq at the fourth processor.
  • the original operation is applied by the fourth processor to combined data Dq.
  • One or more of the original nodes 402, 404, 406, 408, and 410 may be transformed into node(s) and/or a subgraph of low-level operations designated to be performed by the designated processor.
  • original node 406 is transformed into subgraph 470, which includes low-level operations designated to be performed by the second processor.
  • subgraph 470 is semantically similar to original node 406, generating the same results for the same input.
  • subgraph 470 has improved performance (e.g., according to the performance measure) as compared to original node 406.
  • compiler 220 iteratively transforms transformed dataflow graph 222 to generate an updated version of the transformed dataflow graph.
  • Each node and/or subgraph may be analyzed for additional transformations, for example, as described herein in reference to the original dataflow graph.
  • Each transformed low-level operation node or sub-graph of nodes may be further analyzed and/or transformed into another low-level operation node or sub-graph of nodes.
  • Iterative transformations may further improve performance, by further analyzing the transformed dataflow graph for additional transformations that lead to additional performance improvements.
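The iterative transformation can be sketched as rewriting to a fixed point: rules are re-applied to the transformed graph until no rule fires. The rule format and linear graph representation below are illustrative assumptions, not the patent's own:

```python
# Sketch of iterative rewriting: apply the first matching rule, then
# restart, until the graph is stable (no rule applies).

def apply_rules_once(graph, rules):
    for i, node in enumerate(graph):
        for rule in rules:
            replacement = rule(node)
            if replacement is not None:
                return graph[:i] + replacement + graph[i + 1:], True
    return graph, False

def transform_to_fixed_point(graph, rules, max_iters=100):
    for _ in range(max_iters):
        graph, changed = apply_rules_once(graph, rules)
        if not changed:
            return graph
    raise RuntimeError("rewriting did not converge")

# Toy rule: expand a high-level 'mean' node into 'sum' followed by 'div'.
rules = [lambda n: ["sum", "div"] if n == "mean" else None]
result = transform_to_fixed_point(["load", "mean", "store"], rules)
```

Re-running the pass on an already-transformed graph changes nothing, which is the fixed-point property that makes repeated analysis safe.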
  • runtime environment 214 executes transformed dataflow graph 222.
  • Processors 216A-C execute designated portions of transformed dataflow graph 222, optionally using the designated sub-set of low-level operations defined for the respective processor, optionally on associated and/or designated instances of dataset 210.
  • the transformed graph is further optimized using graph optimization methods, for example, by another compiler module, such as an off the shelf back-end compiler module.
  • the transformed graph (optionally the optimized transformed graph) may be executed in a standard manner by the runtime environment, providing compatibility with existing systems.
  • FIG. 5 is a block diagram of a system 500 for compilation and execution of a transformed dataflow graph, incorporating apparatus 204 of FIG. 2, in accordance with some embodiments of the present invention.
  • System 500 may include existing off-the-shelf equipment and/or code modules integrated with apparatus 204 of FIG. 2.
  • System 500 includes a storage unit 502 that stores a dataset, for example, a database on which queries are executed, as described herein, such as corresponding to dataset 210 of FIG. 2.
  • Storage unit 502 may include large storage capacity, which may be slow for loading and/or saving data.
  • a datastore unit 504 in communication with storage unit 502 may include a memory designed for fast data loading and/or storage. Datastore unit 504 may be quickly accessed by processors and/or compilers executing operations on the dataset.
  • a pre-processing module 512, corresponding to pre-processing module 212 of FIG. 2, is in communication with datastore 504.
  • Pre-processing module 512 accesses the dataset on datastore 504, to generate multiple instances of the dataset (as described herein) based on available low-level operations 518 defined for each processor (corresponding to operation sets 218 of FIG. 2), which may be stored as code modules.
  • the generated instances may be stored on datastore 504 for fast access by compilers and/or processors.
  • System 500 includes a front end (FE) compiler 506 that receives a source code written in a DSL, and parses the source code into a dataflow graph representation of operations, as described herein.
  • a back end (BE) compiler 520 (which corresponds to compiler 220 of FIG. 2) receives the dataflow graph, and creates a transformed dataflow graph, as described herein.
  • the graph transformations are performed from high-level operations to processor-designated low-level operations, as described herein, such as according to: defined low-level operations 518, pre-existing low-level functions that are not necessarily mapped to a designated processor 530, a set of rules for performing the transformation 522 (as described herein), performance measure(s) calculated according to a resource allocation module 524, and standard graph optimization methods according to a graph transformation module 526.
  • the transformed dataflow graph is compiled, optionally according to program-lib association, and scheduled for execution within a runtime environment 514 corresponding to runtime environment 214 of FIG. 2.
  • FIGs. 6A-6D are schematic diagrams depicting generation of dataset instances and/or low-level instructions for transformation, designated for improved performance on different processor architectures, in accordance with some embodiments of the present invention.
  • FIGs. 6A-6D depict database operations written in Structured Query Language (SQL) that are executed on instances of the dataset designed for improved performance when executed on a single instruction multiple data (SIMD) processor such as a graphics processing unit (GPU) as compared to a central processing unit (CPU).
  • FIG. 6A depicts execution of the SQL query 600: SELECT sum(c2) FROM t1 GROUP BY c1;.
  • the operation aggregates groups according to the key index value, and sums the data of each aggregated group.
  • Operation 600 may be performed by a standard CPU, shown graphically by arrow 602.
  • the unsorted data is sorted and grouped by the value of the index key.
  • the data for each group is summed.
  • Operation 600 may be represented as dataflow graph 604, which may be executed by the standard CPU.
  • operation 600 may be transformed for execution by a SIMD processor (e.g., GPU), which may increase performance compared to processing of graph 604 and/or processing by the CPU (as depicted by arrow 602).
  • Arrow 606 graphically depicts generation of an instance of the dataset that includes a group number column according to the value of the index key. Summation may be performed in a single executable function of an aggregated sum on the instance according to the group column.
  • Transformed dataflow graph 608 depicts nodes for performing the single function by the SIMD processor.
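The transformation of FIG. 6A may be sketched as follows (an illustrative simplification, not the patent's implementation): pre-processing assigns a dense group number per row according to the index-key value, after which the GROUP BY summation reduces to a single-pass aggregated sum over the group column, a pattern that maps well to SIMD execution:

```python
# Sketch of FIG. 6A's idea: pre-processing adds a group-number column keyed
# on c1, so SELECT sum(c2) FROM t1 GROUP BY c1 becomes one segmented-sum pass.

def add_group_column(keys):
    """Map each distinct key to a dense group number (the pre-processing)."""
    group_of = {}
    groups = []
    for k in keys:
        if k not in group_of:
            group_of[k] = len(group_of)
        groups.append(group_of[k])
    return groups, group_of

def aggregated_sum(groups, values, n_groups):
    """Single-pass segmented sum -- the SIMD-friendly low-level operation."""
    sums = [0] * n_groups
    for g, v in zip(groups, values):
        sums[g] += v
    return sums

c1 = ["a", "b", "a", "b", "a"]   # index-key column
c2 = [1, 2, 3, 4, 5]             # values to sum per group
groups, group_of = add_group_column(c1)
sums = aggregated_sum(groups, c2, len(group_of))
```

The sort-and-group stage of the CPU variant (arrow 602) disappears entirely; only the gather-free summation remains at execution time.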
  • FIG. 6B depicts performance of a join operation.
  • Arrow 612 graphically depicts performance by a standard CPU, which searches for the components of dataset C1 that are present in C2, and generates indices of the matches.
  • Dataflow graph 614 corresponds to the method of arrow 612.
  • Arrow 616 depicts pre-processing of the dataset C2 to generate a statistical value indicative that the data of C2 is sequential. Based on the statistical value, the join operation is transformed to a scatter-gather operation which may be executed with improved performance by SIMD implementation on the GPU.
  • Transformed dataflow graph 618 represents low-level operations executable on the GPU.
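The idea of FIG. 6B can be sketched as follows (illustrative, assuming "sequential" means C2 increases by one per row): once the pre-computed statistical value confirms C2 is sequential, the per-element search of the general join is replaced by direct index arithmetic, a scatter/gather pattern suited to SIMD execution:

```python
# Sketch of FIG. 6B's idea: a join against a sequential column needs no
# search -- match indices follow from simple arithmetic.

def is_sequential(c2):
    """The pre-computed statistical value: does C2 increase by 1 per row?"""
    return all(c2[i] == c2[0] + i for i in range(len(c2)))

def join_by_search(c1, c2):
    """General join: search C2 for each element of C1 (CPU-style)."""
    return [c2.index(x) for x in c1 if x in c2]

def join_by_gather(c1, c2):
    """Transformed join: direct index arithmetic, valid only when sequential."""
    base = c2[0]
    return [x - base for x in c1 if 0 <= x - base < len(c2)]

c1 = [12, 10, 14]
c2 = [10, 11, 12, 13, 14]
assert is_sequential(c2)          # pre-processing enables the transformation
indices = join_by_gather(c1, c2)
```

Both variants produce the same match indices; the gather form simply does it without any per-element search, which is what makes the GPU implementation profitable.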
  • FIG. 6C depicts standard CPU and GPU processing on the SQL operation 620: SELECT c_1, c_k FROM t1 WHERE some_pred(c_p);, which filters multiple columns using the same criterion.
  • Arrow 622 and associated dataflow graph 624 depict performance of the operation on a standard CPU using row-store based operations.
  • Arrow 626 and associated dataflow graph 628 depict performance of the operation on a standard CPU using column-store based operations.
  • Arrow 630 and associated transformed dataflow graph 632 depict performance of the operation using pre-processing operations which improve execution on a GPU.
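A sketch of the column-store variant of FIG. 6C (an illustrative simplification, not the patent's implementation): the shared predicate is evaluated once over column c_p, and the surviving row indices are then used to gather each selected column:

```python
# Sketch of FIG. 6C's idea over a column store: evaluate some_pred(c_p)
# once, then gather the selected columns by the surviving row indices.

def filter_columns(columns, pred_col, pred):
    """SELECT <other columns> FROM t1 WHERE pred(columns[pred_col])."""
    keep = [i for i, v in enumerate(columns[pred_col]) if pred(v)]
    return {name: [col[i] for i in keep]
            for name, col in columns.items() if name != pred_col}

table = {
    "c_1": [10, 20, 30, 40],
    "c_k": ["w", "x", "y", "z"],
    "c_p": [1, 0, 1, 0],
}
selected = filter_columns(table, "c_p", lambda v: v == 1)
```

Evaluating the predicate once and gathering per column avoids the per-row branching of the row-store variant, which is the property that favors GPU execution.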
  • FIG. 6D is a table summarizing FIGs. 6A-6C.
  • the table has columns op (depicting the high-level operation), pre-processing (depicting what pre-processing is performed), transformation type (which depicts transformation of high-level operations to low-level operations), SIMD-op (depicting low-level operation implementations on a GPU), and MTC-op (depicting low-level operation implementations on a CPU).
  • the table summarizes the three high-level operations join (FIG. 6B), select-aggregate (FIG. 6A), and select (FIG. 6C), and the pre-processing required to implement the respective high-level operation on the designated processor using respective low-level operations.
  • composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.

Abstract

An apparatus adapted to transform nodes of a flow graph for execution on an, in particular distributed, processing system, comprising: an interface adapted to receive a dataflow graph that comprises nodes, each node representing a high-level operation; a compiler adapted to: transform at least one high-level operation node to at least one low-level operation node corresponding to the at least one high-level operation to create a transformed dataflow graph, the at least one low-level operation adapted for execution on a processor of a plurality of processors of a runtime environment executing the transformed dataflow graph, the transform performed according to a performance measure calculated for each processor executing the at least one high-level operation using the at least one low-level operation adapted for execution by the respective processor.

Description

Title: SYSTEMS AND METHODS FOR TRANSFORMATION OF A
DATAFLOW GRAPH FOR EXECUTION ON A PROCESSING SYSTEM
BACKGROUND
The present invention, in some embodiments thereof, relates to program execution on a processing system, in particular a heterogeneous system, and, more specifically, but not exclusively, to systems and methods for transformation of a dataflow graph of a computer program for execution on the processing system.
Heterogeneous systems include a set of interconnected processors, each of which is based on a different computer architecture and computing model. Examples of such processors include graphics processing units (GPU), which are parallel in nature and based on a single instruction multiple data (SIMD) computing model, multi-threaded central processing units (CPU), each of which is serial in nature, and general purpose field programmable gate arrays (FPGA), which provide various intermediate forms of computing models.
Computer programs written in a high-level abstract programming language, such as a domain specific language (DSL), may be parsed to an intermediate representation (IR), such as a dataflow graph (DFG). The DFG contains nodes that represent computing operations, which are selected from an operation set defined by the DSL. Edges of the DFG represent data relations between the computing nodes.
The same DFG may be executed within the same heterogeneous system at different performance levels, depending on multiple factors, such as which particular processor of the heterogeneous system is executing the DFG. For example, execution of the DFG by the GPU may be much quicker than execution of the same DFG by the CPU. In another example, execution by the CPU may be quicker than execution by the GPU. Other additional performance affecting parameters include: the data format of the data being processed by the DFG, the input received by the executing DFG, and the way in which computing resources are allocated for the different computing nodes of the DFG.
SUMMARY
It is an object of the present invention to provide an apparatus, a system, a computer program product, and a method for transforming nodes of a dataflow graph for execution in a processing system.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, an apparatus adapted to transform nodes of a flow graph for execution on an, in particular distributed, processing system, comprises: an interface adapted to receive a dataflow graph that comprises nodes, each node representing a high-level operation; a compiler adapted to: transform at least one high-level operation node to at least one low-level operation node corresponding to the at least one high-level operation to create a transformed dataflow graph, the at least one low-level operation adapted for execution on a processor of a plurality of processors of a runtime environment executing the transformed dataflow graph, the transform performed according to a performance measure calculated for each processor executing the at least one high-level operation using the at least one low-level operation adapted for execution by the respective processor.
Processor specific optimizations are performed at the front end, before compilation of the program, to improve execution of the dataflow graph representation of the computer program. Optimizations are performed at the dataflow graph level, which allows for a compiler to further optimize the dataflow graph before compilation and execution. Each processor has its own low-level operation designed for optimal performance, instead of, for example, mapping the high-level operation to a common low-level operation designed for execution on all or multiple processors, which results in lower performance.
The graph representation allows for application of standard graph optimization methods. The graph representation provides compatibility with existing system components that parse a computer program written in a high-level language into a dataflow graph representation.
In a first possible implementation of the apparatus according to the first aspect, the apparatus further comprises a set of low-level operations defined for each processor, each set including low-level operations, each low-level operation adapted for a variation of data for processing by the computer program, wherein the at least one low-level operation is selected from the set corresponding to the processor.
Each low-level operation is designed for optimal performance on the corresponding target processor, instead of, for example, the same high-level operation being compiled to one of multiple available target processors, which results in lower performance. Each low-level operation may be optimally designed for different data formats, instead of, for example, the same high-level operation which is designed for a common data format, which results in lower performance.
In a second possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the at least one high-level operation is based on abstract operations defined by a domain specific language (DSL) used to write the computer program, each high-level operation being mappable to a plurality of low-level operations for execution on different processors.
Processor specific optimizations are performed for DSL programming languages, for example, for the R programming language for statistical computing, and for the SQL (structured query language) programming language for databases. The high-level operation is mappable to one of different available combinations or subsets of low-level operations (i.e. not necessarily in a 1 : 1 manner), which provides improved performance when the best combination of low-level operations is selected.
In a third possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the performance measure includes one or more members of the group consisting of: lower computation time, lower computation complexity, lower expenditure of energy, and lower momentary power consumption.
The processor and/or low-level operations may be selected to achieve particular desired improvements in performance.
In a fourth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the interface is adapted to receive a dataset for processing by the transformed dataflow graph, and the transform is performed according to the performance measure calculated according to processing of the dataset.
Different transformations may be performed for different datasets, to improve processing performance of the actual dataset. Performance of database management systems (DBMS) and/or data warehouses (DWH) is improved, by selecting the best processor and/or low-level operations for the database being accessed.
In a fifth possible implementation form of the apparatus according to the fourth implementation forms of the first aspect, the apparatus further comprises a preprocessing module adapted to generate a plurality of instances of the dataset, each instance adapted for processing by at least one low-level operation executed by a processor of the plurality of processors.
Creation of the multiple instances of the dataset improves performance, as each instance is designed for more efficient execution performance using the corresponding processor and/or low-level operations. The multiple instances are generated in advance of program execution, to further improve performance.
In a sixth possible implementation form of the apparatus according to the fourth or fifth implementation forms of the first aspect, the apparatus further comprises a pre-processing module adapted to generate at least one statistical value according to an analysis of the dataset, and wherein the transform is performed according to the at least one statistical value.
The processor and/or low-level operations are selected for best performance during processing of the data according to characteristics of the data itself.
In a seventh possible implementation form of the apparatus according to the fourth, fifth, or sixth implementation forms of the first aspect, the transformation is performed according to the performance measure of the processor executing the at least one low-level operation on the dataset in relation to other processors executing the at least one low-level operation on the dataset.
The processor and/or the low-level operations are selected for execution to improve performance. The performance measure allows for selection of one processor over another, or for particular low-level operations over other operations.
In an eighth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the transform comprises insert at least one data copy node in the dataflow graph, the copy node defining copying of data between memories of different processors.
The data-copy node allows the flow graph to represent the low-level operation describing data communication between different processors, which is used to improve performance when different processors transfer data between each other.
In a ninth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the transform comprises insert at least one data processing node in the dataflow graph, each data processing node defining a member selected from the group consisting of: conversion of data from one format to another format, partition split of data to memories of different processors, and joining of two or more data items from memories of different processors.
The data-processing node allows the flow graph to represent the low-level operations describing processing of the data involving communication between different processors, which is used to improve performance when different processors work together on data.
In a tenth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the transform is performed according to one or more of the following: transform a dataflow graph node of a high-level operation into a subgraph that includes a plurality of the low-level operations, wherein the subgraph has the same semantic as the node; transform a subgraph of high-level operations of the dataflow graph into a single node representing a single low-level operation, wherein the single node has the same semantic as the subgraph; transform a first subgraph of high-level operations of the dataflow graph into a second subgraph of low-level operations, wherein the first subgraph and second subgraph have the same semantic.
Portions of the graph are transformed to improve performance.
In an eleventh possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the compiler is further adapted to iteratively transform the transformed dataflow graph, by transforming the transformed at least one low-level operation node into another at least one low-level operation node.
Iterative transformations may further improve performance.
In a twelfth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the runtime environment executes the transformed dataflow graph.
The transformed graph itself may be optimized using graph optimization methods, and then executed in a standard manner by the runtime environment, providing compatibility with existing systems.
In a thirteenth possible implementation form there is provided a method for transforming a dataflow graph IR, wherein the method is adapted to operate an apparatus according to one of the preceding claims.
In a fourteenth possible implementation form, there is provided a computer program that runs the preceding method when executed on a computer.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a flowchart of a method for transforming nodes of a dataflow graph, in accordance with some embodiments of the present invention;
FIG. 2 is a block diagram of components of a system including an apparatus for transforming nodes of a dataflow graph, in accordance with some embodiments of the present invention;
FIG. 3 is a flowchart of some possible transformations performed by the apparatus of FIG. 2 and/or based on the method of FIG. 1, in accordance with some embodiments of the present invention;
FIG. 4 is an example of a subgraph of a transformed dataflow graph, in accordance with some embodiments of the present invention;
FIG. 5 is a block diagram of a system for compilation and execution of a transformed dataflow graph, incorporating the apparatus of FIG. 2, in accordance with some embodiments of the present invention; and
FIGs. 6A-6D are schematic diagrams depicting generation of dataset instances and/or low-level instructions for transformation, designated for improved performance on different processor architectures, in accordance with some embodiments of the present invention.
DETAILED DESCRIPTION
The present invention, in some embodiments thereof, relates to program execution on a processing system, in particular a heterogeneous system, and, more specifically, but not exclusively, to systems and methods for transformation of a dataflow graph of a computer program for execution on the processing system.
An aspect of some embodiments of the present invention relates to a compiler that creates a transformed dataflow graph by transforming one or more high-level operation nodes of a dataflow graph to one or more low-level operation nodes that are designed for execution on a certain processor of multiple processors of a runtime environment. Optionally, the runtime environment is a distributed processing system including different processor architectures, for example, a heterogeneous system. Each of the multiple processors is associated with a different set of low-level operations designed for execution on the respective processor. The compiler performs the transformation according to one or more performance measures calculated for each (or a subset) of the multiple processors executing their low-level operations corresponding to the same high-level operation of the dataflow graph. In this manner, the best performing low-level operations and corresponding processor are designated for the high-level operation node(s), before the transformed dataflow graph is executed.
Optionally, each processor is associated with a predefined set of low-level operations designed for variations of the data designated for processing by the dataflow graph. The low-level operations are designed for improved performance during execution of the data variation by the respective processor. For example, two low-level operations corresponding to joining two datasets together may be available, where each low-level operation is designed to improve performance of the join operation according to characteristics of the datasets, such as when one of the datasets is sorted. The low-level operation nodes are selected by the compiler from the set according to the designated processor and/or the performance measure. Alternatively or additionally, the processor is designated according to availability of suitable low-level operations.
Optionally, the performance measure is calculated (e.g., by the compiler) for different processors available to execute portions of the dataflow graph processing the dataset. Different characteristics of datasets may result in different performance measures. The compiler may select the low-level operations and corresponding processor for execution of the dataset according to the calculated performance measure.
Optionally, a pre-processing module in communication with the compiler generates multiple instances of the dataset, which store the same information in different formats. Each instance is designed for processing using low-level operations by a different processor. The compiler may select the instance to execute on the respective processor according to the performance measure. The low-level operations and corresponding processor may be designated based on the selected instance. Optionally, the compiler inserts one or more data copy nodes in the transformed dataflow graph. The data copy nodes define copying of data between memories associated with different processors. The data copy nodes may be inserted when different processors are designated to execute different parts of the transformed dataflow graph, in order to pass data between the different processors.
It is noted that the compiler as described herein may be implemented within an apparatus, as a program module (in hardware and/or software), as a system, as a method, and/or as a computer program product.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is made to FIG. 1, which is a flowchart of a method for transforming nodes of a dataflow graph for execution in a processing system, optionally a distributed heterogeneous system, in accordance with some embodiments of the present invention. Reference is also made to FIG. 2, which is a block diagram of components of a system that improves execution performance of high-level operation nodes of a dataflow graph, by transforming the high-level operation nodes into low-level operation nodes designated for execution at certain processor(s), in accordance with some embodiments of the present invention. The method of FIG. 1 may be executed by the apparatus and/or system of FIG. 2.
Processor specific optimizations are performed at the front end, before compilation of the program, to improve execution of the dataflow graph representation of the computer program. Optimizations are performed at the dataflow graph level, which allows for a compiler to further optimize the dataflow graph before compilation and execution. Each processor has its own low-level operation set designed for optimal performance, instead of, for example, mapping the high-level operation to a common low-level operation designed for execution on all or multiple processors, which results in lower performance.
At 102, a dataflow graph 202 that includes multiple operation nodes is received by an apparatus 204, optionally via an interface 206 (e.g., a network connection, a hard-drive, an external memory card, a connection to an internal bus, or an abstract interface such as an application programming interface (API)). Each node of the dataflow graph represents a high-level operation, for example, joining of two datasets, searching within a dataset, aggregation of data, and selection of a sub-set of the dataset. The dataflow graph is generated from a computer program, optionally from source code of the computer program, for example, by another compiler module. The flow graph models functions in the source code as nodes. Data flow and/or data relations between the functions (i.e., nodes) are represented as edges between the nodes. The computer program may be a whole computer program, a part of a computer program, and/or a single algorithm.
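By way of non-limiting illustration, such a dataflow graph may be sketched as nodes holding high-level operations, with data dependencies modeled as input references. The class name, operation names, and traversal helper below are exemplary only and do not appear in the specification:

```python
# Illustrative sketch of a dataflow graph: each node carries a high-level
# operation and references to the upstream nodes whose output it consumes.
class Node:
    def __init__(self, op, inputs=()):
        self.op = op                # high-level operation, e.g. "join", "select"
        self.inputs = list(inputs)  # upstream nodes (edges of the graph)

# Build a toy graph: scan a dataset, select a sub-set, then aggregate it.
scan = Node("scan")
select = Node("select", inputs=[scan])
aggregate = Node("aggregate", inputs=[select])

def upstream_ops(node):
    """Collect the operations this node (transitively) depends on."""
    seen, stack = [], list(node.inputs)
    while stack:
        n = stack.pop()
        seen.append(n.op)
        stack.extend(n.inputs)
    return seen
```

A compiler working on such a structure can walk the edges to find the sub-graphs that are candidates for transformation.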
The graph representation allows for utilization of standard graph optimization methods. The graph representation provides compatibility with existing system components that parse a computer program written in a high-level language into a dataflow graph representation.
The dataflow graph is designed for execution within a runtime execution environment 214 including multiple processors 216A-C (it is noted that more or fewer processors may exist within the execution environment). Runtime execution environment 214 may be organized as a distributed processing system, optionally a heterogeneous distributed processing system including multiple different types of processors. Processors 216A-C may be dissimilar, optionally operating using different instruction set architectures (ISAs). Processors 216A-C may have different architectural designs, for example, central processing units (CPUs), graphical processing units (GPUs), field programmable gate arrays (FPGA), processors for interfacing with other units, and/or specialized hardware accelerators (e.g., encoders, decoders, and cryptography co-processors). The dataflow graph may represent a distributed computer program.
The high-level operations may be based on abstract operations defined by a domain specific language (DSL) 208 used to write the computer program. Each high-level operation may be mapped to multiple different low-level operations for execution on different processors. The DSL may provide a higher level of abstraction of data types, and/or more general use of abstract data types than would be available with other programming languages, such as low-level programming languages, and/or programming languages that were not specifically designed to handle problems in the same domain as the DSL. The DSL may be a pre-existing available DSL, or a custom developed DSL, for example, the R programming language for statistical computing, and the SQL (structured query language) programming language for databases.
Optionally, at 104, a dataset 210 is received and/or accessed by apparatus 204 (optionally via interface 206). Dataset 210 is designated for processing by the transformed dataflow graph of the computer program (transformation of the dataflow graph is described, for example, with reference to block 112), for example, a database on which queries defined by the dataflow graph are executed. Dataset 210 may be stored on a local memory and/or remote server in communication with apparatus 204.
At 106, multiple instances of dataset 210 are generated. Each instance is designed for processing by one or more low-level operations executable by a certain processor of the processors of the execution environment. Each instance contains the same data within dataset 210 in a different format, for example, the same data is organized in a different data structure, and the dataset is split into two sub-sets. Alternatively or additionally, each instance contains the same data organized differently. For example, data may be pre-sorted by one column in one instance and pre-sorted by another column in another instance, and an additional column is added to designate groups of dataset.
The instances may be generated according to the set of available low-level operations associated with each processor. The instance may be generated to improve performance of execution by the respective processor using the respective low-level operations. For example, the instance may be in a format suitable for execution within a parallel processor, such as a GPU. The instance may be generated to reduce the number of low-level processor instructions required to execute high-level operations, for example, pre-sorting the dataset and/or adding another column designating group numbers may reduce the number of low-level processor instructions required to aggregate the dataset, as compared to the non-presorted dataset.
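For example, the pre-sorted instance and the group-number instance described above may be sketched as follows; the column names follow the style of FIG. 6A, and the implementation itself is illustrative only, not part of the specification:

```python
# Illustrative sketch: two instances of the same dataset in different formats.
def sorted_instance(rows, key):
    # Instance 1: same data, pre-sorted by one column.
    return sorted(rows, key=lambda r: r[key])

def grouped_instance(rows, key):
    # Instance 2: same data, pre-sorted and augmented with a "group" column
    # that maps each distinct key value to a group number.
    groups = {}
    out = []
    for r in sorted(rows, key=lambda r: r[key]):
        g = groups.setdefault(r[key], len(groups))
        out.append({**r, "group": g})
    return out

data = [{"c1": "b", "c2": 2}, {"c1": "a", "c2": 1}, {"c1": "b", "c2": 3}]
```

Both instances store the same information; the group-number instance lets an aggregation be executed without re-sorting or re-grouping at run time.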
Creation of the multiple instances of the dataset improves performance, as each instance is designed for more efficient execution performance using the corresponding processor and/or low-level operations. The multiple instances are generated in advance of program execution, to further improve performance.
The instances may be generated by a pre-processing module 212, which may be a component of apparatus 204 and/or an external component (e.g. residing on a local computer and/or remote server) in association with apparatus 204.
Optionally, pre-processing module 212 analyzes dataset 210 and generates one or more statistical values based on the analysis. Examples of statistical values include: distribution of size of data within the dataset, and organization of the data within the dataset (e.g., sorted or non-sorted). The transformation of the dataflow graph (e.g., as described with reference to block 112) may be performed according to the calculated statistical values.
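The kind of statistical values such a pre-processing module might compute can be sketched as follows (the chosen statistics are exemplary only):

```python
# Illustrative sketch: statistics a pre-processing module might derive from
# one column of the dataset to guide the graph transformation.
def dataset_statistics(column):
    return {
        "size": len(column),
        # Whether the column is already sorted (sorted data may allow a
        # cheaper low-level operation variant to be designated).
        "sorted": all(a <= b for a, b in zip(column, column[1:])),
        # Number of distinct values, a proxy for the value distribution.
        "distinct": len(set(column)),
    }
```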
It is noted that blocks 104 and 106 may occur prior to and/or independently of block 102. Dataset 210 may be received prior to and/or independently of receiving the dataflow graph. Dataset 210 may be pre-processed, to generate the instances prior to receiving the dataflow graph, such that the instances are available for use when the dataflow graph is received.
At 108, one or more performance measures are calculated for each processor executing the high-level operation(s) using the low-level operations designed for execution by the respective processor. The same high-level operation may be mapped to different low-level operations executed by different processors. Although the same result is obtained when the same high-level operation is executed by different processors, each processor may have different performance measure values associated with performing the low-level operation that corresponds to the high-level operation.
Optionally, one or more performance measures are calculated for a sub-set of low-level operations designated from a set of operations 218 associated with the respective processor. Each processor is associated with its own set of low-level operations. For each processor, the same high-level operation may be mapped to different sub-sets of low-level operations selected from the set. The set of low-level operations may include variations of the same operation, for example, for processing different formats of data, for processing different sizes of data, and for processing data with different statistical values (e.g., distribution). Each set of low-level operations is designed to execute a variation of the dataset. The instances of the dataset (e.g., described with reference to block 106) may be generated according to the different sub-sets of low-level operations. Although the same result is obtained when the same high-level operation is executed by different sub-sets of low-level operations of the same processor, each sub-set of low-level operations may have different performance measure values associated with performing the sub-set of low-level operations corresponding to the same high-level operation.
The performance measure may be calculated based on each processor executing the low-level operations (or sub-sets thereof) on dataset 210, optionally on each generated instance of dataset 210. The performance measure may be estimated based on past performance of the processor executing a similar dataset (e.g., a dataset with similar calculated statistical values). The calculated performance measure may depend on the available system resources, for example, processors, memory, and bandwidth for data transfer between processors.
The performance measure may include one or more of: computation time, computation complexity, expenditure of energy, and momentary power consumption. The performance measure may be absolute, or a relative measure between processors. The processor may be selected according to the lowest value of the calculated performance measure, absolute and/or relative. The processor and/or the low-level operations are designated for execution of the transformed high-level operation node(s) to improve performance according to the selected performance measures. The performance measure allows for selection of one processor over another, or for particular low-level operations over other operations.
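The designation step can be sketched minimally as choosing the candidate with the lowest estimated cost; the processor names and cost values below are purely illustrative:

```python
# Illustrative sketch: designate the processor whose candidate low-level
# operation set has the lowest calculated performance measure for one
# high-level operation node.
def designate(candidates):
    """candidates: mapping of processor -> estimated performance measure
    (e.g., computation time) for the same high-level operation."""
    return min(candidates, key=candidates.get)

# Hypothetical measures for one node: the GPU's low-level operations are
# estimated to be cheapest, so the GPU is designated.
costs = {"cpu": 12.0, "gpu": 3.5, "fpga": 7.1}
```

In practice the measure may combine several factors (time, energy, power), for example as a weighted sum, before the minimum is taken.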
Set of operations 218 may be stored in association with each respective processor 216A-C, for example, on the memory associated with each processor. Set of operations 218 may be stored external to the processor, accessible and/or in communication with the processor, for example as a component of apparatus 204, on a remote server, and/or on a central local server.
Set of operations 218 may be manually defined by a programmer, for example, manually written for each processor based on the architecture of the respective processor. For example, the same join operation may be defined using different low-level instructions on the same processor, to achieve the same result with different performance measures.
Each low-level operation is designed for optimal performance on the corresponding target processor, instead of, for example, the same high-level operation being compiled for one of multiple available target processors, which results in lower performance. Each low-level operation may be optimally designed for a different data format, instead of, for example, the same high-level operation being designed for a common data format, which results in lower performance.
At 110, for each high-level operation node, or group of nodes (e.g., a subgraph), the certain processor out of the available processors 216A-C of runtime environment 214 is designated. Optionally, the sub-set of low-level operations is designated from the set of low-level operations defined for the designated processor. It is noted that the processor and associated sub-set of low-level operations may be designated simultaneously, or sequentially.
Optionally, the processor and/or low-level operations are designated according to the calculated performance measures, for example, according to requirements, such as a function of the performance measures, a range, and/or a threshold. Optionally, designation is performed according to the performance measure of the processor executing its respective low-level operation on the dataset in relation to other processors executing their respective low-level operation on the dataset. The processor and/or low-level operations may be designated to achieve particular desired improvements in performance. For example, in some cases, cost may be the primary factor, while in other cases computation time may be the primary factor. The processor and/or low-level operations are designated for best performance during processing of the data according to characteristics of the data itself.
At 112, the dataflow graph is transformed by a compiler 220 component of apparatus 204 into a transformed dataflow graph 222 according to the designated processor and/or low-level operations. Optionally, the transformation is performed simultaneously with the designation of block 110. Alternatively, designation of block 110 and the transformation are performed sequentially.
Compiler 220 transforms one or more high-level operation nodes to one or more low-level operation nodes corresponding to the high-level operation of the node, to create transformed dataflow graph 222. The low-level operations are designed for execution on a processor of processors 216A-C of runtime environment 214 executing transformed dataflow graph 222. The transform of each high-level operation node (or group thereof) is performed according to the performance measure calculated for the different possible low-level operation nodes, for example, the low-level operation nodes having the lowest calculated performance measure values. Optionally, the transform is performed according to the performance measure calculated according to the processor processing the dataset using the low-level operations.
Different transformations may be performed for different datasets, to improve processing performance of the actual dataset. Performance of database management systems (DBMS) and/or data warehouses (DWH) is improved, by selecting the best processor and/or low-level operations for the database being accessed.
Reference is now made to FIG. 3, which is a flowchart of some possible transformations performed by compiler 220, in accordance with some embodiments of the present invention. It is noted that different parts (e.g., nodes or sub-graphs) of the same dataflow graph may be transformed for execution by different processors and/or using different low-level operations. Portions of the graph are transformed to improve local performance of each transformed portion, and/or global performance of the dataflow graph. For example, local portions of the graph are transformed to improve performance efficiency measure values calculated for the local portions (e.g., Turing complete complexity). For example, global transformations are performed based on greedy rules applied at a local level, which ensure that the local transformations are globally beneficial.
The transformation may be performed according to a set of rules, a transformation algorithm, graph processing methods, or other methods; for example, graph transformations may be evaluated for validity before a subgraph transformation is applied.
Graph transformations may be defined by a set of APIs.
Optionally, at 302, the dataflow graph (or portion thereof) is transformed by inserting one or more data copy nodes in the dataflow graph. The copy node defines copying of data between memories of different processors. The copy node may be inserted, for example, between two data processing nodes designated for execution by different processors, to allow communication of the processed data between the different processors.
The data-copy node allows the flow graph to represent the low-level operation describing data communication between different processors, which is used to improve performance when different processors need to transfer data between each other.
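Insertion of such data-copy nodes can be sketched as a pass over the graph edges that splices a copy node into every edge whose endpoints are designated to different processors; the node and processor names below are illustrative only:

```python
# Illustrative sketch: insert a data-copy node on each edge that crosses
# a processor boundary, so the transformed graph represents the required
# inter-processor data communication explicitly.
def insert_copy_nodes(edges, placement):
    """edges: list of (src, dst) node-name pairs.
    placement: mapping of node name -> designated processor."""
    out = []
    for src, dst in edges:
        if placement[src] != placement[dst]:
            copy = f"copy_{src}_to_{dst}"  # hypothetical naming scheme
            out.append((src, copy))
            out.append((copy, dst))
        else:
            out.append((src, dst))
    return out

# The "aggregate" node is designated to a different processor than its
# producer, so one copy node is spliced in.
edges = [("scan", "join"), ("join", "aggregate")]
placement = {"scan": "cpu", "join": "cpu", "aggregate": "gpu"}
```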
Alternatively or additionally, at 304, the dataflow graph (or portion thereof) is transformed by inserting one or more data processing nodes in the dataflow graph. The data processing node defines one of: conversion of data from one format to another format, partition split of data to memories of different processors, and joining of two or more data items from memories of different processors. The data processing node may be inserted according to the designated low-level operations and/or designated processors, to prepare data for processing using different low-level operation sets and/or different processors.
The data-processing node allows the flow graph to represent the low-level operations describing processing of the data involving communication between different processors, which is used to improve performance when different processors work together.
Alternatively or additionally, at 306, a dataflow graph node of a high-level operation is transformed into a subgraph that includes multiple low-level operations. The replacement subgraph has the same semantic as the original node. Such a transformation may be performed, for example, to transform a complex high-level operation into multiple simpler low-complexity operations. The set of lower complexity operations may be designated based on performance efficiency over the higher complexity operation.
Alternatively or additionally, at 308, a subgraph of high-level operations of the dataflow graph is transformed into a single node representing a single low-level operation. The single node has the same semantic as the subgraph. Such a transformation may be performed, for example, to simplify multiple operations into a single operation when the single operation has improved performance efficiency over the multiple operations.
Alternatively or additionally, at 310, an original subgraph of high-level operations of the dataflow graph is transformed into another subgraph of low-level operations. The original subgraph and transformed subgraph have the same semantic. The subgraph transformation may be performed, for example, to designate one computing approach over another when the transformed approach has improved performance efficiency over the original approach.
Alternatively or additionally, at 312, an original node of high-level operations of the dataflow graph is transformed into another node of low-level operations. The original node and transformed node have the same semantic. The node transformation may be performed, for example, to replace a high-level operation by a corresponding low-level operation when the transformed node has improved performance efficiency over the original node.
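The node-to-subgraph and node-to-node transformations of blocks 306 and 312 can be sketched together as lookups in a rewrite table keyed by the high-level operation and the designated processor; the operation names and table entries below are hypothetical examples, not part of the specification:

```python
# Illustrative sketch: a rewrite table mapping (high-level operation,
# designated processor) to a semantically equivalent low-level node or
# subgraph. A one-element result is a node-to-node transformation (312);
# a multi-element result is a node-to-subgraph transformation (306).
REWRITES = {
    ("aggregate", "gpu"): ["assign_groups", "segmented_sum"],  # node -> subgraph
    ("aggregate", "cpu"): ["sort_group_sum"],                  # node -> node
}

def rewrite(op, processor):
    # Fall back to the original operation when no rewrite is defined.
    return REWRITES.get((op, processor), [op])
```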
Reference is now made to FIG. 4, which is an example of a subgraph 400 of a transformed dataflow graph, in accordance with some embodiments of the present invention. The original dataflow graph (i.e., before transformations) includes nodes 402, 404, 406, 408, and 410. The original graph depicts a program that pre-processes data at node 402. An operation is performed at node 404. The output of node 404 is processed by either node 406, which applies one operation, or node 408, which applies another operation. The outputs of nodes 406 and 408 are combined by node 410, which performs another operation.
The transformation adds copy nodes 452, 454, 456, 458 and 460 between the original nodes. At node 452, data Dj provided by node 402 is copied to the memory of a first designated processor. At node 404, the operation is performed on the copied data Dj. Output Dj is then distributed to two different processors (i.e., second and third) for distributed processing. At 454, data Dj is copied to data Di at the memory of the second processor. At 406, the original operation is applied to data Di. At 458, data Di is copied to data Dq at a fourth processor. At 456, data Dj is copied to data Dw at the memory of the third processor. At 408, the original operation is applied to data Dw. At 460, data Dw is copied to data Dq at the fourth processor. At 410, the original operation is applied by the fourth processor to combined data Dq.
One or more of the original nodes 402, 404, 406, 408, and 410 may be transformed into node(s) and/or a subgraph of low-level operations designated to be performed by the designated processor. For example, original node 406 is transformed into subgraph 470, which includes low-level operations designated to be performed by the second processor. It is noted that subgraph 470 is semantically similar to original node 406, generating the same results for the same input. It is noted that subgraph 470 has improved performance (e.g., according to the performance measure) as compared to original node 406.
Referring now back to FIG. 1, optionally, at 114, compiler 220 iteratively transforms transformed dataflow graph 222 to generate an updated version of the transformed dataflow graph. Each node and/or subgraph may be analyzed for additional transformations, for example, as described herein in reference to the original dataflow graph. Each transformed low-level operation node or sub-graph of nodes may be analyzed and/or transformed into another low-level operation node or sub-graph of nodes.
Iterative transformations may further improve performance, by further analyzing the transformed dataflow graph for additional transformations that lead to additional performance improvements.
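Such iteration can be sketched as repeatedly applying a transformation pass until the graph stops changing (a fixpoint); the toy pass below, which fuses adjacent identical operations, is an illustrative example only:

```python
# Illustrative sketch: apply a transformation pass repeatedly until no
# further change occurs, bounding the number of rounds for safety.
def iterate_to_fixpoint(graph, pass_fn, max_rounds=10):
    for _ in range(max_rounds):
        new_graph = pass_fn(graph)
        if new_graph == graph:
            break  # fixpoint reached: no further transformation applies
        graph = new_graph
    return graph

# Toy pass over a linear chain of operations: collapse consecutive
# duplicate "select" operations into one.
def fuse_selects(ops):
    out = []
    for op in ops:
        if out and out[-1] == "select" and op == "select":
            continue  # drop the redundant duplicate
        out.append(op)
    return out
```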
At 116, runtime environment 214 executes transformed dataflow graph 222. Processors 216A-C execute designated portions of transformed dataflow graph 222, optionally using the designated sub-set of low-level operations defined for the respective processor, optionally on associated and/or designated instances of dataset 210. Optionally, the transformed graph is further optimized using graph optimization methods, for example, by another compiler module, such as an off-the-shelf back-end compiler module. The transformed graph (optionally the optimized transformed graph) may be executed in a standard manner by the runtime environment, providing compatibility with existing systems.
Reference is now made to FIG. 5, which is a block diagram of a system 500 for compilation and execution of a transformed dataflow graph, incorporating apparatus 204 of FIG. 2, in accordance with some embodiments of the present invention. System 500 may include existing off-the-shelf equipment and/or code modules integrated with apparatus 204 of FIG. 2.
System 500 includes a storage unit 502 that stores a dataset, for example, a database on which queries are executed, as described herein, such as corresponding to dataset 210 of FIG. 2. Storage unit 502 may include large storage capacity, which may be slow for loading and/or saving data.
A datastore unit 504 in communication with storage unit 502 may include a memory designed for fast data loading and/or storage. Datastore unit 504 may be quickly accessed by processors and/or compilers executing operations on the dataset.
A pre-processing module 512, corresponding to pre-processing module 212 of FIG. 2, is in communication with datastore 504. Pre-processing module 512 accesses the dataset on datastore 504, to generate multiple instances of the dataset (as described herein) based on available low-level operations 518 defined for each processor (corresponding to operation sets 218 of FIG. 2), which may be stored as code modules. The generated instances may be stored on datastore 504 for fast access by compilers and/or processors.
System 500 includes a front end (FE) compiler 506 that receives a source code written in a DSL, and parses the source code into a dataflow graph representation of operations, as described herein.
A back end (BE) compiler 520 (which corresponds to compiler 220 of FIG. 2) receives the dataflow graph, and creates a transformed dataflow graph, as described herein. The graph transformations are performed from high-level operations to processor-designated low-level operations, as described herein, such as according to: defined low-level operations 518, pre-existing low-level functions that are not necessarily mapped to a designated processor 530, a set of rules for performing the transformation 522 (as described herein), performance measure(s) calculated according to a resource allocation module 524, and standard graph optimization methods according to a graph transformation module 526.
The transformed dataflow graph is compiled, optionally according to a program-lib association, and scheduled for execution within a runtime environment 514 corresponding to runtime environment 214 of FIG. 2.
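The cost-driven lowering performed by BE compiler 520 can be illustrated with a minimal sketch. All names below (lower_node, the candidate table, the cost model) are hypothetical illustrations of the selection step, not code from the disclosed embodiments:

```python
# Hypothetical sketch of the BE compiler's selection step: for each
# high-level node, pick the candidate low-level implementation with the
# lowest performance measure among the available processors.

def lower_node(high_level_op, candidates, cost_model):
    """Return the (processor, low_level_op) pair with the lowest cost.

    candidates: mapping processor -> low-level operation name
    cost_model: function (processor, low_level_op) -> estimated cost
    """
    return min(candidates.items(),
               key=lambda pc: cost_model(pc[0], pc[1]))

# Example: a hypothetical 'group_sum' high-level op with CPU and GPU lowerings.
candidates = {"cpu": "sort_group_sum", "gpu": "segmented_sum"}
costs = {("cpu", "sort_group_sum"): 10.0, ("gpu", "segmented_sum"): 3.5}
processor, low_op = lower_node("group_sum", candidates,
                               lambda p, o: costs[(p, o)])
# The GPU lowering wins because its estimated cost (3.5) is lower.
```

In a fuller implementation the cost model would be supplied by resource allocation module 524 and could account for data transfer between processor memories, not only compute time.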
Reference is now made to FIGs. 6A-6D, which are schematic diagrams depicting generation of dataset instances and/or low-level instructions for transformation, designed for improved performance on different processor architectures, in accordance with some embodiments of the present invention. FIGs. 6A-6D depict database operations written in Structured Query Language (SQL) that are executed on instances of the dataset designed for improved performance when executed on a single instruction multiple data (SIMD) processor, such as a graphics processing unit (GPU), as compared to a central processing unit (CPU).
FIG. 6A depicts execution of the SQL query 600: SELECT sum(c2) FROM t1 GROUP BY c1;. The operation aggregates rows into groups according to the key index value, and sums the data of each aggregated group.
Operation 600 may be performed by a standard CPU, shown graphically by arrow 602. The unsorted data is sorted and grouped by the value of the index key. The data for each group is summed. Operation 600 may be represented as dataflow graph 604, which may be executed by the standard CPU.
Alternatively, using the systems and/or methods described herein, operation 600 may be transformed for execution by a SIMD processor (e.g., GPU), which may increase performance compared to processing of graph 604 and/or processing by the CPU (as depicted by arrow 602). Arrow 606 graphically depicts generation of an instance of the dataset that includes a group number column according to the value of the index key. Summation may be performed in a single executable function of an aggregated sum on the instance according to the group column. Transformed dataflow graph 608 depicts nodes for performing the single function by the SIMD processor.
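The FIG. 6A transformation can be illustrated with a minimal sketch. The function names (add_group_column, aggregated_sum) are illustrative assumptions, not code from the embodiments; on a GPU the summation loop would be a parallel segmented reduction rather than a Python loop:

```python
# Sketch of FIG. 6A: pre-processing assigns each row a dense group number
# for its key, so GROUP BY / SUM collapses into a single aggregated sum
# over the group-number column, which maps well onto SIMD hardware.

def add_group_column(keys):
    """Pre-processing: map each distinct key to a dense group number."""
    ids = {}
    return [ids.setdefault(k, len(ids)) for k in keys]

def aggregated_sum(group_ids, values):
    """Single-pass segmented sum over the group-number column."""
    sums = [0] * (max(group_ids) + 1)
    for g, v in zip(group_ids, values):
        sums[g] += v  # on a SIMD processor: parallel segmented reduction
    return sums

keys   = ["a", "b", "a", "c", "b"]
values = [1, 2, 3, 4, 5]
groups = add_group_column(keys)          # [0, 1, 0, 2, 1]
totals = aggregated_sum(groups, values)  # [4, 7, 4] -> sums for a, b, c
```

The CPU path of arrow 602, by contrast, sorts the rows by key before summing; the group-number column makes the sort unnecessary.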
FIG. 6B depicts performance of a join operation. Arrow 612 graphically depicts performance by a standard CPU, which searches for the components of dataset C1 that are present in C2, and generates indices of the matches. Dataflow graph 614 corresponds to the method of arrow 612.
Arrow 616 depicts pre-processing of the dataset C2 to generate a statistical value indicating that the data of C2 is sequential. Based on the statistical value, the join operation is transformed to a scatter-gather operation, which may be executed with improved performance by a SIMD implementation on the GPU. Transformed dataflow graph 618 represents low-level operations executable on the GPU.
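The FIG. 6B idea can be sketched as follows. The function names are illustrative assumptions: when pre-processing shows that C2 is a sequential run of integers, "find the index of each C1 value in C2" degenerates from a search-based join into simple arithmetic, which executes as a SIMD-friendly gather:

```python
# Generic CPU join: search C2 for each element of C1.
def join_indices_search(c1, c2):
    pos = {v: i for i, v in enumerate(c2)}
    return [pos[v] for v in c1 if v in pos]

# Transformed version, valid only when the statistical value shows
# C2 == range(c2_start, c2_start + c2_len): the index is v - c2_start.
def join_indices_sequential(c1, c2_start, c2_len):
    return [v - c2_start for v in c1 if c2_start <= v < c2_start + c2_len]

c1 = [12, 10, 15, 99]
c2 = list(range(10, 20))  # pre-processing detects: sequential from 10
assert join_indices_search(c1, c2) == join_indices_sequential(c1, 10, 10)
```

The arithmetic version needs no hash table or search, so each output index is independent of the others and can be computed by one SIMD lane.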
FIG. 6C depicts standard CPU and GPU processing of the SQL operation 620: SELECT c_1, c_k FROM t1 WHERE some_pred(c_p);, which filters multiple columns using the same criterion.
Arrow 622 and associated dataflow graph 624 depict performance of the operation on a standard CPU using row-store based operations.
Arrow 626 and associated dataflow graph 628 depict performance of the operation on a standard CPU using column-store based operations.
Arrow 630 and associated transformed dataflow graph 632 depict performance of the operation using pre-processing operations which improve execution on a GPU.
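The column-store approach of FIG. 6C can be sketched as follows. The function filter_columns is an illustrative assumption, not code from the embodiments: the predicate column c_p is evaluated once into a selection mask, and every projected column is then compacted with that same mask, a gather-style operation that vectorizes well:

```python
# Sketch of FIG. 6C: evaluate the predicate once, reuse the mask for
# every projected column (column-store style filtering).

def filter_columns(columns, pred_col, pred):
    """Apply one predicate mask to several columns."""
    mask = [pred(v) for v in pred_col]  # evaluate predicate exactly once
    return {name: [v for v, keep in zip(col, mask) if keep]
            for name, col in columns.items()}

table = {"c_1": [1, 2, 3, 4], "c_k": ["w", "x", "y", "z"]}
c_p = [5, 11, 7, 20]
out = filter_columns(table, c_p, lambda v: v > 10)
# out == {"c_1": [2, 4], "c_k": ["x", "z"]}
```

A row-store CPU implementation (arrow 622) instead re-evaluates or re-reads each row while materializing the output, which is harder to express as SIMD operations.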
FIG. 6D is a table summarizing FIGs. 6A-6C. The table has columns op (depicting the high-level operation), pre-processing (depicting what pre-processing is performed), transformation type (depicting the transformation of high-level operations to low-level operations), SIMD-op (depicting low-level operation implementations on a GPU), and MTC-op (depicting low-level operation implementations on a CPU).
The table summarizes the three high-level operations join (FIG. 6B), select & aggregate (FIG. 6A), and select (FIG. 6C), and the pre-processing required to implement the respective high-level operation on the designated processor using respective low-level operations.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant datasets, dataflow graphs, compilers, and processors will be developed and the scope of the terms datasets, dataflow graphs, compilers, and processors are intended to include all such new technologies a priori.
As used herein the term "about" refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of and "consisting essentially of.
The phrase "consisting essentially of means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:
1. An apparatus adapted to transform nodes of a flow graph for execution on a, in particular distributed, processing system, comprising:
an interface adapted to receive a dataflow graph that comprises nodes, each node representing a high-level operation;
a compiler adapted to:
transform at least one high-level operation node to at least one low-level operation node corresponding to the at least one high-level operation to create a transformed dataflow graph, the at least one low-level operation adapted for execution on a processor of a plurality of processors of a runtime environment executing the transformed dataflow graph, the transform performed according to a performance measure calculated for each processor executing the at least one high-level operation using the at least one low-level operation adapted for execution by the respective processor.
2. The apparatus of claim 1, further comprising a set of low-level operations defined for each processor, each set including low-level operations, each low-level operation adapted for a variation of data for processing by the computer program, wherein the at least one low-level operation is selected from the set corresponding to the processor.
3. The apparatus of any of the previous claims, wherein the at least one high-level operation is based on abstract operations defined by a domain specific language (DSL) used to write the computer program, each high-level operation being mappable to a plurality of low-level operations for execution on different processors.
4. The apparatus of any of the previous claims, wherein the performance measure includes one or more members of the group consisting of: lower computation time, lower computation complexity, lower expenditure of energy, and lower momentary power consumption.
5. The apparatus of any of the previous claims, wherein the interface is adapted to receive a dataset for processing by the transformed dataflow graph, and the transform is performed according to the performance measure calculated according to processing of the dataset.
6. The apparatus of claim 5, wherein the apparatus further comprises a pre-processing module adapted to generate a plurality of instances of the dataset, each instance adapted for processing by at least one low-level operation executed by a processor of the plurality of processors.
7. The apparatus of claims 5 or 6, wherein the apparatus further comprises a pre-processing module adapted to generate at least one statistical value according to an analysis of the dataset, and wherein the transform is performed according to the at least one statistical value.
8. The apparatus of any of claims 5-7, wherein the transformation is performed according to the performance measure of the processor executing the at least one low- level operation on the dataset in relation to other processors executing the at least one low-level operation on the dataset.
9. The apparatus of any of the previous claims, wherein the transform comprises inserting at least one data copy node in the dataflow graph, the copy node defining copying of data between memories of different processors.
10. The apparatus of any of the previous claims, wherein the transform comprises inserting at least one data processing node in the dataflow graph, each data processing node defining a member selected from the group consisting of: conversion of data from one format to another format, partition split of data to memories of different processors, and joining of two or more data items from memories of different processors.
11. The apparatus of any of the previous claims, wherein the transform is performed according to one or more of the following:
transform a dataflow graph node of a high-level operation into a subgraph that includes a plurality of the low-level operations, wherein the subgraph has the same semantics as the node;
transform a subgraph of high-level operations of the dataflow graph into a single node representing a single low-level operation, wherein the single node has the same semantics as the subgraph;
transform a first subgraph of high-level operations of the dataflow graph into a second subgraph of low-level operations, wherein the first subgraph and second subgraph have the same semantics.
12. The apparatus of any of the previous claims, wherein the compiler is further adapted to iteratively transform the transformed dataflow graph, by transforming the transformed at least one low-level operation node into another at least one low-level operation node.
13. The apparatus of any of the previous claims, wherein the runtime environment executes the transformed dataflow graph.
14. A method for transforming a dataflow graph IR, wherein the method is adapted to operate an apparatus according to one of the preceding claims.
15. A computer program that runs the preceding method when executed on a computer.
PCT/EP2015/059826 2015-05-05 2015-05-05 Systems and methods for transformation of a dataflow graph for execution on a processing system WO2016177405A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580079578.3A CN110149801A (en) 2015-05-05 2015-05-05 System and method for carrying out data flow diagram conversion in the processing system
PCT/EP2015/059826 WO2016177405A1 (en) 2015-05-05 2015-05-05 Systems and methods for transformation of a dataflow graph for execution on a processing system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2015/059826 WO2016177405A1 (en) 2015-05-05 2015-05-05 Systems and methods for transformation of a dataflow graph for execution on a processing system

Publications (1)

Publication Number Publication Date
WO2016177405A1 true WO2016177405A1 (en) 2016-11-10

Family

ID=53059100

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2015/059826 WO2016177405A1 (en) 2015-05-05 2015-05-05 Systems and methods for transformation of a dataflow graph for execution on a processing system

Country Status (2)

Country Link
CN (1) CN110149801A (en)
WO (1) WO2016177405A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112751905A (en) * 2019-10-31 2021-05-04 Abb瑞士股份有限公司 Distribution of tasks among multiple devices
EP3816793A1 (en) * 2019-10-31 2021-05-05 Siemens Aktiengesellschaft Method, system and execution unit for deploying software components of a software
WO2021154319A1 (en) * 2020-01-28 2021-08-05 Ab Initio Technology Llc Editor for generating computational graphs

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN110750265B (en) * 2019-09-06 2021-06-11 华中科技大学 High-level synthesis method and system for graph calculation
CN110704290B (en) * 2019-09-27 2024-02-13 百度在线网络技术(北京)有限公司 Log analysis method and device
CN113438124B (en) * 2021-06-07 2022-05-06 清华大学 Network measurement method and device based on intention driving

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2005006153A2 (en) * 2003-07-07 2005-01-20 Netezza Corporation Sql code generation for heterogeneous environment
US20070174828A1 (en) * 2006-01-25 2007-07-26 O'brien John Kevin P Apparatus and method for partitioning programs between a general purpose core and one or more accelerators
US20100138388A1 (en) * 2008-12-02 2010-06-03 Ab Initio Software Llc Mapping instances of a dataset within a data management system
US20100156888A1 (en) * 2008-12-23 2010-06-24 Intel Corporation Adaptive mapping for heterogeneous processing systems
US20120284255A1 (en) * 2011-05-02 2012-11-08 Ian Schechter Managing data queries

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
SE0300742D0 (en) * 2003-03-17 2003-03-17 Flow Computing Ab Data Flow Machine
AU2011323773B2 (en) * 2010-10-25 2015-07-23 Ab Initio Technology Llc Managing data set objects in a dataflow graph that represents a computer program
CN103713949A (en) * 2012-10-09 2014-04-09 鸿富锦精密工业(深圳)有限公司 System and method for dynamic task allocation


Cited By (7)

Publication number Priority date Publication date Assignee Title
CN112751905A (en) * 2019-10-31 2021-05-04 Abb瑞士股份有限公司 Distribution of tasks among multiple devices
EP3816800A1 (en) * 2019-10-31 2021-05-05 ABB Schweiz AG Assignment of tasks between a plurality of devices
EP3816793A1 (en) * 2019-10-31 2021-05-05 Siemens Aktiengesellschaft Method, system and execution unit for deploying software components of a software
US11740991B2 (en) 2019-10-31 2023-08-29 Abb Schweiz Ag Assignment of tasks between a plurality of devices
CN112751905B (en) * 2019-10-31 2024-04-05 Abb瑞士股份有限公司 Distribution of tasks among multiple devices
WO2021154319A1 (en) * 2020-01-28 2021-08-05 Ab Initio Technology Llc Editor for generating computational graphs
US11593380B2 (en) 2020-01-28 2023-02-28 Ab Initio Technology Llc Editor for generating computational graphs

Also Published As

Publication number Publication date
CN110149801A (en) 2019-08-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15721214

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15721214

Country of ref document: EP

Kind code of ref document: A1