WO2016116132A1 - Systems and methods for execution of algorithms on parallel heterogeneous systems - Google Patents

Info

Publication number
WO2016116132A1
Authority
WO
WIPO (PCT)
Prior art keywords
local
feedback information
rules
processing system
computer program
Application number
PCT/EP2015/050881
Other languages
French (fr)
Inventor
David MINOR
Natan Peterfreund
Eyal ROZENBERG
Adnan Agbaria
Ofer Rosenberg
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2015/050881
Priority to CN201580073388.0A
Publication of WO2016116132A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation

Definitions

  • the present invention in some embodiments thereof, relates to systems and methods for optimization of execution of programs and, more specifically, but not exclusively, to systems and methods for optimization of program execution within distributed and/or heterogeneous processing systems.
  • Heterogeneous processing platforms exhibit a wide variance in scale, processor instruction set architecture (ISA), communication architecture, and memory architecture.
  • For example, a cellular phone with a system-on-chip (SOC) consists of a combination of custom application-specific integrated circuits (ASICs), a graphics processing unit (GPU), and a small dual-core central processing unit (CPU), contrasted with a homogeneous processing system such as a computing cloud of 10,000 multi-core CPU servers.
  • the same program written in a high-level language may be compiled into machine executable code that is executed within different target heterogeneous distributed processing systems.
  • Each target processing system may have a different architecture.
  • the same machine executable instructions may perform at different performance levels depending on the architecture of the execution system. For one system, the program may execute quickly, while for a different system the same program may execute very slowly.
  • Another solution, when possible, is to execute code within a homogeneous system, such as a computing cloud that is composed of similar processing platforms.
  • the IR is a machine agnostic representation of the high-level program.
  • the IR may be optimized during compilation, to generate multiple device specific implementations of the program. For example, two machine executable versions are produced, a first version suitable for execution on a general central processing unit, and a second version suitable for execution on a specialized graphics processing unit.
  • the runtime environment selects which variation to execute depending on the architecture.
  • an apparatus adapted to generate code for an execution on an, in particular distributed, processing system comprises: an intermediate representation (IR) of a computer program; an interpreter to evaluate the intermediate representation, whereas the interpreter is configured to: receive feedback information that comprises information about the processing system; and adapt the IR based on the feedback information.
  • the apparatus performs runtime adaptation of the computer program, to change the behavior of the program based on the state and/or dynamic changes to the processing system.
  • the computer program reconfigures itself to accommodate changes in the architecture of the processing system.
  • the executing program may reconfigure itself when encountering previously unknown processing architectures.
  • the executing program may reconfigure itself for execution on a target heterogeneous system composed of different sub-architectures (e.g., different node processor architectures).
  • the IR includes a dependency dataflow graph representing a computation flow of the computer program; the graph comprising the following elements: nodes denoting one or more data operations; incoming edges denoting one or more arguments of the data operations; outgoing edges denoting one or more results of the data operations; and/or one or more rules that encode how to evaluate the IR, in particular the elements of the dataflow graph.
  • the interpreter is configured to adapt the IR in response to the feedback information, performing at least one of the operations of: adding at least one new rule to the IR, cancelling at least one pre-existing rule of the IR, and changing at least one pre-existing rule of the IR.
  • Evaluated rules may trigger adaptation of the previously existing rules, allowing for complex run-time adaptations, for example, based on recursion.
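  • As an illustrative sketch only (the names `Node`, `IR`, and `connect` are assumptions, not the patent's notation), the dataflow graph elements above — nodes as data operations, incoming edges as arguments, outgoing edges as results, with rules carried alongside the graph — can be modeled as:

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    op: str                                        # the data operation this node performs
    args: list = field(default_factory=list)       # incoming edges: argument producers
    outputs: list = field(default_factory=list)    # outgoing edges: result consumers

@dataclass
class IR:
    nodes: list = field(default_factory=list)
    rules: list = field(default_factory=list)      # rules encoding how to evaluate the graph

    def connect(self, producer, consumer):
        # an edge is outgoing for the producer and incoming for the consumer
        consumer.args.append(producer)
        producer.outputs.append(consumer)

# Build a tiny graph for mul(add(a, b), c)
a, b, c = Node("load_a"), Node("load_b"), Node("load_c")
add_op, mul_op = Node("add"), Node("mul")
ir = IR(nodes=[a, b, c, add_op, mul_op])
ir.connect(a, add_op)
ir.connect(b, add_op)
ir.connect(add_op, mul_op)
ir.connect(c, mul_op)
```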
  • the interpreter is further configured to centrally generate a partially materialized IR to be executable by a plurality of target distributed processing systems as a basis for local adaptation and local generation of a fully materialized local IR for local execution at each of a plurality of nodes in a certain distributed processing system.
  • the central IR is partially compiled based on available global information, without requiring production of a global complete program.
  • the partially compiled code is sent to each node, for local compilation and adaptation based on local node conditions.
  • each respective node of a plurality of nodes of the certain distributed processing system includes a local interpreter to evaluate a centrally generated adapted IR, whereas the interpreter is configured to: receive local feedback information that comprises local information about the certain distributed processing system; and locally adapt the centrally generated adapted IR based on the local feedback information.
  • the centrally adapted IR is locally adapted at each node, to create different versions, each version being optimized for execution at the local node.
  • the adaptation may be different at each node depending on the local node architecture and other local conditions.
  • the interpreter is further configured to provide the adapted IR to a central scheduler configured to centrally schedule the adapted IR for local execution at each respective node for a plurality of respective target architectures at each respective node.
  • a central scheduler is able to schedule processing of the adapted IR on the target node, without knowledge of the architecture and processing conditions at the target node.
  • the apparatus further comprises a local set of rules at each respective node of the plurality of nodes that encode how to evaluate the locally adapted IR, in particular the elements of a local dataflow graph of the locally adapted IR.
  • the localized set of rules is used to adapt the computer program to the local environment (e.g., at the local node), for example, based on local architecture and/or local conditions.
  • Different local processing environments may have different local sets of rules, providing customized flexibility in adapting the same computer program in different ways to each local environment.
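  • A minimal sketch of this central/local hierarchy, assuming illustrative names (`central_ir`, `LOCAL_RULES`, `materialize_locally` are not from the patent): a centrally generated, partially materialized IR is copied to each node, where a local rule set specializes it to local conditions before full materialization:

```python
import copy

def central_ir():
    # Partially materialized plan: "matmul" is left generic, to be
    # specialized later at each node; no target is fixed centrally.
    return {"ops": ["load", "matmul", "store"], "target": None}

# Each node kind carries its own local set of rules for adaptation
LOCAL_RULES = {
    "gpu-node": lambda ir: {**ir, "ops": ["load", "matmul_gpu", "store"], "target": "gpu"},
    "cpu-node": lambda ir: {**ir, "ops": ["load", "matmul_blocked", "store"], "target": "cpu"},
}

def materialize_locally(node_kind):
    ir = copy.deepcopy(central_ir())      # every node starts from the same central IR
    return LOCAL_RULES[node_kind](ir)     # local adaptation based on local conditions

gpu_ir = materialize_locally("gpu-node")
cpu_ir = materialize_locally("cpu-node")
```

The same central IR thus yields different fully materialized local versions, without the central side knowing each node's architecture.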
  • the feedback information is selected from a group consisting of: a pattern of graph topology representation of the processing system, a pattern of processing system topology of the processing system, at least one logical expression based on at least one system run-time variable of the processing system, and a pattern of at least one function and argument of the computer program.
  • the adaptation is triggered by different feedback information, providing flexibility in responding to changes in various aspects. Encounters of new situations may be dynamically handled by the adaptation. Dynamic run-time adaptation is triggered by one or more of: the DIR representation itself, architecture of the distributed processing system, run-time system variables, and the executing code.
  • adaptation of the IR is selected from a group consisting of: dynamic adaptation of a runtime graph representing computational flow of the computer program, adaptation of operations in the computer program, re-compilation of one or more portions of the computer program for optimization on a certain platform, and updating variables that trigger one or more rules.
  • Different parameters may be dynamically adapted, providing flexibility in the ability of the system to respond in different ways.
  • the optimal response may be selected.
  • the code itself may be changed, different code may be substituted, new code may be compiled for optimization based on changes to parameters, and other rules may be triggered.
  • a set of rules included in the IR are implemented as an application programming interface based on a rule-based language.
  • the set of rules is independent of the IR.
  • the set of rules is written separately from the source code used to generate the IR, for example, by different programmers.
  • the same set of rules may be applied to different computer programs.
  • the same IR of the same computer program may be adapted using different sets of rules, for example, at different organizations.
  • the interpreter is further configured to provide the adapted IR to a low-level compiler for compilation and generation of low-level code for execution within the processing system.
  • the computer program triggers a modification of itself, by updating the DIR and recompiling the DIR to generate updated computer executable code.
  • the recompiling of the updated DIR may be optimized more efficiently, resulting in optimized updated executable code.
  • the feedback information includes at least one member selected from a group consisting of: addition of new processing unit, removal of existing processing unit, failure of a process, failure of a processing unit, changes in availability of processing unit, changes in availability of processing resources, changes in input data, changes in processing complexity.
  • Adaptation of the executable code during run time is triggered by one or more scenarios that commonly occur in a distributed processing system.
  • the apparatus further comprises a database configured to store computer executable code compiled from the adapted-DIR, for re-use in future executions when similar rule evaluations occur.
  • a method for generating code for an execution on an, in particular distributed, processing system comprising: providing an intermediate representation, IR, of a computer program; receiving feedback information that comprises information about the processing system; and adapting the IR based on the feedback information.
  • a computer program product comprising a readable storage medium storing program code thereon for use by an interpreter to evaluate an intermediate representation, IR, of a computer program, the program code comprising: instructions for receiving feedback information that comprises information about an, in particular distributed, processing system that executes the computer program; and instructions for adapting the IR based on the feedback information.
  • FIG. 1 is a flowchart of a method of run-time adaptation of an intermediate representation of a computer program executed within a processing system, in accordance with some embodiments of the present invention
  • FIG. 2 is a block diagram of a system including an apparatus that performs runtime adaptation of an intermediate representation of a computer program executed within a processing system, in accordance with some embodiments of the present invention
  • FIG. 3 is a flowchart of a method of locally adapting a centrally generated intermediate representation for local execution, in accordance with some embodiments of the present invention
  • FIG. 4 is a block diagram of a system that locally adapts a centrally generated intermediate representation for local execution, in accordance with some embodiments of the present invention
  • FIG. 5 is a schematic diagram of an example of the implementation of the method of FIG. 1 by an architecture based on the system of FIG. 2, in accordance with some embodiments of the present invention
  • FIG. 6 is a schematic diagram of an example of the implementation of the method of FIG. 3 by an architecture based on the system of FIG. 4, in accordance with some embodiments of the present invention.
  • FIG. 7 is a schematic diagram depicting adaptation of an intermediate representation, in accordance with some embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to systems and methods for optimization of execution of programs and, more specifically, but not exclusively, to systems and methods for optimization of program execution within a distributed and/or heterogeneous processing system.
  • An aspect of some embodiments of the present invention relates to an interpreter module that adapts an intermediate representation (IR) of a computer program based on feedback information that comprises information about the processing system in which the computer program is executed.
  • the adaptation is performed in real-time, based on the dynamic feedback information reflecting the current state of the processing system.
  • Program execution is dynamically changed during run-time based on the adapted IR.
  • the module allows a program designed for execution on the processing system to dynamically re-configure itself in response to changes in the processing system during program execution, instead of, for example, statically defining different versions of the program in advance and selecting the version to run, which limits the program only to the predefined versions.
  • the interpreter allows the computer program to adapt itself to unexpected changes in the distributed processing system (DPS) and/or to configure itself when encountering previously unknown processing architectures.
  • the interpreter allows the same original computer program to be automatically adapted by the module to run efficiently on a wide variety of distributed processing platforms.
  • the interpreter may be implemented within a system, executed as a method, and/or stored as a computer program product, as described herein.
  • the adaptation is performed according to at least one rule of a set of rules that define the IR adaption based on the feedback information.
  • the set of rules may be defined and/or programmed separately from the source code, for example, by different programmers.
  • the set of rules are defined using a different language, optionally a customized rule language.
  • the IR combined with the set of rules is referred to herein as a dynamic intermediate representation (DIR).
  • the DIR is represented at a high level of abstraction, optionally a dependency dataflow graph, which is executable on multiple different target DPS architectures and/or target DPS compilers.
  • the DIR may be constructed based on partial (or little) knowledge of the target DPS architectures. Adaptation of the DIR to the certain target architectures is performed by the interpreter, dynamically, during runtime, based on feedback information from the certain target DPS.
  • the interpreter is organized as a hierarchy, with a central interpreter module that generates a central DIR for distribution to multiple processing nodes.
  • Each processing node includes a local interpreter module that locally adapts the central DIR based on local feedback information from the local processing system.
  • the local adaption of the DIR is performed according to a local set of rules that define the adaptation based on the local feedback information.
  • Each node may adapt the central DIR in a different manner, according to the local conditions (e.g., local architecture).
  • the DPS is a heterogeneous distributed processing system that includes different architectures and/or different low-level program implementations.
  • the heterogeneous distributed processing system is based on diversity in, for example, programming models, communication models, memory semantics, and processor architectures.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a flowchart of a method of run-time adaptation of an intermediate representation of a computer program for execution within a processing system, in accordance with some embodiments of the present invention.
  • FIG. 2 is a diagram of components of a system that includes an interpreter module that allows an IR of a computer program to be dynamically adapted during runtime to the target processing system, in response to feedback information of the target processing system.
  • the interpreter module adapts the IR to the current system state and/or to dynamic changes occurring during execution of the computer program by the DPS, for example, failure of one or more hardware components, hot-swapping of one or more components, dynamic optimization, and/or dynamic partitioning of hardware resources between multiple applications.
  • the method of FIG. 1 may be executed by the apparatus and/or system of FIG. 2.
  • the systems and/or methods described herein do not require knowledge of the target system architecture at compilation time of the IR.
  • the method performs runtime adaptation of the computer program, to change the behavior of the computer program based on the state and/or dynamic changes to the processing system.
  • the system reconfigures itself to accommodate changes in the architecture of the DPS.
  • the executing program may reconfigure itself when encountering previously unknown processing architectures.
  • the executing program may reconfigure itself for execution on a target heterogeneous system composed of different sub-architectures (e.g., different node processor architectures).
  • an intermediate representation (IR) of a computer program is received by an interpreter module 202.
  • the source code of the computer program is received by interpreter module 202.
  • the source code and/or IR may be stored on a memory 204 in communication with interpreter module 202.
  • the memory may store iterations of the adapted IR.
  • the computer program may be a whole computer program, a part of a computer program, and/or a single algorithm.
  • the computer program may be in a high-level source code format, a low-level code format suitable for execution, or pre-compiled code.
  • the computer program is designed to be executed by a processing system, optionally a distributed processing system 208, optionally a heterogeneous distributed processing system.
  • the program may solve a computational problem that cannot be solved on a single computational node due to a large volume of information that is required to be processed in order to solve the computational problem.
  • a single computational node may not have sufficient memory and processing power to solve the computational problem within a reasonable amount of time, or may not be able to handle the volume of information at all (e.g., insufficient local memory).
  • a source code of the computer program may be processed by a high-level compiler (located within apparatus 200 or external to apparatus 200) to generate the IR, for example, by parsing and/or compiling the source code.
  • the IR may be generated by reverse compilation of an existing computer program.
  • the IR is obtained from an external source.
  • the source code used to generate the IR may be written using an application programming interface of a high-level programming language.
  • the high-level programming language may be a domain specific language (DSL).
  • the DSL provides for a high level of abstraction that is not directly tied to any particular low-level implementation, allowing for multiple possible low-level implementations. Examples of DSLs include languages designed to program applications in the domains of machine learning, data query, and graph algorithms.
  • the IR may include a dependency dataflow graph representing a computational flow of the computer program.
  • the graph may include the following elements: nodes that denote data operation(s), incoming edges that denote argument(s) of the data operations, and outgoing edges that denote result(s) of the data operations.
  • the IR is machine agnostic, having the ability to be compiled for execution on different target systems.
  • a set of rules is received by interpreter 202.
  • the set of rules encode how to evaluate the IR, in particular, the elements of the dataflow graph.
  • the set of rules define a dynamic change in execution of the computer program according to the feedback information, to optimize performance of the computer program within DPS 208.
  • the set of rules may define adaptive optimization of the IR, and/or compilation rules of the IR, based on the feedback information.
  • the set of rules transform the algorithm of the computer program (represented as the IR) for optimal execution in different processing environments.
  • the set of rules are implemented as an application programming interface (API) based on a rule-based language.
  • the rule-based language is designed to express adaptation logic.
  • the rule-based language may be different than the language used for writing the source code.
  • the set of rules is independent of the IR.
  • the set of rules is written separately from the source code used to generate the IR, for example, by different programmers.
  • the same set of rules may be applied to different computer programs.
  • the same IR of the same computer program may be adapted using different sets of rules, for example, at different organizations.
  • the set of rules may be stored on memory 204 in communication with interpreter module 202.
  • each rule is divided into a predicate (which may be represented as a left hand side (LHS) by the rule language), and an associated action (which may be represented as a right hand side (RHS) by the rule language).
  • predicates include: pattern matches on a graph topology representation of the processing system, pattern matches on the processing system topology, logical expressions based on processing system run-time variables, processing system performance metrics (e.g., available memory and processor usage), and pattern matches on functions and arguments of the computer program.
  • Examples of actions for adapting the IR include: graph transformations, graph partitioning, operation substitution, operation fusion and fission, calling a third-party compiler to compile or re-compile kernels for optimization on a particular platform, and updating of variables that are associated with other predicates (which may iteratively trigger other rules).
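  • As an illustrative sketch (the names `Rule`, `apply_rules`, and the low-memory example are assumptions, not the patent's rule language), the LHS predicate / RHS action structure above can be modeled as a predicate tested against feedback information and an action that transforms the IR when the predicate holds:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    lhs: Callable[[dict], bool]   # predicate over feedback information
    rhs: Callable[[dict], dict]   # action: transform the IR

# Hypothetical rule: when free memory on the node is low, substitute an
# in-memory sort with an external (out-of-core) sort.
low_memory = Rule(
    lhs=lambda fb: fb.get("free_mem_mb", 0) < 512,
    rhs=lambda ir: {**ir, "ops": [op.replace("sort", "external_sort") for op in ir["ops"]]},
)

def apply_rules(ir, rules, feedback):
    # Evaluate each rule's predicate against the feedback; apply the
    # action of every rule that fires.
    for rule in rules:
        if rule.lhs(feedback):
            ir = rule.rhs(ir)
    return ir

ir = {"ops": ["scan", "sort", "emit"]}
adapted = apply_rules(ir, [low_memory], {"free_mem_mb": 256})
```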
  • the DIR is generated by combining the set of rules with the IR.
  • the rules may be associated with the IR, and/or mapped to the IR.
  • the DIR is a combined data structure including both the IR and the set of rules, for example, the rules are stored within each node of the IR.
  • the DIR separately includes the data structures of the IR and the set of rules. The rules are evaluated independently, and the actions of the evaluated rules are applied to the IR.
  • the DIR is executable (or may be compiled for execution) on multiple different target DPS architectures and target DPS specific compilers.
  • the DIR may be executed differently at each of different nodes within the target DPS.
  • Association of one or more rules with graph nodes allows for efficient runtime adaptation of execution of the program based on the affected graph nodes.
  • the graph based DIR may be efficiently re-optimized and/or re-compiled.
  • feedback information that comprises information about DPS 208 is received by interpreter module 202.
  • Feedback information may be obtained by a monitoring module 216 that monitors DPS 208, performing monitoring continuously, periodically, event-based, and/or in-real time.
  • Monitoring module 216 transmits the feedback information to interpreter module 202 for evaluation of the rules, as described herein.
  • the feedback information may represent, for example, the current state of and/or changes in the processing environment, the state and/or changes in the algorithm itself, and/or the state and/or changes in the input data being processed by the executing program.
  • Examples of feedback information include: addition of a new processing unit (e.g., hot swap), removal of an existing processing unit (e.g., hot swap), failure of a process, failure of a processing unit, changes in availability of a processing units and/or other resources (e.g., due to multiple users and/or changing data sets), statistical changes in varying input data type and/or size, changes in availability of processing resources, changes in processing complexity (e.g., input dependent changes).
  • the feedback information is related to one or more rules, for example, associated with one or more predicates, for example, a pattern of graph topology representation of the IR, a pattern of processing system topology of the DPS, a logical expression based on one or more system run-time variables, and a pattern of function(s) and/or argument(s) of the executing computer program.
  • the DIR is adapted in response to the feedback information.
  • the adaptation is triggered when one or more rules of the DIR are evaluated based on the feedback information.
  • the evaluated rules trigger related adaptation actions.
  • the adaptation is triggered by different parameters, providing flexibility in responding to changes in various aspects. Encounters of new situations may be dynamically handled by the adaptation. Dynamic run-time adaptation is triggered by one or more of: the DIR representation itself, architecture of the DPS, run-time system variables, and the executing code.
  • Adaptation of the executable code during run time is triggered by one or more scenarios that commonly occur in a DPS.
  • the rules and the graph components of the DIR may be adapted together, separately, or independently.
  • when the rule(s) are evaluated as true (or as fulfilling another condition, such as the predicate of the rule), the associated adaptation action is triggered.
  • the adapted version of the previous DIR version is referred to herein as adapted-DIR.
  • the adapted-DIR may be a sub-graph of the previous DIR (i.e., of the rules and/or graph), a partition of the previous DIR, an updated version of the previous DIR, a partially deleted version of the previous DIR, and/or a changed version of the previous DIR.
  • the rules are evaluated and invoked to adapt the DIR while the system is running.
  • the rules are evaluated based on the feedback information, to trigger adaptation of the same rules or other rules within the DIR, for example, adding one or more new rules to the DIR, cancelling one or more pre-existing rules within the DIR, and/or changing one or more pre-existing rules of the DIR.
  • Evaluated rules may trigger adaptation of the previously existing rules, allowing for complex run-time adaptations, for example, based on recursion.
  • the adaptation action performed on the DIR is based on the associated triggered rule, defined by the RHS action of the rule. For example, dynamic adaptation of the runtime graph representing computational flow of the computer program, adaptation of operations in the computer program, compilation or re-compilation of one or more portions of the computer program for optimization on a certain target platform, and updating variables that trigger one or more other rules.
  • the adapted-DIR is transmitted to a central scheduler 210 that schedules execution of the computer program within target DPS 208.
  • Central scheduler 210 centrally schedules the adapted-DIR for local execution at each respective processing node of DPS 208.
  • each processing node may include different target architectures.
  • the central scheduler is able to schedule processing of the adapted-DIR on the target node, without knowledge of the architecture and processing conditions at the target node.
  • Different parameters may be dynamically adapted, providing flexibility in the ability of the system to respond in different ways.
  • the optimal response may be selected.
  • the code itself may be changed, different code may be substituted, new code may be compiled for optimization based on changes to parameters, and other rules may be triggered.
  • the adapted-DIR is provided to a low-level compiler 212 for compilation and generation of low-level code for execution within target DPS 208.
  • low-level compiler 212 generates a static run-time dataflow graph from the adapted-DIR.
  • the low-level code and/or run-time graph are provided to central scheduler 210 for scheduling.
  • the DIR may trigger its own partial or complete re-compilation based on the current state of DPS 208.
  • Low-level compiler 212 may compile the adapted-DIR to a format suitable for execution on the target DPS, for example, to a target binary format, a portable type code format, or a runtime dataflow graph having nodes representing operations composed of binary or byte code.
  • Low-level compiler 212 may be an existing off-the-shelf compiler based on the high-level programming language, for example, a DSL back-end compiler that compiles the adapted-DIR (which appears to the low-level compiler in the recognizable IR format when provided without the set of rules).
  • the computer program may trigger a modification of itself, by updating the DIR and recompiling the DIR to generate updated computer executable code of the program.
  • the recompiling of the updated DIR may be optimized more efficiently, resulting in optimized updated executable code.
  • the compiled code is stored within a code repository 214 (e.g., a data-base).
  • the stored code may be re-used in future executions involving similar rule evaluations. Storing the different versions of the executable code generated by re-compilation and optional re-optimization during run-time makes the code available for future use when similar system conditions are encountered. The code may be re-used without repeating the processing steps to generate and/or compile the code, which increases system performance.
  • interpreter module 202 centrally generates a partially materialized DIR to be executable by multiple target DPS architectures.
  • the partially materialized DIR is provided to each processing node of the target DPS, as a basis for local adaptation and local generation of a fully materialized local DIR for local execution at the local processing node.
  • the central IR is partially compiled based on available global information, without requiring production of a global complete program.
  • the partially compiled code is sent to each node, for local compilation and adaptation based on local node conditions.
  • Partial, complete, or partitioned DIRs are sent to central scheduler 210 for execution scheduling.
  • FIG. 3 is a flowchart of a method of locally adapting a centrally generated IR for local execution, in accordance with some embodiments of the present invention.
  • FIG. 4 is a diagram of components of a system including a local node 400 of a target processing system (e.g., DPS 208 of FIG. 2), and a local interpreter module 402 that allows a computer program to be dynamically adapted during runtime to a local processing system 404, in response to feedback information of the local environment.
  • Local interpreter module 402 evaluates a centrally generated adapted IR of the computer program according to local feedback information.
  • Local interpreter module 402 adapts the central DIR to local dynamic changes occurring during local execution of the program by the processing node, and/or to the local system state.
  • the method of FIG. 3 may be executed by the apparatus and/or system of FIG. 4.
  • the centrally adapted IR is locally adapted at each node, to create different versions, each version being optimized for execution at the local node.
  • the adaptation may be different at each node depending on the local node architecture and other local conditions.
  • At 302 at least a portion of the centrally generated DIR (which may have been centrally adapted) is received at each local node.
  • Scheduler 210 of FIG. 2 may distribute the DIR to the local nodes.
  • the same centrally generated DIR may be received at each local node, for local adaptation.
  • different portions of the DIR are transmitted to each respective node, associated with the tasks scheduled for performance by the respective node.
  • the central DIR is first converted to a runtime graph, and the runtime graph is transmitted to each processing node.
  • the IR component of the central DIR is transmitted to each processing node, without the central set of rules component.
  • a local set of rules is received.
  • the local set of rules encode how to evaluate the locally adapted IR, in particular the elements of the local dataflow graph of the locally adapted IR.
  • Each rule(s) is associated with a respective node (e.g., stored in a memory in communication with the node).
  • the local set of rules is used to adapt the computer program to the local environment (e.g., at the local node), for example, based on local architecture and/or local conditions.
  • Different local processing environments may have different local sets of rules, providing customized flexibility in adapting the same computer program in different ways to each local environment.
  • the local set of rules may have the same format as the central set of rules, and/or be written using the same rule-based language.
  • the local set of rules is combined with the central DIR to generate a local DIR.
  • the local set of rules may be mapped to the IR component of the central DIR, may be combined with the central set of rules of the central DIR, and/or replace the central set of rules of the central DIR.
  • local feedback information is received from local processing system (LPS) 404.
  • the local feedback information comprises local information about LPS 404.
  • a local monitoring module 408 performs the monitoring of LPS 404 and transmits the local feedback information to local interpreter module 402.
  • the central DIR is adapted, based on the local feedback information, to generate a local DIR.
  • the local DIR is adapted based on the local feedback information, to generate an adapted local DIR.
  • the local-DIR or adapted-local-DIR is transmitted to a local scheduler
  • blocks 308-312 are iterated.
  • the iteration may be performed when new local feedback information is received, and/or when changes are detected from the previous local feedback information that trigger evaluation of the local rules, to generate new locally adapted-DIRs.
  • blocks 302-314 may be iterated in additional multiple hierarchical levels, for example, the local processing node may itself be a local distributed system including multiple sub-nodes.
  • blocks 108-112 are iterated.
  • the iteration may be performed when new feedback information is received, and/or when changes are detected from the previous feedback information that trigger evaluation of the rules, to generate new adapted-DIRs.
  • FIG. 5 is a schematic diagram of an example of the implementation of the method of FIG. 1 by an architecture 502 based on the system of FIG. 2, in accordance with some embodiments of the present invention.
  • Algorithm 504 is written as source code in a high-level language (e.g., a DSL) 506, for example, by a programmer.
  • the source code is compiled by a front end compiler 508 into an IR.
  • the IR is combined with algorithm specific optimization rules 512 (e.g., rules written by the programmer to optimize the program), to generate a DIR.
  • front end compiler 508 receives rules 510 and the source code as input, and directly generates the DIR (i.e., without outputting an IR that does not include the rules).
  • a DIR interpreter module 514 receives real time feedback information from a system monitor 516 that monitors the target DPS. DIR interpreter 514 evaluates the rules based on the received feedback information, to perform an action:
  • the compiled code is stored in an operation store 524 for future use.
  • the compiled code is transmitted to heterogeneous scheduler 526 for execution within the DPS. For example, when operation code (e.g., in binary format and/or byte code format) is missing for the system platform, or requires updating, re-optimization and re-scheduling are triggered.
  • FIG. 6 is a schematic diagram of an example of the implementation of the method of FIG. 3 by an architecture based on the system of FIG. 4, in accordance with some embodiments of the present invention.
  • a DIR 602 is centrally generated, as described herein.
  • DIR interpreter 604 receives real-time system information 606 as feedback information from the target DPS.
  • the rules of DIR 602 are evaluated based on the feedback information, to generate a partially materialized graph 608.
  • the feedback information may include system level details, for example, the number of available nodes.
  • Partially materialized graph 608 is transmitted to a master scheduler 610 for scheduling at local nodes 612A and 612B.
  • Local node 612A is now described. For clarity and brevity, the description for local node 612B is omitted due to similarity. Differences in elements between the nodes are described.
  • a local DIR interpreter 614A receives partially materialized graph 608. Based on local feedback information from the local processing system, local DIR interpreter 614A may convert partially materialized graph 608 to a local fully materialized graph 616A. It is noted that fully materialized graphs 616A and 616B may be different, adapted to the local conditions based on the local feedback information. Alternatively, partially materialized graph 608 is transmitted by local DIR interpreter 614A to a local compiler 620A for generation of low-level code.
  • local compilers 620A and 620B may be different, compiling the same partially materialized graph into different low-level languages suitable for execution within the local architecture. Generated code may be saved in a local operation store 622A. A local scheduler 618A schedules execution of fully materialized graph 616A and/or low-level code within devices 624A. It is noted that devices 624A and 624B may be different (i.e., different architectures).
  • FIG. 7 is a schematic diagram depicting adaptation of an intermediate representation, in accordance with some embodiments of the present invention. It is noted that the adaptation may be performed centrally, and/or locally at each processing node.
  • DIR 702 is processed by a DIR interpreter module 704.
  • DIR 702 includes an IR component, such as a graph 706, and an associated set of rules 708 component.
  • Rules 708 include one or more predicates, each of which is associated with an action. The predicates are evaluated based on real-time system information 710 (i.e., feedback information) received from the target DPS, to trigger the relevant actions.
  • Different adapted DIRs 712A-B may be generated (at the same time, or during different iterations), which may be partially or completely re-written versions of DIR 702.
  • the code is automatically adapted to a changing processing environment.
  • the interpreter module receives feedback information of the addition of a new processing node within the DPS.
  • the respective rule is triggered, to adapt the IR by re-partitioning of the IR according to the new number of processing nodes, taking into account the new node. (Blocks 112-114 are omitted for clarity).
  • the interpreter module receives feedback information of a change in input load threshold based on gathered statistics from the DPS.
  • the respective rule is triggered, to adapt the IR by re-factoring of the local graph to achieve a new partition balance.
  • a centrally generated DIR is forwarded to local nodes for local optimization and execution.
  • a DIR whose graph has been centrally partitioned to run on multiple nodes is received by one of the nodes.
  • the partitioned graph contains an operation x.
  • a local set of rules is mapped to operation x.
  • the node cannot identify an instance of operation x in the local operation store, and provides the related feedback information to the local interpreter.
  • the rule related to operation x is evaluated, to determine what to do when operation x is missing. The rule triggers a search for an instance of operation x, which is suitable for the hardware of the node.
  • An instance of operation x is found written in a high-level DSL language.
  • the rule triggers re-compilation of the source code for operation x, and the resulting low-level code is stored in the operation store for future use.
  • the generated low-level code is scheduled.
  • a local interpreter modifies existing rules to implement local optimizations.
  • a local node receives an IR partitioned for N GPUs.
  • the local node receives feedback information that the local GPUs are sometimes in use by another process.
  • the local node adds a rule to the local DIR to check the current GPU usage and to re-partition the local DIR when some of the GPUs are already in use.
  • a DIR is gradually materialized over diverse processors in a cluster.
  • a master node partitions an IR to slave nodes, without knowing what processors are available at each slave node.
  • each slave node re-evaluates and re-partitions the central IR to generate local DIRs suitable for the processors of each slave node.
  • the DIR is adapted to the addition of a previously unknown type of processor.
  • a node receives a centrally partitioned IR from the master scheduler.
  • feedback information indicative of detection of a new type of system on a chip (SOC) previously unknown to the system is provided to the local interpreter.
  • the local interpreter adds the SOC transformation logic to the local DIR, and re-interprets the local DIR.
  • the correct low-level code is generated and optimized for the new architecture.
  • the generated code is stored in the local repository (e.g., operation store) for future use.
  • the new low-level code is executed on the new architecture.
  • a sixth example refers to algorithm specific optimization rules.
  • an algorithm with a set of unique optimizations is created, with an associated set of optimization rules designed to work with the algorithm.
  • the interpreter module adds the unique algorithm rules to the existing rule base, to generate the DIR.
  • the unique rules are evaluated along with the default rules based on the feedback information. When any of the rules are triggered, the interpreter module triggers the appropriate action needed to optimize the algorithm.
  • the generated optimized instructions may be stored in a repository (e.g., operations store) for future use.
  • feedback information is provided to the interpreter module indicating that nodes A, B, and C are determined to be contiguous and each contain GPU hardware.
  • the algorithm specific optimization rule is triggered, invoking the action of fusing nodes A, B, and C into a more efficient node D. Nodes A, B, and C are replaced with node D in the graph of the DIR.
  • composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • a compound or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
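The missing-operation scenario of the third example above (operation x absent from the local operation store) can be sketched as follows. This is a hedged illustration only: the store, the DSL source table, and `compile_dsl` are invented stand-ins for the real operation store and low-level compiler described in the examples.

```python
operation_store = {}                        # local store: name -> low-level code
dsl_sources = {"x": "fn x(a) = a * 2"}      # high-level DSL source, when one is found

def compile_dsl(source: str, hardware: str) -> str:
    """Stand-in for re-compilation of DSL source for the node's hardware."""
    return f"lowlevel({source})@{hardware}"

def resolve_operation(name: str, hardware: str) -> str:
    """Rule-triggered fallback: use a stored instance, else search for
    DSL source, compile it locally, and store the result for future use."""
    if name in operation_store:             # instance already available
        return operation_store[name]
    if name not in dsl_sources:             # no instance and no source found
        raise LookupError(f"no instance or source for operation {name!r}")
    code = compile_dsl(dsl_sources[name], hardware)
    operation_store[name] = code            # store for future use
    return code
```

A second call to `resolve_operation("x", ...)` then hits the store directly, which is the re-use behavior the example describes.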


Abstract

There is provided an apparatus adapted to generate code for an execution on an, in particular distributed, processing system, comprising: an intermediate representation (IR) of a computer program; an interpreter to evaluate the intermediate representation, wherein the interpreter is configured to: receive feedback information that comprises information about the processing system; and adapt the IR based on the feedback information.

Description

SYSTEMS AND METHODS FOR EXECUTION OF ALGORITHMS ON PARALLEL HETEROGENEOUS SYSTEMS
BACKGROUND
The present invention, in some embodiments thereof, relates to systems and methods for optimization of execution of programs and, more specifically, but not exclusively, to systems and methods for optimization of program execution within distributed and/or heterogeneous processing systems.
Heterogeneous processing platforms exhibit a wide variance in scale, processor instruction set architecture (ISA), communication architecture, and memory architecture. For example, a cellular phone with a system-on-chip (SOC) consists of a combination of custom application specific integrated circuits (ASIC), a graphic processing unit (GPU) and a small dual-core central processing unit (CPU), contrasted with a homogenous processing system such as a computing cloud of 10,000 multi-core CPU servers.
The same program written in a high-level language may be compiled into machine executable code that is executed within different target heterogeneous distributed processing systems. Each target processing system may have a different architecture. As such, the same machine executable instructions may perform at different performance levels depending on the architecture of the execution system. For one system, the program may execute quickly, while for a different system the same program may execute very slowly.
One solution to the problem of improving program performance on the target heterogeneous system is manual customization of the code for each particular target hardware configuration. Such manual coding is time consuming and prone to errors.
Another solution, when possible, is to execute code within a homogenous system, such as the computing cloud that is composed of similar processing platforms.
Another solution proposed to the above problem is the generation of a graph based intermediate representation (IR) from the source code. The IR is a machine agnostic representation of the high-level program. The IR may be optimized during compilation, to generate multiple device specific implementations of the program. For example, two machine executable versions are produced, a first version suitable for execution on a general central processing unit, and a second version suitable for execution on a specialized graphics processing unit. The runtime environment selects which variation to execute depending on the architecture.
SUMMARY
It is an object of the present invention to improve the generation of code for execution in a processing system.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, an apparatus adapted to generate code for an execution on an, in particular distributed, processing system, comprises: an intermediate representation (IR) of a computer program; an interpreter to evaluate the intermediate representation, wherein the interpreter is configured to: receive feedback information that comprises information about the processing system; and adapt the IR based on the feedback information.
Knowledge of the target system architecture is not necessarily required at compilation time. The apparatus performs runtime adaptation of the computer program, to change the behavior of the program based on the state of and/or dynamic changes to the processing system. The computer program reconfigures itself to accommodate changes in the architecture of the processing system. The executing program may reconfigure itself when encountering previously unknown processing architectures. The executing program may reconfigure itself for execution on a target heterogeneous system composed of different sub-architectures (e.g., different node processor architectures).
In a first possible implementation of the apparatus according to the first aspect, the IR includes a dependency dataflow graph representing a computation flow of the computer program; the graph comprising the following elements: nodes denoting one or more data operations, in particular incoming, edges denoting one or more arguments of the data operations, in particular outgoing, edges denoting one or more results of the data operations; and/or one or more rules that encode how to evaluate the IR, in particular the elements of the dataflow graph.
Association of one or more rules with graph nodes allows for efficient runtime adaptation of execution of the program based on the affected graph nodes. The graph-based DIR may be efficiently re-optimized and/or re-compiled.

In a second possible implementation form of the apparatus according to the first implementation form of the first aspect, the interpreter is configured to adapt the IR in response to the feedback information, performing at least one of the operations of: adding at least one new rule to the IR, cancelling at least one pre-existing rule of the IR, and changing at least one pre-existing rule of the IR.
Evaluated rules may trigger adaptation of the previously existing rules, allowing for complex run-time adaptations, for example, based on recursion.
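As a concrete illustration, a decorated IR of this kind might be modeled as a dataflow graph of operation nodes plus a mutable rule set, where a triggered rule's action can add, cancel, or change rules, including the rule set it belongs to. The following Python sketch is an assumption for illustration only; the class and field names are not part of the claimed apparatus.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class OpNode:
    op: str                  # data operation, e.g. "matmul"
    args: List[str]          # incoming edges: arguments of the operation
    results: List[str]       # outgoing edges: results of the operation

@dataclass
class Rule:
    name: str
    predicate: Callable[[dict], bool]       # evaluated over feedback information
    action: Callable[["DIR", dict], None]   # RHS action that adapts the DIR

@dataclass
class DIR:
    graph: Dict[str, OpNode] = field(default_factory=dict)
    rules: Dict[str, Rule] = field(default_factory=dict)

    def add_rule(self, rule: Rule) -> None:       # add a new rule
        self.rules[rule.name] = rule

    def cancel_rule(self, name: str) -> None:     # cancel a pre-existing rule
        self.rules.pop(name, None)

    def interpret(self, feedback: dict) -> List[str]:
        """Evaluate each rule's predicate against the feedback information
        and invoke the actions of the rules that hold."""
        triggered = [r for r in list(self.rules.values())
                     if r.predicate(feedback)]
        for rule in triggered:
            rule.action(self, feedback)           # may itself change the rules
        return [r.name for r in triggered]
```

Because an action receives the DIR itself, it may call `add_rule` or `cancel_rule`, which is how evaluated rules can trigger adaptation of the pre-existing rules.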
In a third possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the interpreter is further configured to centrally generate a partially materialized IR to be executable by a plurality of target distributed processing systems as a basis for local adaptation and local generation of a fully materialized local IR for local execution at each of a plurality of nodes in a certain distributed processing system.
The central IR is partially compiled based on available global information, without requiring production of a global complete program. The partially compiled code is sent to each node, for local compilation and adaptation based on local node conditions.
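One way to picture partial materialization is a template whose globally known parameters are resolved centrally, with the remainder left as placeholders for each node to resolve from local conditions. The placeholder syntax and parameter names below are invented for this sketch:

```python
def materialize(template: dict, known: dict) -> dict:
    """Resolve any '$'-prefixed placeholder that appears in `known`;
    leave unresolved placeholders in place for a later stage."""
    out = {}
    for key, value in template.items():
        if isinstance(value, str) and value.startswith("$"):
            out[key] = known.get(value, value)
        else:
            out[key] = value
    return out

# Central stage: only global information (e.g. node count) is available.
template = {"op": "map", "partitions": "$NODES", "device": "$DEVICE"}
central = materialize(template, {"$NODES": 8})
# central == {"op": "map", "partitions": 8, "device": "$DEVICE"}

# Local stage: each node resolves the remainder from local feedback.
local = materialize(central, {"$DEVICE": "gpu"})
# local == {"op": "map", "partitions": 8, "device": "gpu"}
```

Different nodes can pass different local bindings to the second stage, yielding different fully materialized local IRs from the same central partial one.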
In a fourth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, each respective node of a plurality of nodes of the certain distributed processing system includes a local interpreter to evaluate a centrally generated adapted IR, wherein the local interpreter is configured to: receive local feedback information that comprises local information about the certain distributed processing system; and locally adapt the centrally generated adapted IR based on the local feedback information.
The centrally adapted IR is locally adapted at each node, to create different versions, each version being optimized for execution at the local node. The adaptation may be different at each node depending on the local node architecture and other local conditions.
In a fifth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the interpreter is further configured to provide the adapted IR to a central scheduler configured to centrally schedule the adapted IR for local execution at each respective node, for a plurality of respective target architectures at the respective nodes.

A central scheduler is able to schedule processing of the adapted IR on the target node, without knowledge of the architecture and processing conditions at the target node.
In a sixth possible implementation form of the apparatus according to the third, fourth or fifth implementation forms of the first aspect, the apparatus further comprises a local set of rules at each respective node of the plurality of nodes that encodes how to evaluate the locally adapted IR, in particular the elements of a local dataflow graph of the locally adapted IR.
The localized set of rules is used to adapt the computer program to the local environment (e.g., at the local node), for example, based on local architecture and/or local conditions. Different local processing environments may have different local sets of rules, providing customized flexibility in adapting the same computer program in different ways to each local environment.
In a seventh possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the feedback information is selected from a group consisting of: a pattern of graph topology representation of the processing system, a pattern of processing system topology of the processing system, at least one logical expression based on at least one system run-time variable of the processing system, and a pattern of at least one function and argument of the computer program.
The adaptation is triggered by different feedback information, providing flexibility in responding to changes in various aspects. Encounters with new situations may be handled dynamically by the adaptation. Dynamic run-time adaptation is triggered by one or more of: the DIR representation itself, the architecture of the distributed processing system, run-time system variables, and the executing code.
In an eighth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the adaptation of the IR is selected from a group consisting of: dynamic adaptation of a runtime graph representing computational flow of the computer program, adaptation of operations in the computer program, re-compilation of one or more portions of the computer program for optimization on a certain platform, and updating variables that trigger one or more rules.
Different parameters may be dynamically adapted, providing flexibility in the ability of the system to respond in different ways. The optimal response may be selected. The code itself may be changed, different code may be substituted, new code may be compiled for optimization based on changes to parameters, and other rules may be triggered.
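A minimal dispatch over these adaptation actions might look like the following sketch; the action names, state layout, and handlers are assumptions made for illustration, not a prescribed interface.

```python
def adapt(state: dict, action: str, payload: dict) -> dict:
    """Apply one adaptation action to a simplified DIR state and
    return the adapted state."""
    handlers = {
        # dynamic adaptation of the runtime graph (here: its partitioning)
        "repartition": lambda s, p: {**s, "partitions": p["n"]},
        # adaptation of an operation in the computer program
        "swap_op": lambda s, p: {**s, "ops": {**s["ops"], p["op"]: p["impl"]}},
        # mark a portion of the program for re-compilation
        "recompile": lambda s, p: {**s, "dirty": s["dirty"] | {p["op"]}},
        # update a variable that may in turn trigger other rules
        "set_var": lambda s, p: {**s, "vars": {**s["vars"], p["name"]: p["value"]}},
    }
    return handlers[action](state, payload)
```

Returning a new state rather than mutating in place keeps each rule evaluation cycle reproducible, though the real interpreter need not work that way.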
In a ninth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, a set of rules included in the IR are implemented as an application programming interface based on a rule-based language.
The set of rules is independent of the IR. The set of rules is written separately from the source code used to generate the IR, for example, by different programmers. The same set of rules may be applied to different computer programs. The same IR of the same computer program may be adapted using different sets of rules, for example, at different organizations.
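For instance, such a rule-based API might let optimization rules be registered against any program's IR without touching its source. The decorator-style registration below is purely an assumed interface for the sketch; the feedback keys are likewise invented.

```python
RULE_SET = []   # rules live apart from any particular IR or program

def rule(predicate):
    """Register the decorated action to fire when predicate(feedback) holds."""
    def register(action):
        RULE_SET.append((predicate, action))
        return action
    return register

@rule(lambda fb: fb.get("gpus_busy", 0) > 0)
def repartition_around_busy_gpus(feedback):
    free = feedback["gpus_total"] - feedback["gpus_busy"]
    return f"repartition over {free} free GPUs"

def evaluate(feedback):
    """Run every registered rule whose predicate holds for this feedback."""
    return [action(feedback) for predicate, action in RULE_SET
            if predicate(feedback)]
```

Because `RULE_SET` is built independently, the same rules could be applied to different programs, or the same program adapted with different rule sets.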
In a tenth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the interpreter is further configured to provide the adapted IR to a low-level compiler for compilation and generation of low-level code for execution within the processing system.
The computer program triggers a modification of itself, by updating the DIR and recompiling the DIR to generate updated computer executable code. The recompiling of the updated DIR may be optimized more efficiently, resulting in optimized updated executable code.
In an eleventh possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the feedback information includes at least one member selected from a group consisting of: addition of new processing unit, removal of existing processing unit, failure of a process, failure of a processing unit, changes in availability of processing unit, changes in availability of processing resources, changes in input data, changes in processing complexity.
Adaptation of the executable code during run time is triggered by one or more scenarios that commonly occur in a distributed processing system.
In a twelfth possible implementation form of the apparatus according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the apparatus further comprises a data-base configured to store computer executable code compiled from the adapted-DIR for re-use in future executions of similar rule evaluations.
Storing the different versions of the executable code generated by re-compilation and optional re-optimization during run-time makes the code available for future use when similar system conditions are encountered. The code may be re-used without repeating the processing steps to generate the code.
According to a second aspect, there is provided a method for generating code for an execution on an, in particular distributed, processing system, comprising: providing an intermediate representation, IR, of a computer program; receiving feedback information that comprises information about the processing system; and adapting the IR based on the feedback information.
According to a third aspect, there is provided a computer program product comprising a readable storage medium storing program code thereon for use by an interpreter to evaluate an intermediate representation, IR, of a computer program, the program code comprising: instructions for receiving feedback information that comprises information about an, in particular distributed, processing system that executes the computer program; and instructions for adapting the IR based on the feedback information.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced. In the drawings:
FIG. 1 is a flowchart of a method of run-time adaptation of an intermediate representation of a computer program executed within a processing system, in accordance with some embodiments of the present invention;
FIG. 2 is a block diagram of a system including an apparatus that performs runtime adaptation of an intermediate representation of a computer program executed within a processing system, in accordance with some embodiments of the present invention;
FIG. 3 is a flowchart of a method of locally adapting a centrally generated intermediate representation for local execution, in accordance with some embodiments of the present invention;
FIG. 4 is a block diagram of a system that locally adapts a centrally generated intermediate representation for local execution, in accordance with some embodiments of the present invention;
FIG. 5 is a schematic diagram of an example of the implementation of the method of FIG. 1 by an architecture based on the system of FIG. 2, in accordance with some embodiments of the present invention;
FIG. 6 is a schematic diagram of an example of the implementation of the method of FIG. 3 by an architecture based on the system of FIG. 4, in accordance with some embodiments of the present invention; and
FIG. 7 is a schematic diagram depicting adaptation of an intermediate representation, in accordance with some embodiments of the present invention.
DETAILED DESCRIPTION
The present invention, in some embodiments thereof, relates to systems and methods for optimization of execution of programs and, more specifically, but not exclusively, to systems and methods for optimization of program execution within a distributed and/or heterogeneous processing system.
An aspect of some embodiments of the present invention relates to an interpreter module that adapts an intermediate representation (IR) of a computer program based on feedback information that comprises information about the processing system in which the computer program is executed. The adaptation is performed in real-time, based on the dynamic feedback information reflecting the current state of the processing system. Program execution is dynamically changed during run-time based on the adapted IR. The module allows a program designed for execution on the processing system to dynamically re-configure itself in response to changes in the processing system during program execution, instead of, for example, statically defining different versions of the program in advance and selecting the version to run, which limits the program only to the predefined versions. The interpreter allows the computer program to adapt itself to unexpected changes in the distributed processing system (DPS) and/or to configure itself when encountering previously unknown processing architectures. The interpreter allows the same original computer program to be automatically adapted by the module to run efficiently on a wide variety of distributed processing platforms. The interpreter may be implemented within a system, executed as a method, and/or stored as a computer program product, as described herein.
Optionally, the adaptation is performed according to at least one rule of a set of rules that define the IR adaptation based on the feedback information. The set of rules may be defined and/or programmed separately from the source code, for example, by different programmers. Optionally, the set of rules is defined using a different language, optionally a customized rule language.
The set of rules in combination with the IR is referred to herein as a dynamic intermediate representation (DIR). The term DIR is sometimes interchangeable with the term IR, for example, when the dataflow graph of the DIR is adapted, the dataflow graph refers to the IR portion of the DIR.
Optionally, the DIR is represented at a high level of abstraction, optionally as a dependency dataflow graph, which is executable on multiple different target DPS architectures and/or target DPS compilers. The DIR may be constructed based on partial (or little) knowledge of the target DPS architectures. Adaptation of the DIR to a certain target architecture is performed by the interpreter, dynamically, during runtime, based on feedback information from the certain target DPS.
Optionally, the interpreter is organized as a hierarchy, with a central interpreter module that generates a central DIR for distribution to multiple processing nodes. Each processing node includes a local interpreter module that locally adapts the central DIR based on local feedback information from the local processing system. Optionally, the local adaptation of the DIR is performed according to a local set of rules that define the adaptation based on the local feedback information. Each node may adapt the central DIR in a different manner, according to the local conditions (e.g., local architecture).
Optionally, the DPS is a heterogeneous distributed processing system that includes different architectures and/or different low-level program implementations. The heterogeneous distributed processing system is based on diversity in, for example, programming models, communication models, memory semantics, and processor architectures.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 1, which is a flowchart of a method of run-time adaptation of an intermediate representation of a computer program for execution within a processing system, in accordance with some embodiments of the present invention. Reference is also made to FIG. 2, which is a diagram of components of a system that includes an interpreter module that allows an IR of a computer program to be dynamically adapted during runtime to the target processing system, in response to feedback information of the target processing system. The interpreter module adapts the IR to the current system state and/or to dynamic changes occurring during execution of the computer program by the DPS, for example, failure of one or more hardware components, hot-swapping of one or more components, dynamic optimization, and/or dynamic partitioning of hardware resources between multiple applications. The method of FIG. 1 may be executed by the apparatus and/or system of FIG. 2.
The systems and/or methods described herein do not require knowledge of the target system architecture at compilation time of the IR. The method performs runtime adaptation of the computer program, to change the behavior of the computer program based on the state of and/or dynamic changes to the processing system. The system reconfigures itself to accommodate changes in the architecture of the DPS. The executing program may reconfigure itself when encountering previously unknown processing architectures. The executing program may reconfigure itself for execution on a target heterogeneous system composed of different sub-architectures (e.g., different node processor architectures).
Optionally, at 102, an intermediate representation (IR) of a computer program is received by an interpreter module 202. Alternatively, the source code of the computer program is received by interpreter module 202. The source code and/or IR may be stored on a memory 204 in communication with interpreter module 202. The memory may store iterations of the adapted IR.
The computer program may be a whole computer program, a part of a computer program, and/or a single algorithm. The computer program may be in a high-level source code format, a low-level code format suitable for execution, or pre-compiled code.
The computer program is designed to be executed by a processing system, optionally a distributed processing system 208, optionally a heterogeneous distributed processing system. For example, the program may solve a computational problem that cannot be solved on a single computational node due to a large volume of information that is required to be processed in order to solve the computational problem. A single computational node may not have sufficient memory and processing power to solve the computational problem within a reasonable amount of time, or may not be able to handle the volume of information at all (e.g., insufficient local memory).
It is noted that a source code of the computer program may be processed by a high-level compiler (located within apparatus 200 or external to apparatus 200) to generate the IR, for example, by parsing and/or compiling the source code. Alternatively, the IR may be generated by reverse compilation of an existing computer program. Alternatively, the IR is obtained from an external source. The source code used to generate the IR may be written using an application programming interface of a high-level programming language. The high-level programming language may be a domain specific language (DSL). The DSL provides for a high-level of abstraction that is not directly tied to any particular low-level implementation, allowing for multiple possible low-level implementations. Examples of DSLs include languages designed to program applications in the domains of machine learning, data query, and graph algorithms.
The IR may include a dependency dataflow graph representing a computational flow of the computer program. The graph may include the following elements: nodes that denote data operation(s), incoming edges that denote argument(s) of the data operations, and outgoing edges that denote result(s) of the data operations. The IR is machine agnostic, having the ability to be compiled for execution on different target systems.
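As an illustration only (the class and field names below are hypothetical, not part of the specification), such a dependency dataflow graph could be sketched as:

```python
# Minimal sketch of a machine-agnostic dataflow IR. Nodes denote data
# operations; incoming edges denote arguments; outgoing edges denote
# results. All names here are illustrative.
class IRNode:
    def __init__(self, op, args=()):
        self.op = op              # the data operation, e.g. "filter"
        self.args = list(args)    # incoming edges: argument nodes
        self.outputs = []         # outgoing edges: consumer nodes
        for a in self.args:
            a.outputs.append(self)

# Build a tiny computational flow: load -> filter -> reduce
load = IRNode("load")
filt = IRNode("filter", args=[load])
red = IRNode("reduce", args=[filt])
```

Because the graph records only operations and data dependencies, with no target-specific details, the same structure can later be compiled for different target systems.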
At 104, a set of rules is received by interpreter 202. The set of rules encode how to evaluate the IR, in particular, the elements of the dataflow graph. The set of rules define a dynamic change in execution of the computer program according to the feedback information, to optimize performance of the computer program within DPS 208.
The set of rules may define adaptive optimization of the IR, and/or compilation rules of the IR, based on the feedback information. The set of rules transform the algorithm of the computer program (represented as the IR) for optimal execution in different processing environments.
Optionally, the set of rules are implemented as an application programming interface (API) based on a rule-based language. The rule-based language is designed to express adaptation logic. The rule-based language may be different than the language used for writing the source code.
The set of rules is independent of the IR. The set of rules is written separately from the source code used to generate the IR, for example, by different programmers. The same set of rules may be applied to different computer programs. The same IR of the same computer program may be adapted using different sets of rules, for example, at different organizations.
The set of rules may be stored on memory 204 in communication with interpreter module 202. Optionally, each rule is divided into a predicate (which may be represented as a left hand side (LHS) by the rule language), and an associated action (which may be represented as a right hand side (RHS) by the rule language).
Examples of predicates include: pattern matches on graph topology representation of the processing system, pattern matches on processing system topology of the processing system, logical expressions based on processing system run-time variables, processing system performance metrics (e.g., available memory, and processor usage), and pattern matches on functions and arguments of the processing system.
Examples of actions for adapting the IR (e.g., the graph representation) include: graph transformations, graph partitioning, operation substitution, operation fusion and fission, calling a third party compiler to compile or re-compile kernels for optimization on a particular platform, and updating of variables that are associated with other predicates (which may iteratively trigger other rules).
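As a hedged sketch (all names are hypothetical), a rule can be modeled as an LHS predicate over the feedback information paired with an RHS action over the IR:

```python
# Illustrative rule structure: a predicate (LHS) evaluated against
# feedback information, and an action (RHS) that adapts the IR when the
# predicate holds. The IR is modeled as a plain dict for brevity.
class Rule:
    def __init__(self, predicate, action):
        self.predicate = predicate   # LHS: feedback -> bool
        self.action = action         # RHS: (ir, feedback) -> adapted IR

    def maybe_apply(self, ir, feedback):
        if self.predicate(feedback):
            return self.action(ir, feedback), True
        return ir, False

# Example of operation substitution: replace a CPU sort with a GPU sort
# when the feedback reports an available GPU.
use_gpu_sort = Rule(
    predicate=lambda fb: fb.get("gpu_available", False),
    action=lambda ir, fb: {**ir, "sort_op": "gpu_sort"},
)

ir = {"sort_op": "cpu_sort"}
adapted, fired = use_gpu_sort.maybe_apply(ir, {"gpu_available": True})
```

Separating the predicate from the action keeps rule evaluation independent of the IR itself, matching the description above.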
Optionally, at 106, the DIR is generated by combining the set of rules with the IR. The rules may be associated with the IR, and/or mapped to the IR. Optionally, the DIR is a combined data structure including both the IR and the set of rules, for example, the rules are stored within each node of the IR. Alternatively, the DIR separately includes the data structures of the IR and the set of rules. The rules are evaluated independently, and the actions of the evaluated rules are applied to the IR.
The DIR is executable (or may be compiled for execution) on multiple different target DPS architectures and target DPS specific compilers. The DIR may be executed differently at each of different nodes within the target DPS.
Association of one or more rules with graph nodes allows for efficient runtime adaptation of execution of the program based on the affected graph nodes. The graph based DIR may be efficiently re-optimized and/or re-compiled.
At 108, feedback information that comprises information about DPS 208 is received by interpreter module 202. Feedback information may be obtained by a monitoring module 216 that monitors DPS 208, performing the monitoring continuously, periodically, on an event basis, and/or in real time. Monitoring module 216 transmits the feedback information to interpreter module 202 for evaluation of the rules, as described herein.
The feedback information may represent, for example, the current state of and/or changes in the processing environment, the state and/or changes in the algorithm itself, and/or the state and/or changes in the input data being processed by the executing program.
Examples of feedback information include: addition of a new processing unit (e.g., hot swap), removal of an existing processing unit (e.g., hot swap), failure of a process, failure of a processing unit, changes in availability of processing units and/or other resources (e.g., due to multiple users and/or changing data sets), statistical changes in varying input data type and/or size, changes in availability of processing resources, and changes in processing complexity (e.g., input dependent changes).
Optionally, the feedback information is related to one or more rules, for example, associated with one or more predicates, for example, a pattern of graph topology representation of the IR, a pattern of processing system topology of the DPS, a logical expression based on one or more system run-time variables, and a pattern of function(s) and/or argument(s) of the executing computer program.
At 110, the DIR is adapted in response to the feedback information. The adaptation is triggered when one or more rules of the DIR are evaluated based on the feedback information. The evaluated rules trigger related adaptation actions.
The adaptation is triggered by different parameters, providing flexibility in responding to changes in various aspects. Newly encountered situations may be handled dynamically by the adaptation. Dynamic run-time adaptation is triggered by one or more of: the DIR representation itself, the architecture of the DPS, run-time system variables, and the executing code.
Adaptation of the executable code during run time is triggered by one or more scenarios that commonly occur in a DPS.
The rules and the graph components of the DIR may be adapted together, separately, or independently. Optionally, when a rule is evaluated as true (or as fulfilling another condition, such as the predicate of the rule), the associated adaptation action is triggered. The adapted version of the previous DIR is referred to herein as the adapted-DIR. The adapted-DIR may be a sub-graph of the previous DIR (i.e., of the rules and/or graph), a partition of the previous DIR, an updated version of the previous DIR, a partially deleted version of the previous DIR, and/or a changed version of the previous DIR.
Optionally, the rules are evaluated and invoked to adapt the DIR while the system is running. Optionally, the rules are evaluated based on the feedback information, to trigger adaptation of the same rules or other rules within the DIR, for example, adding one or more new rules to the DIR, cancelling one or more pre-existing rules within the DIR, and/or changing one or more pre-existing rules of the DIR. Evaluated rules may trigger adaptation of the previously existing rules, allowing for complex run-time adaptations, for example, based on recursion.
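The evaluation loop described above, including rules that add further rules, can be sketched as follows; the names are hypothetical, and rules are modeled as (predicate, action) pairs where the action returns the adapted IR plus any newly added rules:

```python
# Illustrative interpreter step: evaluate each rule against the feedback;
# a fired action returns an adapted IR and a list of new rules, so rules
# can trigger adaptation of the rule set itself.
def interpret_step(ir, rules, feedback):
    next_rules = list(rules)
    for predicate, action in rules:
        if predicate(feedback):
            ir, added_rules = action(ir, feedback)
            next_rules.extend(added_rules)
    return ir, next_rules

# A rule that, once low memory is reported, installs a second rule that
# halves the partition size on subsequent steps.
halver = (lambda fb: True,
          lambda ir, fb: ({**ir, "part": ir["part"] // 2}, []))
installer = (lambda fb: fb.get("low_memory", False),
             lambda ir, fb: (ir, [halver]))

ir, rules = {"part": 8}, [installer]
ir, rules = interpret_step(ir, rules, {"low_memory": True})   # installs halver
ir, rules = interpret_step(ir, rules, {"low_memory": False})  # halver fires
```

This is the mechanism by which evaluated rules may add, cancel, or change other rules, allowing complex run-time adaptations such as recursion.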
The adaptation action performed on the DIR is based on the associated triggered rule, defined by the RHS action of the rule. For example, dynamic adaptation of the runtime graph representing computational flow of the computer program, adaptation of operations in the computer program, compilation or re-compilation of one or more portions of the computer program for optimization on a certain target platform, and updating variables that trigger one or more other rules.
At 112, the adapted-DIR is transmitted to a central scheduler 210 that schedules execution of the computer program within target DPS 208. Central scheduler 210 centrally schedules the adapted-DIR for local execution at each respective processing node of DPS 208. When DPS 208 is a heterogeneous system, each processing node may include different target architectures. The central scheduler is able to schedule processing of the adapted-DIR on the target node, without knowledge of the architecture and processing conditions at the target node.
Different parameters may be dynamically adapted, providing flexibility in the ability of the system to respond in different ways. The optimal response may be selected. The code itself may be changed, different code may be substituted, new code may be compiled for optimization based on changes to parameters, and other rules may be triggered.
Optionally, the adapted-DIR is provided to a low-level compiler 212 for compilation and generation of low-level code for execution within target DPS 208. Alternatively or additionally, low-level compiler 212 generates a static run-time dataflow graph from the adapted-DIR. The low-level code and/or run-time graph are provided to central scheduler 210 for scheduling. In this manner, the DIR may trigger its own partial or complete re-compilation based on the current state of DPS 208.
Low-level compiler 212 may compile the adapted-DIR to a format suitable for execution on the target DPS, for example, to a target binary format, a portable byte code format, or a runtime dataflow graph having nodes representing operations composed of binary or byte code. Low-level compiler 212 may be an existing off-the-shelf compiler based on the high-level programming language, for example, a DSL back-end compiler that compiles the adapted-DIR (which appears to the low-level compiler in the recognizable IR format when provided without the set of rules).
The computer program may trigger a modification of itself, by updating the DIR and recompiling the DIR to generate updated computer executable code of the program. The recompiling of the updated DIR may be optimized more efficiently, resulting in optimized updated executable code.
Optionally, the compiled code is stored within a code repository 214 (e.g., a database). The stored code may be re-used in future executions when similar rule evaluations occur. Storing the different versions of the executable code generated by re-compilation and optional re-optimization during run-time makes the code available for future use when similar system conditions are encountered. The code may be re-used without repeating the processing steps to generate and/or compile the code, which increases system performance.
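A minimal sketch of such a code repository, assuming compiled code is keyed by the operation and target platform (the names and the keying scheme are illustrative assumptions):

```python
# Illustrative operation store: compiled code is cached by
# (operation, platform) and re-used instead of re-compiling when the
# same conditions recur.
class OperationStore:
    def __init__(self, compile_fn):
        self._cache = {}
        self._compile = compile_fn   # e.g., a low-level compiler wrapper

    def get(self, op, platform):
        key = (op, platform)
        if key not in self._cache:   # compile only on a cache miss
            self._cache[key] = self._compile(op, platform)
        return self._cache[key]

compilations = []
store = OperationStore(
    lambda op, plat: compilations.append((op, plat)) or op + "@" + plat)
store.get("sort", "gpu")
store.get("sort", "gpu")   # served from the cache; no second compilation
```

The single entry in `compilations` after two `get` calls reflects the re-use described above: the processing steps to generate and compile the code are not repeated.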
Optionally, interpreter module 202 centrally generates a partially materialized DIR to be executable by multiple target DPS architectures. The partially materialized DIR is provided to each processing node of the target DPS, as a basis for local adaptation and local generation of a fully materialized local DIR for local execution at the local processing node. The central IR is partially compiled based on available global information, without requiring production of a global complete program. The partially compiled code is sent to each node, for local compilation and adaptation based on local node conditions.
Partial, complete, or partitioned DIRs are sent to central scheduler 210 for execution scheduling.
Reference is now made to FIG. 3, which is a flowchart of a method of locally adapting a centrally generated IR for local execution, in accordance with some embodiments of the present invention. Reference is also made to FIG. 4, which is a diagram of components of a system including a local node 400 of a target processing system (e.g., DPS 208 of FIG. 2), and a local interpreter module 402 that allows a computer program to be dynamically adapted during runtime to a local processing system 404, in response to feedback information of the local environment. Local interpreter module 402 evaluates a centrally generated adapted IR of the computer program according to local feedback information. Local interpreter module 402 adapts the central DIR to local dynamic changes occurring during local execution of the program by the processing node, and/or to the local system state. The method of FIG. 3 may be executed by the apparatus and/or system of FIG. 4.
The centrally adapted IR is locally adapted at each node, to create different versions, each version being optimized for execution at the local node. The adaptation may be different at each node depending on the local node architecture and other local conditions.
At 302, at least a portion of the centrally generated DIR (which may have been centrally adapted) is received at each local node. Scheduler 210 of FIG. 2 may distribute the DIR to the local nodes. The same centrally generated DIR may be received at each local node, for local adaptation. Alternatively, different portions of the DIR are transmitted to each respective node, associated with the tasks scheduled for performance by the respective node. Alternatively, the central DIR is first converted to a runtime graph, and the runtime graph is transmitted to each processing node. Alternatively, the IR component of the central DIR is transmitted to each processing node, without the central set of rules component.
At 304, a local set of rules is received. The local set of rules encode how to evaluate the locally adapted IR, in particular the elements of the local dataflow graph of the locally adapted IR. Each rule(s) is associated with a respective node (e.g., stored in a memory in communication with the node).
The local set of rules is used to adapt the computer program to the local environment (e.g., at the local node), for example, based on local architecture and/or local conditions. Different local processing environments may have different local sets of rules, providing customized flexibility in adapting the same computer program in different ways to each local environment.
The local set of rules may have the same format, and/or be written using the same (or similar) rule-based language described with reference to the central set of rules.
At 306, the local set of rules is combined with the central DIR to generate a local DIR. The local set of rules may be mapped to the IR component of the central DIR, may be combined with the central set of rules of the central DIR, and/or replace the central set of rules of the central DIR.
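A sketch of combining a local rule set with the central DIR (all names are hypothetical); the local rules may extend the central rules or replace them entirely:

```python
# Illustrative construction of a local DIR from the central DIR and a
# local rule set: the local rules are either appended to or substituted
# for the central rules, while the IR component is kept unchanged.
def make_local_dir(central_dir, local_rules, replace_central=False):
    if replace_central:
        rules = list(local_rules)
    else:
        rules = list(central_dir["rules"]) + list(local_rules)
    return {"ir": central_dir["ir"], "rules": rules}

central = {"ir": "graph", "rules": ["central_rule"]}
combined = make_local_dir(central, ["local_rule"])
replaced = make_local_dir(central, ["local_rule"], replace_central=True)
```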
At 308, local feedback information is received from local processing system (LPS) 404. The local feedback information comprises local information about LPS 404.
Optionally, a local monitoring module 408 performs the monitoring of LPS 404 and transmits the local feedback information to local interpreter module 402.

At 310, the central DIR is adapted, based on the local feedback information, to generate a local DIR. Alternatively, when a local DIR has already been generated, the local DIR is adapted based on the local feedback information, to generate an adapted local DIR.
At 312, the local-DIR or adapted-local-DIR is transmitted to a local scheduler 406 to schedule execution within LPS 404.
At 314, blocks 308-312 are iterated. The iteration may be performed when new local feedback information is received, and/or when changes are detected from the previous local feedback information that trigger evaluation of the local rules, to generate new locally adapted-DIRs.
It is noted that blocks 302-314 may be iterated in additional multiple hierarchical levels, for example, the local processing node may itself be a local distributed system including multiple sub-nodes.
Referring now back to FIG. 1, at 114, blocks 108-112 are iterated. The iteration may be performed when new feedback information is received, and/or when changes are detected from the previous feedback information that trigger evaluation of the rules, to generate new adapted-DIRs.
Reference is now made to FIG. 5, which is a schematic diagram of an example of the implementation of the method of FIG. 1 by an architecture 502 based on the system of FIG. 2, in accordance with some embodiments of the present invention.
Algorithm 504 is written as source code in a high-level language (e.g., a DSL) 506, for example, by a programmer. Optionally, the source code is compiled by a front end compiler 508 into an IR. The IR is combined with algorithm specific optimization rules 510 (e.g., rules written by the programmer to optimize the program), to generate a DIR 512. Alternatively, front end compiler 508 receives rules 510 and the source code as input, and generates the DIR directly (i.e., without outputting an IR that does not include the rules).
A DIR interpreter module 514 receives real time feedback information from a system monitor 516 that monitors the target DPS. DIR interpreter 514 evaluates the rules based on the received feedback information, to perform an action:
* A re-write of DIR 512.
* Generation of a runtime graph 518, which is transmitted to a heterogeneous scheduler 526 for execution at the target DPS.
* Optimization of the DIR by an optimizer module 520, and compilation by a back end compiler 522 (i.e., low-level compiler). The compiled code is stored in an operation store 524 for future use. The compiled code is transmitted to heterogeneous scheduler 526 for execution within DPS. For example, when operation code (e.g., in binary format and/or byte code format) is missing for the system platform, or requires updating, the re-optimization and re-scheduling is triggered.
Reference is now made to FIG. 6, which is a schematic diagram of an example of the implementation of the method of FIG. 3 by an architecture based on the system of FIG. 4, in accordance with some embodiments of the present invention.
A DIR 602 is centrally generated, as described herein. DIR interpreter 604 receives real-time system information 606 as feedback information from the target DPS. The rules of DIR 602 are evaluated based on the feedback information, to generate a partially materialized graph 608. The feedback information may include system level details, for example, the number of available nodes. Partially materialized graph 608 is transmitted to a master scheduler 610 for scheduling at local nodes 612A and 612B.
Local node 612A is now described. For clarity and brevity, the description for local node 612B is omitted due to similarity; differences in elements between the nodes are noted. A local DIR interpreter 614A receives partially materialized graph 608. Based on local feedback information from the local processing system, local DIR interpreter 614A may convert partially materialized graph 608 to a local fully materialized graph 616A. It is noted that fully materialized graphs 616A and 616B may be different, adapted to the local conditions based on the local feedback information. Alternatively, partially materialized graph 608 is transmitted by local DIR interpreter 614A to a local compiler 620A for generation of low-level code. It is noted that local compilers 620A and 620B may be different, compiling the same partially materialized graph into different low-level languages suitable for execution within the local architecture. Generated code may be saved in a local operation store 622A. A local scheduler 618A schedules execution of fully materialized graph 616A and/or low-level code within devices 624A. It is noted that devices 624A and 624B may be different (i.e., different architectures).
Reference is now made to FIG. 7, which is a schematic diagram depicting adaptation of an intermediate representation, in accordance with some embodiments of the present invention. It is noted that the adaptation may be performed centrally, and/or locally at each processing node.
DIR 702 is processed by a DIR interpreter module 704. DIR 702 includes an IR component, such as a graph 706, and an associated set of rules 708 component. Rules 708 include one or more predicates, each of which is associated with an action. The predicates are evaluated based on real-time system information 710 (i.e., feedback information) received from the target DPS, to trigger the relevant actions. Different adapted DIRs 712A-B may be generated (at the same time, or during different iterations), which may be partially or completely re-written versions of DIR 702.
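The predicate/action structure of rules 708 can be sketched in a few lines. This is a schematic illustration under assumed names (`Rule`, `Dir`, `interpret`); the patent does not prescribe a concrete encoding:

```python
# Hypothetical sketch of the rule structure described above: each rule pairs
# a predicate over the feedback information with an action that rewrites the
# graph component of the DIR. All names are illustrative.

class Rule:
    def __init__(self, predicate, action):
        self.predicate = predicate  # feedback dict -> bool
        self.action = action        # (graph, feedback) -> adapted graph

class Dir:
    """A DIR: an IR graph plus the rules that encode how to evaluate it."""
    def __init__(self, graph, rules):
        self.graph = graph
        self.rules = list(rules)

def interpret(dir_, feedback):
    """Evaluate every predicate against feedback; fire matching actions."""
    graph = dir_.graph
    for rule in dir_.rules:
        if rule.predicate(feedback):
            graph = rule.action(graph, feedback)
    return Dir(graph, dir_.rules)  # an adapted DIR, e.g. 712A or 712B
```

Different feedback dictionaries produce different adapted DIRs from the same input, matching the figure's depiction of 712A-B generated from a single DIR 702.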
Examples of applying the systems and/or methods described herein to common scenarios are now described.
In a first example, the code is automatically adapted to a changing processing environment. With reference to FIG. 1 (assuming an existing IR), at 108, the interpreter module receives feedback information of the addition of a new processing node within the DPS. At 110, the respective rule is triggered, to adapt the IR by re-partitioning the IR according to the new number of processing nodes, taking into account the new node. (Blocks 112-114 are omitted for clarity). In another related example, at 108, the interpreter module receives feedback information of a change in input load threshold based on gathered statistics from the DPS. At 110, the respective rule is triggered, to adapt the IR by re-factoring the local graph to achieve a new partition balance.
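The re-partitioning action in the first example can be sketched as a simple redistribution of graph operations over the updated node count. Round-robin assignment is an assumption made for the sketch; the patent leaves the partitioning policy open:

```python
# Minimal illustration of the re-partitioning action: when feedback reports
# a changed number of processing nodes, the graph's operations are re-divided
# over the new count. Round-robin assignment is an assumed policy.

def repartition(ops, num_nodes):
    """Assign each operation to a node, round-robin."""
    parts = [[] for _ in range(num_nodes)]
    for i, op in enumerate(ops):
        parts[i % num_nodes].append(op)
    return parts
```

When the feedback reports that a node has joined, calling `repartition` with the incremented count yields the new balance the triggered rule would install.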
In a second example, a centrally generated DIR is forwarded to local nodes for local optimization and execution. With reference to FIG. 3, at 302, a DIR whose graph has been centrally partitioned to run on multiple nodes is received by one of the nodes. The partitioned graph contains an operation x. At 304 and 306, a local set of rules is mapped to operation x. At 308, the node cannot identify an instance of operation x in the local operation store, and provides the related feedback information to the local interpreter. At 310, the rule related to operation x is evaluated, to determine what to do when operation x is missing. The rule triggers a search for an instance of operation x which is suitable for the hardware of the node. An instance of operation x is found written in a high-level DSL. The rule triggers re-compilation of the source code for operation x, and the resulting low-level code is stored in the operation store for future use. At 312, the generated low-level code is scheduled.
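The missing-operation rule in this example can be sketched as a lookup-search-compile sequence. The instance registry and `compile_dsl` are hypothetical stand-ins introduced for the illustration:

```python
# Sketch of the missing-operation rule: when no compiled instance of an
# operation exists in the local operation store, search the known instances
# for one matching the node's hardware, compile its high-level DSL source,
# and cache the result for future use. All names are assumptions.

def resolve_operation(op, hardware, op_store, instances, compile_dsl):
    if op in op_store:                      # fast path: already compiled
        return op_store[op]
    for inst in instances.get(op, []):      # rule-triggered search
        if inst["target"] == hardware:
            code = compile_dsl(inst["source"])
            op_store[op] = code             # stored for future use
            return code
    raise LookupError("no instance of %s for %s" % (op, hardware))
```

On a second request the stored code is returned directly, so re-compilation happens at most once per operation and hardware target.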
In a third example, a local interpreter modifies existing rules to implement local optimizations. With reference to FIG. 3, at 302, a local node receives an IR partitioned for N GPUs. At 308, the local node receives feedback information that the local GPUs are sometimes in use by another process. At 310, the local node adds a rule to the local DIR to check the current GPU usage and to re-partition the local DIR when some of the GPUs are already in use. In a fourth example, a DIR is gradually materialized over diverse processors in a cluster. With reference to FIG. 1, a master node partitions an IR to slave nodes, without knowing what processors are available at each slave node. With reference to FIG. 3, each slave node re-evaluates and re-partitions the central IR to generate local DIRs suitable for the processors of each slave node.
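The third example's runtime rule addition can be sketched as appending a new (predicate, action) pair to the local rule set. The rule shapes and names here are illustrative assumptions:

```python
# Sketch of the third example: the local interpreter extends its rule set at
# runtime with a rule that checks current GPU usage and re-partitions onto
# the free GPUs when some are busy. Rules are (predicate, action) pairs;
# the encodings are assumptions made for the sketch.

def add_gpu_usage_rule(rules, total_gpus):
    def some_gpus_busy(feedback):
        return len(feedback.get("busy_gpus", [])) > 0

    def repartition_to_free(graph, feedback):
        free = [g for g in range(total_gpus)
                if g not in feedback.get("busy_gpus", [])]
        return dict(graph, gpus=free)   # re-partition onto free GPUs only

    rules.append((some_gpus_busy, repartition_to_free))
    return rules
```

When later feedback reports busy GPUs, the added predicate fires and the action restricts the partition to the remaining devices.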
In a fifth example, the DIR is adapted to the addition of a previously unknown type of processor. With reference to FIG. 3, at 302, a node receives a centrally partitioned IR from the master scheduler. At 308, feedback information indicative of detection of a new type of system on a chip (SOC), previously unknown to the system, is provided to the local interpreter. At 310, the local interpreter adds the SOC transformation logic to the local DIR, and re-interprets the local DIR. The correct low-level code is generated and optimized for the new architecture. The generated code is stored in the local repository (e.g., operation store) for future use. At 312, the new low-level code is executed on the new architecture.
A sixth example refers to algorithm-specific optimization rules. With reference to FIG. 1, at 102 and 104, an algorithm is created with an associated set of unique optimization rules designed to work with the algorithm. At 106, the interpreter module adds the unique algorithm rules to the existing rule base, to generate the DIR. At 110, the unique rules are evaluated along with the default rules based on the feedback information. When any of the rules are triggered, the interpreter module triggers the appropriate action needed to optimize the algorithm. The generated optimized instructions may be stored in a repository (e.g., operation store) for future use. In another related example, with reference to FIG. 1, at 108, feedback information is provided to the interpreter module indicating that nodes A, B, and C are determined to be contiguous and each contain GPU hardware. At 110, the algorithm-specific optimization rule is triggered, invoking the action of fusing nodes A, B, and C into a more efficient node D. Nodes A, B, and C are replaced with node D in the graph of the DIR.
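The fusion action in the sixth example can be sketched as replacing a run of graph nodes with one fused node. The graph encoding (an ordered list of named nodes) is an assumption made for the sketch:

```python
# Sketch of the fusion action: graph nodes A, B, and C, each backed by GPU
# hardware, are replaced by a single, more efficient fused node D. The
# list-of-names graph encoding is illustrative only.

def fuse_nodes(graph, to_fuse, fused_name):
    """Replace the run of nodes in to_fuse with one fused node."""
    out, inserted = [], False
    for node in graph:
        if node in to_fuse:
            if not inserted:        # emit the fused node once, in place
                out.append(fused_name)
                inserted = True
        else:
            out.append(node)
    return out
```

Applied to a graph containing A, B, and C, the action yields a graph in which the three nodes appear as the single node D, as the example describes.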
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant systems and methods will be developed and the scope of the terms intermediate representation, feedback information, and interpreter are intended to include all such new technologies a priori.
As used herein the term "about" refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of".
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:
1. An apparatus adapted to generate code for an execution on an, in particular distributed, processing system, comprising:
an intermediate representation, IR, of a computer program;
an interpreter to evaluate the intermediate representation, wherein the interpreter is configured to:
receive feedback information that comprises information about the processing system; and
adapt the IR based on the feedback information.
2. The apparatus of the previous claim, wherein the IR includes:
a dependency dataflow graph representing a computation flow of the computer program; the graph comprising the following elements:
nodes denoting one or more data operations,
in particular incoming, edges denoting one or more arguments of the data operations,
in particular outgoing, edges denoting one or more results of the data operations; and/or
one or more rules that encode how to evaluate the IR, in particular the elements of the dataflow graph.
3. The apparatus of the previous claim, wherein the interpreter is configured to adapt the IR in response to the feedback information, performing at least one of the operations of:
adding at least one new rule to the IR,
cancelling at least one pre-existing rule of the IR, and
changing at least one pre-existing rule of the IR.
4. The apparatus of any of the previous claims, wherein the interpreter is further configured to centrally generate a partially materialized IR to be executable by a plurality of target distributed processing systems as a basis for local adaptation and local generation of a fully materialized local IR for local execution at each of a plurality of nodes in a certain distributed processing system.
5. The apparatus of any of the previous claims, wherein each respective node of a plurality of nodes of the certain distributed processing system includes a local interpreter to evaluate a centrally generated adapted IR, wherein the local interpreter is configured to:
receive local feedback information that comprises local information about the certain distributed processing system; and
locally adapt the centrally generated adapted IR based on the local feedback information.
6. The apparatus of any of the previous claims, wherein the interpreter is further configured to provide the adapted IR to a central scheduler configured to centrally schedule the adapted IR for local execution at each respective node for a plurality of respective target architectures at each respective node.
7. The apparatus of any one of claims 4-6, further comprising a local set of rules at each respective node of the plurality of nodes that encode how to evaluate the locally adapted IR, in particular the elements of a local dataflow graph of the locally adapted IR.
8. The apparatus of any of the previous claims, wherein the feedback information is selected from a group consisting of: a pattern of graph topology representation of the processing system, a pattern of processing system topology of the processing system, at least one logical expression based on at least one system run-time variable of the processing system, and a pattern of at least one function and argument of the computer program.
9. The apparatus of any of the previous claims, wherein adapt the IR is selected from a group consisting of: dynamic adaptation of a runtime graph representing computational flow of the computer program, adaptation of operations in the computer program, re-compilation of one or more portions of the computer program for optimization on a certain platform, and updating variables that trigger one or more rules.
10. The apparatus of any of the previous claims, wherein a set of rules included in the IR is implemented as an application programming interface based on a rule-based language.
11. The apparatus of any of the previous claims, wherein the interpreter is further configured to provide the adapted IR to a low-level compiler for compilation and generation of low- level code for execution within the processing system.
12. The apparatus of any of the previous claims, wherein the feedback information includes at least one member selected from a group consisting of: addition of new processing unit, removal of existing processing unit, failure of a process, failure of a processing unit, changes in availability of processing unit, changes in availability of processing resources, changes in input data, changes in processing complexity.
13. The apparatus of any of the previous claims, further comprising a data-base configured to store computer executable code compiled from the adapted DIR for reuse in future executions involving similar rule-set evaluations.
14. A method for generating code for an execution on an, in particular distributed, processing system, comprising:
providing an intermediate representation, IR, of a computer program;
receiving feedback information that comprises information about the processing system; and
adapting the IR based on the feedback information.
15. A computer program product comprising a readable storage medium storing program code thereon for use by an interpreter to evaluate an intermediate representation, IR, of a computer program, the program code comprising:
instructions for receiving feedback information that comprises information about an, in particular distributed, processing system that executes the computer program; and
instructions for adapting the IR based on the feedback information.