WO2020169182A1 - Method and apparatus for allocating tasks - Google Patents

Method and apparatus for allocating tasks

Info

Publication number
WO2020169182A1
Authority
WO
WIPO (PCT)
Prior art keywords
task
algorithm
hardware
performance value
hardware unit
Prior art date
Application number
PCT/EP2019/054078
Other languages
French (fr)
Inventor
Faycal AIT AOUDIA
Andrea ENRICI
Original Assignee
Nokia Solutions And Networks Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Solutions And Networks Oy filed Critical Nokia Solutions And Networks Oy
Priority to PCT/EP2019/054078 priority Critical patent/WO2020169182A1/en
Publication of WO2020169182A1 publication Critical patent/WO2020169182A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Computing arrangements based on biological models using neural network models
    • G06N 3/04 Architectures, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computers simulating life
    • G06N 3/006 Artificial life, i.e. computers simulating life based on simulated virtual individual or collective life forms, e.g. single "avatar", social simulations, virtual worlds or particle swarm optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Computing arrangements based on biological models using neural network models
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/005 Probabilistic networks

Abstract

An apparatus (100) and method (500) for allocating tasks (v) in a program to hardware units (u) in a target platform is described. A state (s_t) is input into a system (110) that includes an algorithm (120) having at least some trainable weights. The state (s_t) comprises a representation of the program (G) and hardware resource data (H) indicating an amount of resource types available to each hardware unit on the target platform (P). Performance value estimates (q_i) for allocations (a_i) of tasks (v_i) to hardware units (u_i) are determined from the state (s_t) by the algorithm (120) having at least some trainable weights. A task is allocated to a hardware unit according to the determined performance value estimates.

Description

METHOD AND APPARATUS FOR ALLOCATING TASKS Technical Field
[001] Various example embodiments relate generally to methods and apparatus for allocating tasks to a target platform’s hardware.
Background
[002] To cope with the design complexity of modern embedded systems, a new design methodology has emerged called Model Driven Engineering (MDE). MDE aims at raising the level of abstraction of a design process to improve design productivity. MDE advocates the use of abstractions in models that become the primary artifact used to develop a design, instead of programs written in traditional programming languages such as C/C++ or VHDL/Verilog. The models capture the behavior of platform components (e.g., processors in a Multi-Processor System-on-Chip, MPSoC), applications (e.g., signal, image and video processing) as well as their interactions at a high level of abstraction. As such, high-level models minimize the development effort and are optimized for execution and simulation speed.
[003] When compiling these models into software implementations, a compiler must map intermediate representations (e.g., a graph that represents data and control flow dependencies among tasks) to the available resources of a target platform (e.g., CPU’s registers in case of a single-processor, RAM memory banks in case of a multi-processor). When compiling models for multi-processor systems, a large variety of different design alternatives are explored, such as the number and type of processors deployed in the final platform architecture, the type of interconnection network used to connect system components (e.g., bus, crossbar, Network-on-Chip), or the spatial binding (i.e., mapping) and temporal binding (i.e., scheduling) of application tasks to processor cores.
[004] The mapping process can be challenging for several reasons. First, the number of design points can be very high: for instance, ten cores or processors each with ten possible configurations yields 10^10 different design points (10 billion). Secondly, multiple optimization objectives or design criteria such as performance, power/energy consumption, reliability, throughput, latency and engineering cost may need to be considered simultaneously. Since these multiple optimization objectives are often in conflict, there cannot be a single optimal solution that simultaneously optimizes all objectives. Thirdly, depending on the target platform’s architecture and application, as well as on the computing resources available for simulation, performance evaluation for a single design point might take minutes, hours, or even days.
[005] The mapping process often requires that different alternative solutions be explored in order to select the one(s) that best match a set of given requirements (e.g., throughput, latency, power consumption) that are specified in the form of compilation options. The spectrum of possible implementation alternatives is called the design space, and its inspection is called Design Space Exploration (DSE). Traditional DSE techniques requiring the simulation of large portions of the design space, such as Genetic Algorithms or Simulated Annealing, are likely to be too slow to be practical.
Summary
[006] A method and apparatus are disclosed herein which, once trained, may be able to map a given program’s tasks to a target platform to achieve high performance with respect to a predetermined performance metric without having to simulate large portions of the design space.
[007] In one embodiment, an apparatus comprises means configured to perform: inputting a state into a system, the system including an algorithm having at least some trainable weights. The state comprises a representation of a program and hardware resource data for a target platform including a plurality of hardware units. The representation of a program comprises a plurality of tasks and for each task a task-type indicator. The hardware resource data comprises, for each hardware unit of the target platform, resource data indicating an amount of each of a plurality of resource types available to that hardware unit. The means are further configured to perform determining a plurality of performance value estimates from the algorithm having at least some trainable weights, and allocating, using the system, a task to a hardware unit according to the determined performance value estimates.
[008] In one embodiment, a method comprises inputting a state into a system including an algorithm having at least some trainable weights, the state comprising a representation of a program and hardware resource data for a target platform including a plurality of hardware units. The representation of a program comprises a plurality of tasks and for each task a task-type indicator. The hardware resource data comprises, for each hardware unit of the target platform, resource data indicating an amount of each of a plurality of resource types available to that hardware unit. The method further comprises determining a plurality of performance value estimates from the algorithm having at least some trainable weights, and allocating, using the system, a task to one of the hardware units according to the determined performance value estimates.
[009] In one embodiment, a computer readable storage medium stores instructions which, when executed by a computer, cause the computer to perform a method. The method comprises inputting a state into a system including an algorithm having at least some trainable weights, the state comprising a representation of a program and hardware resource data for a target platform including a plurality of hardware units. The representation of a program comprises a plurality of tasks and for each task a task-type indicator. The hardware resource data comprises, for each hardware unit of the target platform, resource data indicating an amount of each of a plurality of resource types available to that hardware unit. The method further comprises determining a plurality of performance value estimates from the algorithm having at least some trainable weights, and allocating, using the system, a task to one of the hardware units according to the determined performance value estimates.
Brief Description of the Drawings
[010] Example embodiments will now be described with reference to the accompanying drawings, in which:
[011] FIG. 1 shows an apparatus according to some embodiments described herein;
[012] FIG. 2 shows an example representation of a program according to some embodiments described herein;
[013] FIG. 3 shows an example hardware platform according to some embodiments described herein;
[014] FIGs. 4 and 5 show a system according to some embodiments described herein;
[015] FIG. 6 shows an example algorithm having at least some trainable weights according to some embodiments described herein;
[016] FIG. 7 depicts a high-level block diagram of an apparatus 400 suitable for use in performing functions described herein according to some embodiments; and
[017] FIGs. 8 and 9 show a method according to some embodiments described herein.
Detailed Description
[018] Example embodiments will now be described, including methods and apparatus that avoid the requirement of having to simulate large portions of the design space when allocating a given program’s tasks to a target platform. The method and apparatus, once trained, may be able to map a given program’s tasks to a target platform to achieve high performance with respect to a predetermined performance metric.
[019] Functional blocks denoted as "means configured to perform ..." (a certain function) shall be understood as functional blocks comprising circuitry that is adapted for performing or configured to perform a certain function. A means being configured to perform a certain function does, hence, not imply that such means necessarily is performing said function (at a given time instant). Moreover, any entity described herein as "means", may correspond to or be implemented as "one or more modules", "one or more devices", "one or more units", etc. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional or custom, may also be included. Their function may be carried out through the operation of program logic, through dedicated logic, through the interaction of program control and dedicated logic, or even manually, the particular technique being selectable by the implementer as more specifically understood from the context.
[020] A brief overview of reinforcement learning is presented, focusing only on the notions relevant to the description of the example embodiments below. Reinforcement learning aims to optimize the behavior of an agent which interacts with an environment, by taking actions according to its current state and receiving scalar feedbacks, called rewards, from the environment. The aim of the agent is to take actions which maximize the long-run received reward. Formally, a reinforcement learning problem can be formulated as a Markov Decision Process (S, A, R, E), where S is the state space, A is the action space, R is the reward function and E describes the environment’s dynamics. At every time-step t, the agent is in a state s_t ∈ S, and takes an action a_t ∈ A according to some policy π : S → A, which maps each state to an action to be taken. After having taken an action, the agent “jumps” into a new state s_{t+1} ∈ S, according to the environment’s dynamics: s_{t+1} = E(s_t, a_t).
[021] One form of reinforcement learning is episodic reinforcement learning, which assumes that the agent will reach some final state, s_f, at which point the process ends. The sequence of states and actions taken by the agent, τ = (s_0, a_0, s_1, a_1, ..., s_f), during an episode is called a trajectory, and at the end of an episode the agent receives a reward R = R(τ), which depends on the trajectory. The aim of reinforcement learning techniques is to find a policy which maximizes the reward received at the end of episodes.
[022] FIG. 1 illustrates an example apparatus 100 according to embodiments described herein. The apparatus 100 comprises means configured to perform inputting a state s into a system 110, the system 110 including an algorithm 120 having at least some trainable weights.
[023] The state s comprises a representation of a program G. The representation of a program G comprises a plurality of tasks V and, for each task, a task-type indicator. Referring now to Fig. 2, in some embodiments the representation of a program G may take the form of a directed graph in which the plurality of tasks V are shown as nodes 210 which are connected by directed edges 220. The directed edges 220 represent data and/or control information dependencies, i.e., an edge 220 directed from task v_1 to v_2 means that the execution of v_2 depends on the production of data and/or control information by v_1. In the illustrative example shown in Fig. 2 there are nine tasks, labelled v_1 to v_9; however, other numbers of tasks may be present in the program. Each task v_1...v_9 of the plurality of tasks V includes a task-type indicator, shown as t_1 to t_9 in Fig. 2. Each task-type indicator t_1 to t_9 indicates the type of the corresponding task v_1 to v_9, drawn from a set of task types T. Examples of task types which may be in the set of task types T in some embodiments include Fast Fourier Transforms (FFT), matrix component-wise products, etc. In some embodiments the representation of a program G is output from a compiler, for instance a compiler’s intermediate output.
[024] In some embodiments, the representation of a program G may be a static program representation, namely a representation generated without the need to run the executable generated by a compiler. Such representations can be generated using the compiler’s intermediate representations (e.g., number and type of instructions, control graphs).
[025] The state s further comprises hardware resource data H for a target platform P on which the program G may be executed. The target platform P includes a plurality of hardware units U. The hardware resource data H comprises, for each hardware unit of the target platform P, resource data indicating an amount of each of a plurality of resource types available to that hardware unit. In some embodiments, the hardware units U may include some or all of a central processing unit (CPU), digital signal processor (DSP), field-programmable gate array (FPGA), and graphics processing unit (GPU). In some embodiments, more than one of each type of hardware unit may be present in the target platform P; for example, several CPUs may be present. Example resource types available to a hardware unit include, e.g., computational power and memory size. In some embodiments, hardware resource data H comprises a hardware resources matrix (HRM) H = (h_{ij}), a |U|-by-K matrix such that h_{ij} corresponds to the amount of resources of type j available on unit i, where |U| denotes the number of hardware units and K denotes the number of resource types available to a hardware unit. Referring now to Fig. 3, an example target platform P is shown which comprises hardware units U. In the illustrative example shown in Fig. 3 there are nine hardware units, labelled u_1 to u_9; however, other numbers of hardware units may be present in the target platform.
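As a minimal illustration of the hardware resources matrix described above, the sketch below builds a |U|-by-K list of lists. The unit names, resource types and amounts are hypothetical examples, not values from the patent.

```python
# Hypothetical example of a hardware resources matrix (HRM):
# H[i][j] is the amount of resource type j available on hardware unit i.
RESOURCE_TYPES = ["compute_gflops", "memory_mb"]  # K = 2 resource types

def make_hrm(units):
    """Build (unit_names, H) from a dict: unit name -> resource amounts."""
    names = list(units)
    H = [[units[u][r] for r in RESOURCE_TYPES] for u in names]
    return names, H

unit_names, H = make_hrm({
    "cpu0":  {"compute_gflops": 50, "memory_mb": 4096},
    "dsp0":  {"compute_gflops": 20, "memory_mb": 512},
    "fpga0": {"compute_gflops": 80, "memory_mb": 1024},
})
```

In this sketch each row of H corresponds to one hardware unit u_i and each column to one resource type, matching the |U|-by-K layout of the HRM.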
[026] Referring again to FIG. 1, the apparatus 100 further comprises means 130 configured to perform determining a plurality of performance value estimates from the algorithm 120, and allocating, using the system 110, a task V to a hardware unit U according to the determined performance value estimates. In some embodiments means configured to perform allocating a task V to a hardware unit U may comprise an allocation selector.
[027] In some embodiments the apparatus 100 is configured as an episodic reinforcement learning system, in which the state space is S = G × H. The system 110, which constitutes an agent in reinforcement learning terminology, observes the input state s_t at each time step t and determines which task v_t from V to allocate on which hardware unit u_t from U, as described in further detail below. The couple (v_t, u_t) forms the action a_t taken by system 110 at time step t. The action space is therefore A = V × U. This decision is taken according to a policy π : G × H → V × U, which is implemented by the algorithm 120.
[028] After allocating the task v_t ∈ V on the hardware unit u_t ∈ U, the hardware resource data H is updated. We denote by C : H × T × U → H the operator which associates to hardware resource data H, task v_t and hardware unit u_t the updated hardware resource data reflecting the available resources after having allocated resources to execute task v_t on hardware unit u_t. The new state is therefore calculated by updating the hardware resource data H according to the chosen allocation: H_{t+1} = C(H_t, v_t, u_t). The updated hardware resource data H_{t+1} then forms part of the state s_{t+1} input to the system 110 to make the next allocation. Updating the hardware resource data H implements the environment’s dynamics. C is specific to the hardware platform P, as well as to the considered task types T. The environment is shown in Fig. 1 as E.
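The update operator C is platform- and task-type-specific; the patent does not fix its form. One simple way it could behave is sketched below, where allocating a task subtracts a per-type resource demand from the chosen unit's row of H. The demand table is a hypothetical illustration.

```python
# Hedged sketch of the operator C(H, v_t, u_t): subtract the allocated
# task's resource demand from the chosen unit's row. TASK_DEMAND is a
# hypothetical per-task-type demand table, in the same column order as H.
TASK_DEMAND = {"fft": [10, 256], "matmul": [25, 512]}

def apply_allocation(H, task_type, unit_index):
    """Return the updated hardware resource data H_{t+1}."""
    demand = TASK_DEMAND[task_type]
    new_row = [avail - d for avail, d in zip(H[unit_index], demand)]
    if any(r < 0 for r in new_row):
        raise ValueError("unit lacks resources for this task")
    # Copy unchanged rows so the previous state H_t is left intact.
    return [new_row if i == unit_index else row[:]
            for i, row in enumerate(H)]

H0 = [[50, 4096], [20, 512]]
H1 = apply_allocation(H0, "fft", 0)
```

Because the previous matrix is not mutated, H_t remains available while H_{t+1} is fed back as part of the next state s_{t+1}.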
[029] In some embodiments, the means configured to perform inputting a state s into the system 110 is further configured to input the new state s_{t+1} into the system 110 such that the system 110 performs allocating another task from V to a hardware unit in U according to determined performance value estimates. In some embodiments, the means configured to perform inputting a state s into the system 110 is further configured to repeatedly input each new state into the system until all the tasks V have been allocated to hardware units U on the target platform P. An episode ends once all the tasks in V have been allocated to hardware units U. The sequence of allocations determined by the system 110 during the episode, ((v_0, u_0), (v_1, u_1), ..., (v_{|V|}, u_{|V|})), is also referred to as a trajectory, where |V| denotes the number of tasks in the representation of the program G.
[030] The aim of the training phase, described below, is to find a policy, i.e. values for the trainable weights of the algorithm 120, which maximizes the reward obtained at the end of episodes, which is equivalent to finding a mapping of tasks to hardware units that maximizes the objective function (i.e. which maximizes the performance metric).
[031] Referring now to Fig. 4, one configuration of the system 110 is shown according to some embodiments. The state s is input to the algorithm 120. The algorithm 120 in Fig. 4 is configured to produce a plurality of outputs, the performance value estimates q_1...q_n, each of which corresponds to a predetermined allocation of a corresponding task to a hardware unit.
[032] In embodiments incorporating the system 110 of Fig. 4, the means 130 are configured to perform allocating a task to a hardware unit according to which allocation of the plurality of allocations has the greatest performance value estimate q_1...q_n. In some embodiments, the means 130 may comprise an allocation selector which receives the performance value estimates q_1...q_n, determines which of them has the greatest value, and selects the corresponding action a_t.
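The allocation selector of the Fig. 4 configuration reduces to an argmax over the estimates. A minimal sketch, with hypothetical (task, unit) pairs standing in for the predetermined allocations:

```python
# Sketch of the Fig. 4 allocation selector: one estimate per
# predetermined allocation, pick the allocation with the greatest one.
def select_allocation(allocations, q_values):
    """Return the (task, unit) pair with the highest estimate."""
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return allocations[best]

allocs = [("v1", "u1"), ("v1", "u2"), ("v2", "u1")]
qs = [0.3, 0.9, 0.5]
chosen = select_allocation(allocs, qs)  # ("v1", "u2")
```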
[033] The embodiment described in relation to Fig. 4 is suitable for cases where the number of possible allocations is sufficiently small that the algorithm 120 can be realized. In other words, the action space A needs to be of a size to allow practical implementation and training of the algorithm 120. In many cases the action space A will be of a size where it is impractical to implement the algorithm 120 to provide performance value estimates for even a sizable portion of the possible allocations of tasks to hardware units. In such cases, embodiments incorporating the system 110 of Fig. 5 may be more suitable.
[034] Referring now to Fig. 5, another configuration of the system 110 is shown according to some embodiments. The algorithm 120 of the system 110 in Fig. 5 is configured to receive the state s and a predetermined allocation of a task to a hardware unit, a_i, and to provide a performance value estimate, q_i, for that predetermined allocation. The algorithm 120 produces a single output in the system 110 of Fig. 5.
[035] In embodiments incorporating the system 110 of Fig. 5, the means 130 are configured to perform sending, in turn, each of a plurality of predetermined allocations, a_i, determined from a plurality of possible allocations to the algorithm 120. Each predetermined allocation, a_i, is sent to the algorithm together with the state s. The means 130 are further configured to perform receiving a performance value estimate q_i from the algorithm 120 for each sent predetermined allocation a_i. The means 130 are further configured to perform allocating a task v_t to a hardware unit u_t corresponding to the predetermined allocation, a_t, having the greatest performance value estimate received from the algorithm 120. In some embodiments, the means 130 may comprise an allocation selector 130 which receives the state s and sends in turn each of the predetermined allocations, a_i, to the algorithm 120. The allocation selector 130 receives the performance value estimate q_i for the sent allocation a_i from the algorithm 120. Once the allocation selector 130 has received performance value estimates q_i for each predetermined allocation a_i, the allocation selector 130 determines which performance value estimate has the greatest value and selects the corresponding allocation a_t of a task to a hardware unit.
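In the Fig. 5 configuration the selector queries the estimator once per candidate, passing the state alongside each allocation. A sketch under the assumption that the trained algorithm 120 can be treated as a callable returning one scalar (the `estimator` below is a hypothetical stand-in):

```python
# Sketch of the Fig. 5 selector loop: one estimator query per candidate
# allocation a_i, each sent together with the state s; keep the argmax.
def select_by_query(state, candidates, estimator):
    best_q, best_a = float("-inf"), None
    for a in candidates:
        q = estimator(state, a)   # single scalar output per query
        if q > best_q:
            best_q, best_a = q, a
    return best_a

# Toy stand-in estimator: prefers lower-numbered hardware units.
toy = lambda s, a: -int(a[1][1:])
chosen = select_by_query({"H": [], "G": []},
                         [("v1", "u3"), ("v1", "u1"), ("v1", "u2")], toy)
```

This trades one large output layer (Fig. 4) for repeated single-output queries, which is why it scales to action spaces too large to enumerate as network outputs.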
[036] Where the number of considered tasks and hardware units, respectively |V| and |U|, are relatively small (up to a few tens), the means 130 may be configured to perform an exhaustive search of the possible allocations at each time step. That is, the plurality of predetermined allocations, a_i, determined from a plurality of possible allocations may consist of every possible allocation.
[037] In some embodiments, the number of predetermined allocations sent to the algorithm may be reduced, thereby making the system 110 less resource consuming. First, at each time step t, only the set of tasks not yet allocated may be considered for allocation. Thus, V_t = {v ∈ V | v ∉ {v_0, ..., v_{t-1}}}. In some embodiments, this may be effected by including in the environment dynamics an update to the set of tasks to be allocated, i.e. V_{t+1} = V_t \ {v_t}, where V_t is the set of tasks remaining to be allocated at time step t and v_t is the task allocated at time step t.
[038] Moreover, in some embodiments, at time step t, for a given task v ∈ V_t, not all hardware units need to be considered. Only hardware units that (i) are capable of executing v (e.g., in the sense of instruction set), and (ii) have enough available resources, need to be considered. Thus means 130 may be configured to determine the plurality of predetermined allocations a_i from the plurality of possible allocations according to the task-type indicator for each task v to be allocated and an available amount of at least one of the plurality of resource types available to each hardware unit. This determination may be made from the state s_t at time step t.
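The pruning described in the two paragraphs above can be sketched as a candidate generator that keeps only unallocated tasks and only units that both support the task's type and still have enough resources. The capability and demand tables are hypothetical illustrations, not part of the patent.

```python
# Hedged sketch of candidate-allocation pruning. UNIT_SUPPORTS and
# TASK_DEMAND are hypothetical per-platform tables; H rows follow the
# same resource-type order as TASK_DEMAND entries.
UNIT_SUPPORTS = {"u1": {"fft", "matmul"}, "u2": {"fft"}}
TASK_DEMAND = {"fft": [10, 256], "matmul": [25, 512]}

def candidate_allocations(tasks, task_types, allocated, H, unit_names):
    out = []
    for v in tasks:
        if v in allocated:
            continue                       # only tasks not yet allocated
        t = task_types[v]
        for i, u in enumerate(unit_names):
            supports = t in UNIT_SUPPORTS[u]            # criterion (i)
            enough = all(a >= d                         # criterion (ii)
                         for a, d in zip(H[i], TASK_DEMAND[t]))
            if supports and enough:
                out.append((v, u))
    return out

cands = candidate_allocations(
    ["v1", "v2"], {"v1": "matmul", "v2": "fft"}, {"v1"},
    [[50, 4096], [5, 128]], ["u1", "u2"])
```

Here v1 is already allocated and u2 lacks resources for an FFT, so only (v2, u1) survives the pruning.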
[039] Referring now to Fig. 6, in some embodiments the algorithm 120 is implemented as a neural network 140 having a plurality of trainable weights. The algorithm 120 in some embodiments comprises a plurality of dense layers 142. In the system 110 shown in Fig. 5, the algorithm 120 outputs a single real number representing the performance value estimate q_i, so the last layer 142 of the algorithm 120 must consist of a single neuron in such embodiments. The choice of the activation function of the last layer 142 depends upon the range of the performance value estimate, or reward, which is application specific. For example, if the reward takes values in [0, 1], a sigmoid activation could be used, whereas if the reward takes values in the set of non-negative real numbers, R+, a ReLU activation could be preferred.
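A tiny dense network of this shape, with a single sigmoid output neuron for the reward-in-[0, 1] case, can be sketched in plain Python. The layer sizes and weights below are fixed toy values for illustration, not trained parameters.

```python
# Illustrative sketch of the value estimator: one hidden dense layer
# with ReLU, then a single output neuron with sigmoid (reward in [0, 1]).
# The toy weights are hypothetical, not learned values.
import math

def dense(x, W, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def q_estimate(x, params):
    h = relu(dense(x, params["W1"], params["b1"]))  # hidden dense layer
    (out,) = dense(h, params["W2"], params["b2"])   # single output neuron
    return sigmoid(out)                             # estimate in (0, 1)

toy = {"W1": [[0.5, -0.2], [0.1, 0.3]], "b1": [0.0, 0.1],
       "W2": [[1.0, -1.0]], "b2": [0.0]}
q = q_estimate([1.0, 2.0], toy)
```

Swapping the final `sigmoid` for `relu` would give the R+ variant the paragraph mentions.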
[040] An important consideration is how the state s is input to the neural network 140. The hardware resource data H may be represented as a matrix as described above, in which case the matrix may be input to the neural network 140. The representation of the program G also needs to be in a form suitable for feeding to the neural network 140 while keeping information about the proximity of the nodes 210. One approach to achieve this goal consists of mapping each node 210 of V to a real-valued vector, such that nodes that are close in G are mapped to vectors that are close in the sense of a norm such as the Euclidean distance. This can be done using, e.g., the DeepWalk algorithm (Perozzi et al. [2014]). The vector representation of the directed graph G may then be fed to the neural network 140. In some embodiments, the algorithm 120 includes a representation extractor 150 that receives the directed graph G and produces therefrom a vector representation that is input to the neural network. In some embodiments, such as the embodiment illustrated in Fig. 5, the allocation a_i is also input to the neural network 140. In some embodiments, such as the embodiment illustrated in Fig. 4, only the state s is input to the neural network 140. In the embodiment illustrated in Fig. 4, the neural network 140 outputs a plurality of performance value estimates q_i, so the last layer 142 of the algorithm 120 must consist of a corresponding plurality of neurons in such embodiments.
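The core of DeepWalk is generating truncated random walks over the graph; the resulting node sequences are then fed to a skip-gram model (not shown here) to obtain the real-valued node vectors. A sketch of the walk-generation step over a hypothetical fragment of a task graph like Fig. 2:

```python
# Sketch of DeepWalk-style truncated random walks over a task graph.
# EDGES is a hypothetical adjacency list; the skip-gram training that
# turns walks into embedding vectors is omitted.
import random

EDGES = {"v1": ["v2", "v3"], "v2": ["v4"], "v3": ["v4"], "v4": []}

def random_walk(start, length, rng):
    walk = [start]
    while len(walk) < length:
        successors = EDGES[walk[-1]]
        if not successors:
            break                      # sink node: walk ends early
        walk.append(rng.choice(successors))
    return walk

rng = random.Random(0)
walks = [random_walk("v1", 4, rng) for _ in range(5)]
```

Each walk is a node sequence in which graph-adjacent tasks co-occur, which is what lets the downstream skip-gram place nearby nodes at nearby vectors.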
[041] Training of the system 110, which aims to find a policy which maximizes the reward obtained at the end of an episode, will now be described. That is, during training the trainable weights of the algorithm 120 are adjusted such that the performance value estimates provided by the algorithm 120 result in the system 110 allocating tasks to hardware units in a manner that maximizes the performance of the program G when executed on target platform P with respect to a given performance criterion.
[042] Referring now to FIG. 1, in some embodiments, the apparatus further comprises means 160 configured to perform training of the trainable weights of the algorithm 120. Representations of a plurality of programs 170, which will be used to train the algorithm 120, are provided. The means 160 is configured to perform training of the algorithm 120 by selecting a program G from the plurality of training programs 170. Initially, t is set to 0 and the hardware resource data H_0 is established according to the resources initially available for allocating to the program tasks. The representation of the program G gives the initial set of tasks V_0 to be allocated.
[043] The training algorithm described below is based on Q-learning (Watkins and Dayan [1992]); however, other suitable training algorithms may be used. Q-learning involves training the algorithm 120 to learn an approximation of the optimal value function Q* : S × A → R, which gives for each state-action couple the reward obtained at the end of the episode if the optimal policy is followed. We denote the learned approximation by Q_θ, where θ is the parameter vector (i.e. the values of the trainable weights of the algorithm 120). The policy used by the system 110 to allocate application tasks to hardware units is then simply the deterministic policy which chooses for each state s_t = (H_t, G) the action a_t that maximizes the performance value estimate: a_t = argmax_{(v,u)} Q_θ((H_t, G), (v, u)), where (H_t, G) is the state s_t and (v_t, u_t) is the allocation a_t denoting the allocation of task v_t to hardware unit u_t.
[044] As the numbers of considered tasks and hardware units, respectively |V| and |U|, are relatively small (up to a few tens), in some embodiments the means 160 is configured to train the algorithm by performing an exhaustive search. That is, at each time step t, every possible allocation, or action, a_i, is assessed by passing the state s_t and allocation a_i to the algorithm 120. The algorithm 120 returns a performance value estimate q_i, which corresponds to the allocation a_i denoting the allocation of task v_i to hardware unit u_i. The allocation having the highest performance value estimate received from the algorithm 120 is selected as the allocation at time step t.
[045] In some embodiments, training of the system 110 may be made less resource consuming by considering for allocation only the set of tasks not yet allocated at each time step t. Thus, V_t = {v ∈ V | v ∉ {v_0, ..., v_{t-1}}}. Alternatively, or in addition, in some embodiments, at time step t, for a given task v ∈ V_t, not all hardware units need to be considered. Only hardware units that (i) are capable of executing v (e.g., in the sense of instruction set), and (ii) have enough available resources, need to be considered. Thus means 160 may be configured to determine the plurality of predetermined allocations a_i from the plurality of possible allocations according to the task-type indicator for each task v to be allocated and an available amount of at least one of the plurality of resource types available to each hardware unit. This determination may be made from the state s_t at time step t.
[046] During the training, exploration of the environment is required to sample new trajectories and improve the estimation Qθ of the optimal value function Q*. To explore the environment during training, an ε-greedy exploration strategy is used, wherein the means 160 may be further configured to configure the means 130 during training such that at each time step t, the means 130: selects the allocation ai corresponding to the highest performance value estimate qi received from the algorithm 120, with probability 1−ε; and randomly selects an allocation ai from all available allocations, with probability ε.
[047] The parameter ε ∈ [0, 1] controls the amount of exploration used during training. A value of 1 leads to pure exploration, meaning the means 130 randomly chooses an allocation ai from all available allocations at each time step, resulting in a random trajectory. A value of 0 leads to no exploration, meaning the means 130 chooses the allocation ai having the greatest performance value estimate at each time step.
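A sketch of the ε-greedy selection, with `eps` playing the role of the parameter ε (the allocation/estimate pairs are illustrative):

```python
import random

def epsilon_greedy(allocations, q_values, eps, rng=random):
    """With probability eps explore (pick a random allocation); otherwise
    exploit (pick the allocation with the highest estimate)."""
    if rng.random() < eps:
        return rng.choice(allocations)                  # exploration
    return max(zip(allocations, q_values), key=lambda p: p[1])[0]  # exploitation

allocs = [("v1", 0), ("v1", 1)]
print(epsilon_greedy(allocs, [0.2, 0.9], eps=0.0))   # eps=0: pure greedy
# → ('v1', 1)
```

With eps=1.0 the choice is uniformly random over the available allocations, matching the pure-exploration case described above.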
[048] The means 160 may be configured to set ε = 0 at the end of training so that the means 130 always chooses the best action learned so far according to the performance value estimates.
[049] The means 160 is further configured to perform training of the algorithm 120 by generating a new state after the allocation at is determined, by updating the hardware resource data according to Ht+1 = C(Ht, vt, ut) and updating the set of tasks to be allocated according to Vt+1 = Vt − {vt}. The means 160 is further configured to perform training of the algorithm 120 by performing further allocations until all tasks in the program G have been allocated to hardware units U on target platform P, which constitutes the end of the episode.
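The state transition can be illustrated as follows. Subtracting the allocated task's resource demand from its unit's row is one plausible form of the update function C, given here as an assumption:

```python
def transition(H, V, allocation, needs, task_type):
    """C(H_t, v_t, u_t): subtract the allocated task's resource demand from
    its unit's row, and remove the task from the set V_t."""
    v, u = allocation
    need = needs[task_type[v]]
    H_next = [row[:] for row in H]                    # copy H_t
    H_next[u] = [r - n for r, n in zip(H_next[u], need)]
    V_next = V - {v}                                  # V_{t+1} = V_t - {v_t}
    return H_next, V_next

H1, V1 = transition([[4, 2], [1, 8]], {"v1", "v2"}, ("v1", 0),
                    {"cpu": [2, 1]}, {"v1": "cpu"})
print(H1, V1)
# → [[2, 1], [1, 8]] {'v2'}
```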
[050] The means 160 is further configured to perform training of the algorithm 120 by adjusting the trainable weights of the algorithm 120 using a loss function pertaining to the performance metric. In some embodiments, the means 160 is further configured to adjust the trainable weights of the algorithm 120 as follows. For every allocation at made during the episode, where t = 1, ..., |V|, the trainable weights θ are updated according to:

θ ← θ − α ∇θ ( Qθ( (Ht, G), (vt, ut) ) − Q̂( (Ht, G), (vt, ut) ) )²

where α is a learning rate, ∇θ is the gradient operator, and Q̂ represents the target, which is used instead of the optimal value function Q* since Q* is often not available. Q̂ takes the form:

Q̂( (Ht, G), (vt, ut) ) = maxa Qθ( (Ht+1, G), a ), if the state is not final,

= R, if the state is final

where R is the reward. Thus, the target Q̂ is the best available performance value estimate for the action if the state is not final, and is the reward if the state is final. The target Q̂ is chosen to take this form since the optimal value function corresponds to the reward received at the end of the episode if the actions are chosen so as to maximize this reward. During the training, at the end of an episode for which the reward R was received, the parameters vector is updated by stochastic gradient descent (SGD) as described above; the update can be performed using any algorithm available in the literature (e.g., ADAM, RMSProp, Momentum). In some embodiments, the reward R may be obtained by benchmarking. Program benchmarking aims at evaluating the effectiveness of the system's allocations by measuring the performance of the program when executing on the target platform P. This evaluation may be conducted by providing the executing program with different types of workloads appropriate to the program. Data are collected by profiling the executable program (e.g., monitoring variables' values, memory accesses, control statements, power usage), and from these data the reward R, relevant to a certain performance metric, is determined. In some embodiments, the reward R may be determined from data obtained by modelling or simulating the performance of the program when executing on the target platform P.
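A toy sketch of the update θ ← θ − α∇θ(Qθ − Q̂)². For tractability it uses a linear Qθ with an analytic gradient rather than the neural network of the disclosure; the feature vector and target value are invented for the example:

```python
def sgd_step(theta, features, target, alpha):
    """One step theta <- theta - alpha * grad_theta (Q_theta - Qhat)^2
    for a linear model Q_theta(s, a) = theta . phi(s, a)."""
    q = sum(t * f for t, f in zip(theta, features))   # Q_theta(s_t, a_t)
    err = q - target                                  # Q_theta - Qhat
    # grad of (q - target)^2 w.r.t. theta_k is 2 * err * phi_k
    return [t - alpha * 2.0 * err * f for t, f in zip(theta, features)]

theta = [0.0, 0.0]
for _ in range(100):            # drive Q_theta toward the reward R = 1.0
    theta = sgd_step(theta, [1.0, 0.5], target=1.0, alpha=0.1)
q = theta[0] * 1.0 + theta[1] * 0.5
print(round(q, 3))
# → 1.0
```

In practice the plain SGD step above would be replaced by ADAM, RMSProp, or Momentum, as the paragraph notes.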
[051] The means 160 is next configured to perform training of the system 110 by either stopping or repeating the above steps for another iteration. In some embodiments, the stop criterion can take multiple forms, including: stopping after a fixed number of training iterations, or stopping when the reward R reaches a desired value.
[052] In some embodiments, the parameter ε, the learning rate α and possibly other parameters of the chosen SGD variant (e.g., ADAM, RMSProp, Momentum) are parameters of the means 160.
[053] As the means 160 does not rely on examples of optimized task allocations for the programs forming the training set 170, any new program for which the apparatus 100 might be useful can be added to the training set 170 and used to further optimize the system 110 after its initial training phase.
[054] In some embodiments, the training of the algorithm 120 may be specific to the hardware platform P as well as to the resource needs of the different task types T forming the program.
[055] Referring now to FIG. 7, in some embodiments, the means configured to perform the functions described herein may take the form of an apparatus 400 comprising at least one processor 410 (e.g., a central processing unit (CPU) and/or other suitable processor(s)) and at least one memory 420 (e.g., random access memory (RAM), read only memory (ROM), or the like). The apparatus 400 further comprises computer program code 430 and various input/output devices 440 (e.g., a user input device (such as a keyboard, a keypad, a mouse, a microphone, a touch-sensitive display or the like), a user output device 450 (such as a display, a speaker, or the like), and storage devices (e.g., a tape drive, a floppy drive, a hard disk drive, a compact disk drive, non-volatile memory or the like)). The computer program code 430 can be loaded into the memory 420 and executed by the processor 410 to implement functions as discussed herein; thus, the computer program code 430 (including associated data structures) can be stored on a computer readable storage medium, e.g., RAM memory, magnetic or optical drive or diskette, or the like.
[056] Some embodiments relate to a method 500 as shown in FIG. 8. The method 500 comprises, at 502, inputting a state into a system including an algorithm having at least some trainable weights. The state comprises a representation of a program. The representation of a program comprises a plurality of tasks and, for each task, a task-type indicator. In some embodiments the representation of a program G may take the form of a directed graph as described above in relation to Fig. 2.
[057] The state further comprises hardware resource data for a target platform on which the program may be executed. The target platform includes a plurality of hardware units. The hardware resource data comprises, for each hardware unit of the target platform, resource data indicating an amount of each of a plurality of resource types available to that hardware unit. In some embodiments, the hardware units may include some or all of a central processing unit (CPU), digital signal processor (DSP), field-programmable gate array (FPGA), and graphics processing unit (GPU). In some embodiments, more than one of each type of hardware unit may be present in the target platform; for example, several CPUs may be present. Example resource types available to a hardware unit include, e.g., computational power and memory size. In some embodiments, the hardware resource data H comprises a hardware resources matrix (HRM) H = (hij)ij as a U-by-K matrix such that hij corresponds to the amount of resources of type j available on unit i, where U denotes the number of hardware units and K denotes the number of resource types available to a hardware unit.
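An illustrative HRM for a hypothetical platform; the units, resource types, and amounts below are invented for the example:

```python
# Hypothetical target platform: U = 3 hardware units and K = 2 resource
# types (computational power in arbitrary units, memory size in MB).
H = [
    [8.0, 4096.0],   # unit 0: CPU
    [2.0, 512.0],    # unit 1: DSP
    [6.0, 1024.0],   # unit 2: FPGA
]
U, K = len(H), len(H[0])
print(U, K, H[1][1])   # h_{1,1}: memory available on the DSP
# → 3 2 512.0
```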
[058] The method 500 further comprises, at 504, determining a plurality of performance value estimates from the algorithm. The method 500 further comprises, at 506, allocating, using the system, a task to a hardware unit according to the determined performance value estimates.
[059] In some embodiments, the method 500 further comprises, at 504, determining, using the system, performance value estimates for a plurality of allocations of a task to a hardware unit, and allocating, at 506, a task to one of the hardware units according to which allocation of the plurality of possible allocations has the greatest performance value estimate.
[060] In some embodiments, the method 500 further comprises allocating, at 506, using the system, a task to a hardware unit by sending, in turn, each of a plurality of predetermined allocations determined from a plurality of possible allocations to the algorithm. Each predetermined allocation is sent to the algorithm together with the state. In such embodiments, the method 500 further comprises receiving a determined performance value estimate from the algorithm for each sent predetermined allocation. In such embodiments, the method 500 further comprises allocating a task to a hardware unit corresponding to the predetermined allocation having the greatest performance value estimate received from the algorithm.
[061] In some embodiments, the method 500 further comprises performing an exhaustive search of the possible allocations at each time step. That is, the method comprises sending, in turn, every possible allocation to the algorithm.
[062] In some embodiments, the method 500 further comprises excluding tasks already allocated from consideration for allocation. Moreover, in some embodiments the method 500 further comprises, for a given task, determining hardware units that (i) are capable of executing the task (e.g., in the sense of instruction set), and (ii) have enough available resources, and including only these hardware units in the predetermined allocations for the task that are sent to the algorithm. In some embodiments, the method 500 further comprises determining the plurality of predetermined allocations from the plurality of possible allocations according to the task-type indicator for each task to be allocated and an available amount of at least one of the plurality of resource types available to each hardware unit.
[063] In some embodiments, the method 500 further comprises, at 508, determining a new state by updating the hardware resource data according to the allocation. The new state is determined by updating the hardware resource data according to the chosen allocation: Ht+1 = C(Ht, (vt, ut)). In some embodiments, at 508, the method further comprises determining the new state st+1 by also determining an updated set of tasks that remain to be allocated. In some embodiments, the method 500 further comprises, at 510, inputting the new state into the system such that the system 110 performs allocating another task to a hardware unit according to determined performance value estimates, and repeating until all tasks have been allocated by the system.
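Steps 504-510 can be sketched as a loop. Here `q_estimate` and `transition` are hypothetical stand-ins for the trained algorithm and the update function C, supplied as callables for the example:

```python
def allocate_all(H, tasks, units, q_estimate, transition):
    """Repeat: score every remaining (task, unit) pair, commit the best,
    update the state, until all tasks are allocated (steps 504-510)."""
    V, schedule = set(tasks), []
    while V:
        best = max(((v, u) for v in V for u in units),
                   key=lambda a: q_estimate(H, V, a))
        schedule.append(best)
        H, V = transition(H, V, best)          # s_{t+1}
    return schedule

# Toy stand-ins: the estimate favours unit 0; the transition only
# removes the allocated task and leaves the resource data unchanged.
sched = allocate_all([[1]], ["v1", "v2"], [0, 1],
                     lambda H, V, a: -a[1],
                     lambda H, V, a: (H, V - {a[0]}))
print(sorted(sched))
# → [('v1', 0), ('v2', 0)]
```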
[064] In some embodiments, the method 500 further comprises, at 512, training of the system as described in more detail below with reference to Fig. 9. In some embodiments, training 512 of the system is performed prior to use. In some embodiments, training 512 of the system may also be performed after the system has been used, to further improve its allocations. For example, in some embodiments, the system may be used to allocate tasks to hardware units for a plurality of programs provided in an input queue, and further training 512 of the system may be performed while the queue is empty, i.e., while the system is idle. Referring now to Fig. 9, training 512 of the system comprises, at 520, providing representations of a plurality of programs which will be used to train the algorithm. In some embodiments, training 512 further comprises, at 522, selecting a program from the plurality of training programs. Initially, t is set to 0 and the hardware resource data H0 is determined according to the resources initially available for allocating to the program tasks. The initial set of tasks V0 to be allocated is determined from the representation of the program.
[065] In some embodiments, training 512 further comprises, at 524, determining a plurality of performance value estimates from the algorithm having at least some trainable weights. As the number of considered tasks and hardware units, respectively |V| and |U|, is relatively small (up to a few tens), in some embodiments an exhaustive search is performed. That is, at each time step t, every possible allocation, or action, is assessed by passing the state and allocation to the algorithm. The algorithm returns a performance value estimate for each allocation.
[066] In some embodiments, training 512 further comprises, at 526, allocating, using the system, a task to a hardware unit according to the determined performance value estimates by selecting the allocation having the highest performance value estimate received from the algorithm. In some embodiments, training 512 may be made less resource consuming by excluding tasks already allocated from consideration for allocation. Moreover, in some embodiments the training 512 further comprises, for a given task, determining hardware units that (i) are capable of executing the task (e.g., in the sense of instruction set), and (ii) have enough available resources, and including only these hardware units in the predetermined allocations for the task that are sent to the algorithm. In some embodiments, training 512 further comprises determining the plurality of predetermined allocations from the plurality of possible allocations according to the task-type indicator for each task to be allocated and an available amount of at least one of the plurality of resource types available to each hardware unit.
[067] In some embodiments, allocating 526 further comprises an ε-greedy exploration strategy, wherein at each time step, the allocation corresponding to the highest performance value estimate received from the algorithm is selected with probability 1−ε, and a random allocation from all available allocations is selected with probability ε. The parameter ε ∈ [0, 1] controls the amount of exploration used during training.
[068] In some embodiments, training 512 further comprises determining a new state once the allocation is determined, by updating the hardware resource data according to Ht+1 = C(Ht, vt, ut) and updating the set of tasks to be allocated according to Vt+1 = Vt − {vt}.
[069] In some embodiments, training 512 further comprises, at 528, performing further allocations by repeating steps 524 and 526 until all tasks in the program have been allocated to hardware units on the target platform, which constitutes the end of the episode.
[070] In some embodiments, training 512 further comprises, at 530, performing training of the algorithm by adjusting the trainable weights of the algorithm using a loss function pertaining to the performance metric. In some embodiments, the trainable weights of the algorithm are adjusted using stochastic gradient descent or a variant thereof.
[071] In some embodiments, the trainable weights θ of the algorithm are adjusted according to:

θ ← θ − α ∇θ ( Qθ( (Ht, G), (vt, ut) ) − Q̂( (Ht, G), (vt, ut) ) )²

where α is a learning rate, ∇θ is the gradient operator, and Q̂ represents the target. Q̂ takes the form:

Q̂( (Ht, G), (vt, ut) ) = maxa Qθ( (Ht+1, G), a ), if the state is not final,

= R, if the state is final

where R is the reward. Thus, the target Q̂ is the best available performance value estimate for the action if the state is not final, and is the reward if the state is final. The target Q̂ is chosen to take this form since the optimal value function corresponds to the reward received at the end of the episode if the actions are chosen so as to maximize this reward.
[072] In some embodiments, training 512 further comprises either stopping or repeating the above steps for another iteration. In some embodiments, the stop criterion can take multiple forms, including: stopping after a fixed number of training iterations; or stopping when the reward R reaches a desired value.
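The training outer loop with these two stop criteria can be sketched as follows; the episode runner, weight-update callback, and reward schedule are toy stand-ins:

```python
def train(run_episode, update_weights, max_iters, target_reward):
    """Repeat episodes until a fixed iteration budget is exhausted or the
    reward R reaches a desired value (the two stop criteria above)."""
    history = []
    for _ in range(max_iters):
        reward = run_episode()        # one full allocation episode
        update_weights(reward)        # SGD adjustment of the weights
        history.append(reward)
        if reward >= target_reward:   # stop criterion: desired reward
            break
    return history

# Toy episode whose reward improves by 0.25 per iteration.
rewards = iter([0.25, 0.5, 0.75, 1.0, 1.0])
hist = train(lambda: next(rewards), lambda r: None,
             max_iters=10, target_reward=1.0)
print(hist)   # stops as soon as R reaches 1.0
# → [0.25, 0.5, 0.75, 1.0]
```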
[073] It will be appreciated that the functions depicted and described herein may be implemented in software (e.g., via implementation of software on one or more processors, for executing on a general purpose computer (e.g., via execution by one or more processors) so as to implement a special purpose computer, or the like) and/or may be implemented in hardware (e.g., using a general purpose computer, one or more application specific integrated circuits (ASIC), and/or any other hardware equivalents).
[074] A further embodiment is a computer program product comprising a computer readable storage medium having computer readable program code embodied therein, the computer readable program code being configured to implement one of the above methods when being loaded on a computer, a processor, or a programmable hardware component. In some embodiments, the computer readable storage medium is non-transitory.
[075] A person of skill in the art would readily recognize that steps of various above- described methods can be performed by programmed computers. Herein, some embodiments are also intended to cover program storage devices, e.g., digital data storage media, which are machine or computer readable and encode machine-executable or computer-executable programs of instructions where said instructions perform some or all of the steps of methods described herein. The program storage devices may be, e.g., digital memories, magnetic storage media such as magnetic disks and magnetic tapes, hard drives, or optically readable digital data storage media. The embodiments are also intended to cover computers programmed to perform said steps of methods described herein or (field) programmable logic arrays ((F)PLAs) or (field) programmable gate arrays ((F)PGAs), programmed to perform said steps of the above-described methods.
[076] It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
[077] While aspects of the present disclosure have been particularly shown and described with reference to the embodiments above, it will be understood by those skilled in the art that various additional embodiments may be contemplated by the modification of the disclosed machines, systems and methods without departing from the scope of what is disclosed. Such embodiments should be understood to fall within the scope of the present disclosure as determined based upon the claims and any equivalents thereof.

Claims
1. An apparatus (100) comprising means configured to perform:
inputting a state (s, st) into a system (110), the system including an algorithm (120) having at least some trainable weights, the state comprising a representation of a program (G) and hardware resource data (H) for a target platform (P) including a plurality of hardware units (ul ...u9), the representation of a program comprising a plurality of tasks (vl ...v9) and for each task a task-type indicator, the hardware resource data comprising, for each hardware unit of the target platform, resource data indicating an amount of each of a plurality of resource types available to that hardware unit;
determining a plurality of performance value estimates (qi) from the algorithm (120) having at least some trainable weights;
allocating (at), using the system (110), a task (vt) to a hardware unit (ut) according to the determined performance value estimates (qi).
2. The apparatus of claim 1, wherein:
the algorithm (120) having at least some trainable weights is configured to provide performance value estimates (qi) for a plurality of allocations (ai) of a task (vi) to a hardware unit (ui), and
allocating (at), using the system (110), a task (vt) to a hardware unit (ut) according to the determined performance value estimates (qi) comprises allocating (at) a task (vt) to a hardware unit (ut) according to which allocation of the plurality of allocations (ai) has the greatest performance value estimate (qi).
3. The apparatus of claim 1, wherein:
the algorithm (120) having at least some trainable weights is configured to receive the state (s, st) and a predetermined allocation (ai) of a task (vi) to a hardware unit (ui) and to provide a performance value estimate (qi) for the predetermined allocation; and
allocating (at), using the system (110), a task (vt) to a hardware unit (ut) comprises: sending, in turn, each of a plurality of predetermined allocations (ai) determined from a plurality of possible allocations to the algorithm, each predetermined allocation being sent to the algorithm together with the state;
receiving a performance value estimate (qi) for the sent predetermined allocation (ai) from the algorithm (120); and
allocating (at) a task (vt) to a hardware unit (ut) corresponding to the predetermined allocation (ai) having the greatest performance value estimate (qi) received from the algorithm (120).
4. The apparatus of claim 3, wherein:
the plurality of predetermined allocations (ai) is determined from the plurality of possible allocations according to the task-type indicator for each task (v) to be allocated and an available amount of at least one of the plurality of resource types available to each hardware unit (u).
5. The apparatus of claim 4, wherein the means are further configured to perform:
determining a new state (st+1) by updating the hardware resource data (Ht) according to the allocation (at);
inputting the new state (st+1) into the system (110); and
repeating until all tasks (v) have been allocated by the system (110).
6. The apparatus of any preceding claim, wherein the means (160) are further configured to perform:
training of the system (110), comprising:
providing representations of a plurality of programs (170);
selecting a representation of a program from the plurality of programs (170);
determining a plurality of performance value estimates (qi) from the algorithm (120) having at least some trainable weights;
allocating (at), using the system (110), a task (vt) to a hardware unit (ut) according to the determined performance value estimates (qi);
repeating the previous two steps until all tasks (v) have been allocated to hardware units (u); and
adjusting the trainable weights (θ) of the algorithm (120) using a loss function pertaining to the performance metric.
7. The apparatus of claim 6, wherein:
the trainable weights (θ) of the algorithm (120) are adjusted using stochastic gradient descent or a variant thereof.
8. The apparatus of any preceding claim, wherein the means comprises:
at least one processor (410); and
at least one memory (420) including computer program code (430), the at least one memory (420) and computer program code (430) configured to, with the at least one processor (410), cause the performance of the apparatus (100).
9. A method (500) comprising:
inputting (502) a state into a system including an algorithm having at least some trainable weights, the state comprising a representation of a program and hardware resource data for a target platform including a plurality of hardware units, the representation of a program comprising a plurality of tasks and for each task a task-type indicator, the hardware resource data comprising, for each hardware unit of the target platform, resource data indicating an amount of each of a plurality of resource types available to that hardware unit; and
determining (504) a plurality of performance value estimates from the algorithm having at least some trainable weights;
allocating (506), using the system, a task to one of the hardware units according to the determined performance value estimates.
10. The method of claim 9, further comprising:
determining (504), using the system, performance value estimates for a plurality of allocations of a task to a hardware unit, and
allocating (506) a task to one of the hardware units according to which allocation of the plurality of possible allocations has the greatest performance value estimate.
11. The method of claim 9, further comprising:
allocating (506), using the system, a task to a hardware unit by:
sending, in turn, each of a plurality of predetermined allocations to the algorithm, each predetermined allocation being sent to the algorithm together with the state; and
receiving a determined (504) performance value estimate for the sent predetermined allocation from the algorithm; and
allocating a task to one of the hardware units corresponding to the predetermined allocation having the greatest performance value estimate received from the algorithm.
12. The method of claim 11, further comprising:
determining the plurality of predetermined allocations from a plurality of possible allocations according to the task-type indicator for each task to be allocated and an available amount of at least one of the plurality of resource types available to each hardware unit.
13. The method of any preceding claim, further comprising:
training (512) the system, comprising:
providing (520) representations of a plurality of programs;
selecting (522) a representation of a program from the plurality of programs;
determining (524) a plurality of performance value estimates from the algorithm having at least some trainable weights;
allocating (526), using the system, a task to a hardware unit according to the determined performance value estimates;
repeating (528) the previous two steps until all tasks have been allocated to hardware units; and
adjusting (530) the trainable weights of the algorithm using a loss function pertaining to the performance metric.
14. The method of claim 13, wherein:
the trainable weights of the algorithm are adjusted using stochastic gradient descent or a variant thereof.
15. A computer readable storage medium storing instructions which, when executed by a computer, cause the computer to perform a method, the method comprising:
inputting (502) a state into a system including an algorithm having at least some trainable weights, the state comprising a representation of a program and hardware resource data for a target platform including a plurality of hardware units, the representation of a program comprising a plurality of tasks and for each task a task-type indicator, the hardware resource data comprising, for each hardware unit of the target platform, resource data indicating an amount of each of a plurality of resource types available to that hardware unit; and
determining (504) a plurality of performance value estimates from the algorithm having at least some trainable weights;
allocating (506), using the system, a task to a hardware unit according to the determined performance value estimates.
PCT/EP2019/054078 2019-02-19 2019-02-19 Method and apparatus for allocating tasks WO2020169182A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/054078 WO2020169182A1 (en) 2019-02-19 2019-02-19 Method and apparatus for allocating tasks


Publications (1)

Publication Number Publication Date
WO2020169182A1 true WO2020169182A1 (en) 2020-08-27

Family

ID=65516607

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/054078 WO2020169182A1 (en) 2019-02-19 2019-02-19 Method and apparatus for allocating tasks

Country Status (1)

Country Link
WO (1) WO2020169182A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5704012A (en) * 1993-10-08 1997-12-30 International Business Machines Corporation Adaptive resource allocation using neural networks
US20130346614A1 (en) * 2012-06-26 2013-12-26 International Business Machines Corporation Workload adaptive cloud computing resource allocation
US20140136710A1 (en) * 2012-11-15 2014-05-15 Red Hat Israel, Ltd. Hardware resource allocation and provisioning for composite applications
WO2018211139A1 (en) * 2017-05-19 2018-11-22 Deepmind Technologies Limited Training action selection neural networks using a differentiable credit function
US20190034227A1 (en) * 2017-07-26 2019-01-31 Bank Of America Corporation System and method of task execution



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19706567

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19706567

Country of ref document: EP

Kind code of ref document: A1