AU2013266988A1 - Application specific MPSoC synthesis using optimized code partitioning - Google Patents

Application specific MPSoC synthesis using optimized code partitioning

Info

Publication number
AU2013266988A1
Authority
AU
Australia
Prior art keywords
statements
statement
code
partitioned
processor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
AU2013266988A
Inventor
Jude Angelo Ambrose
Sridevan Parameswaran
Jorgen Peddersen
Yusuke Yachide
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to AU2013266988A priority Critical patent/AU2013266988A1/en
Publication of AU2013266988A1 publication Critical patent/AU2013266988A1/en
Abandoned legal-status Critical Current

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

Application Specific MPSoC Synthesis Using Optimised Code Partitioning

A method (1000) of modifying high level code (402) to generate partitioned code (1010) for a heterogeneous multicore system (103), the method comprising the steps of: determining (1001, 403) statement properties (404, 405, 406) of the statements of the high level code; constructing (1003, 407) from said statement properties a dependency graph (408, 310) comprising at least some of the statements (320) linked by edges (324) representing dependencies between the statements; determining (1002) performance properties (413) of the statements; mapping (1004) the statements in the dependency graph to cores of the multicore system according to the dependency graph and the performance properties to form a partitioned graph (1009); and generating (1005) from the partitioned graph the partitioned code for the heterogeneous multicore system.

Description

APPLICATION SPECIFIC MPSOC SYNTHESIS USING OPTIMIZED CODE PARTITIONING

TECHNICAL FIELD

[0001] The present invention relates to automation tools for designing digital hardware systems in the electronics industry and, in particular, to a method and apparatus for converting code written for one processing environment into code useful for another processing environment.

BACKGROUND

[0002] One goal, when developing a parallel processing system, is to be able to take legacy code, typically written to be executed on a single processor, and convert the code to execute on a machine or machines with multiple processors. Current systems are unable to automatically convert arbitrary legacy code to execute on a parallel machine. Most current attempts at parallelization are directed at large hardware architectures with identical processors that use shared memory. For example, machines like the six-processor Intel Xeon cores and the Niagara cores by Sun Microsystems are typical targets of parallelization.

[0003] The difficulty of programming such Multi-processor Systems on Chips (MPSoCs) to maximally exploit parallelism is well-known. Many computation-intensive applications, particularly streaming applications for signal processing and multimedia, often spend most of their runtime in control loops. In such applications, improving the throughput and latency of each iteration of the loop is critical.

[0004] Automatic parallelization of applications has re-emerged as an important technology with the advent of multicore (also referred to as multi-processor) architectures. Techniques exist for automatic parallelization of control loops at the level of iterations. Such techniques range from loop converters using polyhedral techniques, which automatically convert loops to create parallelism by rearranging the loop indices, to code conversion arrangements supporting OpenMP execution. OpenMP is an Application Programming Interface (API) which supports multithreading for shared memory multiprocessors. Explicitly tagging the source code with OpenMP APIs allows an operating system to parallelise threads to improve performance. The polyhedral approaches are limited to Affine Nested Loops, which are loops with self-contained variables and affine indices. Such techniques are not applicable to all general program structures. OpenMP execution requires an operating system, such as Windows, Mac OS X or Solaris, which can parallelize threads. OpenMP techniques further require a shared memory based architecture. OpenMP requires manual intervention to identify code segments in a legacy code which are amenable to parallelisation. Polyhedral approaches are typically limited to homogeneous MPSoC architectures.

SUMMARY

[0005] It is an object of the present invention to substantially overcome, or at least ameliorate, one or more disadvantages of existing arrangements.
[0006] Disclosed are arrangements, referred to as Dependency Based Partitioned Code (DBPC) arrangements, which seek to address the above problems by creating an MPSoC system (also referred to as a multicore system) from high level sequential legacy code by (a) performing statement-level analysis of code using a rule based dataflow analysis to parallelize control loops, (b) performing statement-level optimization for the synthesis of the MPSoC using an Integer Linear Programming (ILP) based optimizer to optimally map statements of the control loop to processors for the given constraints, and (c) performing heterogeneous optimized MPSoC synthesis by performing a traversal based code distribution to move code to the specific cores which utilize it based on the dependencies.

[0007] According to a first aspect of the present disclosure, there is provided a method of modifying high level code to generate partitioned code for a heterogeneous multicore system, the method comprising the steps of: determining statement properties of statements in the high level code; constructing from said statement properties a dependency graph comprising statements in a control loop in the high level code, said statements in the dependency graph being linked by edges representing dependencies between the statements; determining performance properties of the statements in the control loop; mapping the statements in the dependency graph to cores of the multicore system according to the dependency graph and the performance properties to form a partitioned graph; and generating from the partitioned graph the partitioned code for the heterogeneous multicore system.
[0008] According to another aspect of the present disclosure, there is provided an apparatus for implementing any one of the aforementioned methods.

[0009] According to another aspect of the present disclosure, there is provided a computer program product including a computer readable medium having recorded thereon a computer program for implementing any one of the methods described above.

[00010] Other aspects of the invention are also disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

[00011] One or more embodiments of the invention will now be described with reference to the following drawings, in which:

[00012] Fig. 1 depicts how code can be parallelized to execute on either a homogeneous MPSoC with shared memory or on a heterogeneous MPSoC with distributed memory;

[00013] Figs. 2A and 2B depict statement-level code partitioning and MPSoC synthesis;

[00014] Fig. 3 depicts the statement-level decomposition of an example;

[00015] Fig. 4 is a block diagram of an MPSoC system generation methodology according to one DBPC arrangement;

[00016] Fig. 5 is a partial RPDG of the code snippet shown in Fig. 1;

[00017] Fig. 6 depicts a process, represented by a code fragment, for producing an optimal MPSoC according to the disclosed DBPC approach;

[00018] Fig. 7 depicts a process, represented by a code fragment, for generating a RPDG;

[00019] Fig. 8 is a block diagram of a general purpose computer system upon which the disclosed DBPC arrangements can be practiced;

[00020] Fig. 9 depicts dependency links, referred to as "inter-" links, outside a procedure;

[00021] Fig. 10 is a flow chart of a typical process for performing the disclosed DBPC arrangement;

[00022] Fig. 11 is a flow chart of a typical process for performing the statement properties analyser process in the process of Fig. 10;

[00023] Fig. 12 is a flow chart of a typical process for performing the mapper process in the process of Fig. 10;

[00024] Fig. 13 is a flow chart of a typical process for performing the code generator process in the process of Fig. 10; and

[00025] Figs. 14A and 14B form a more detailed schematic block diagram of the general purpose computer system of Fig. 8.

DETAILED DESCRIPTION INCLUDING BEST MODE

[00026] Where reference is made in any one or more of the accompanying drawings to steps and/or features which have the same reference numerals, those steps and/or features have for the purposes of this description the same function(s) or operation(s), unless the contrary intention appears.

[00027] It is to be noted that the discussions contained in the "Background" section and that above relating to prior art arrangements relate to discussions of documents or devices which may form public knowledge through their respective publication and/or use. Such discussions should not be interpreted as a representation by the present inventor(s) or the patent applicant that such documents or devices in any way form part of the common general knowledge in the art.

Context

[00028] In order to create an MPSoC system (also referred to as a multicore system) from legacy code, three steps are necessary, these steps being (a) statement-level analysis of the legacy code, (b) statement-level optimization of the code for the synthesis of the MPSoC, and (c) heterogeneous optimized MPSoC synthesis. In the disclosed DBPC arrangements a statement-level analysis of the code uses a rule based dataflow analysis to parallelize control loops, as required by (a) above.
In regard to step (b), in the DBPC arrangements an Integer Linear Programming (ILP) based optimizer is used to optimally design a system for the given constraints such as latency, throughput, area, overall communication data size and code size. Finally, in regard to the MPSoC generation step (c) above, the DBPC arrangement performs a traversal based code distribution that moves code to the specific cores which utilize it based on the dependencies.

[00029] Fig. 8 depicts a high level functional block diagram of a system 800 used to generate a heterogeneous MPSoC with distributed memory 103 according to the disclosed DBPC arrangements. An element 801 is a CPU for controlling the entire system 800. An element 802 represents Read Only Memory (ROM) for storing the boot program/BIOS, and an element 803 represents Random Access Memory (RAM) which is utilized as a work area for the CPU 801 and for storing an operating system and the DBPC application. An element 804 is a hard disk drive for storing the DBPC software application for generating a heterogeneous MPSoC with distributed memory 103 according to the disclosed DBPC arrangements, and for storing various kinds of data. An element 805 and an element 806 are a keyboard and a mouse respectively, for providing a user interface. An element 807 is a display control device containing video memory and a display controller internally. A display 808 receives a video signal from the display control device 807 and displays it. An element 809 is an interface for communication with external devices. When the system 800 powers up, the CPU 801 executes the boot program which is stored in the ROM 802, and the operating system (OS) which is stored on the HDD 804 is loaded into the RAM 803. Then, the system 800 executes a DBPC software application 1433 (see Fig. 14A) to perform the DBPC method and to generate a heterogeneous MPSoC with distributed memory.

[00030] Figs. 14A and 14B depict the general-purpose computer system 800 in more detail.

[00031] As seen in Fig. 14A, the computer system 800 includes: a computer module 1401; input devices such as the keyboard 805, the mouse pointer device 806, a scanner 1426, a camera 1427, and a microphone 1480; and output devices including a printer 1415, the display device 808 and loudspeakers 1417. An external Modulator-Demodulator (Modem) transceiver device 1416 may be used by the computer module 1401 for communicating to and from a communications network 1420 via a connection 1421. The communications network 1420 may be a wide-area network (WAN), such as the Internet, a cellular telecommunications network, or a private WAN. Where the connection 1421 is a telephone line, the modem 1416 may be a traditional "dial-up" modem. Alternatively, where the connection 1421 is a high capacity (e.g., cable) connection, the modem 1416 may be a broadband modem. A wireless modem may also be used for wireless connection to the communications network 1420.

[00032] The computer module 1401 typically includes at least the one processor unit 801, and a memory unit 803. For example, the memory unit 803 may have semiconductor random access memory (RAM) and semiconductor read only memory (ROM).
The computer module 1401 also includes a number of input/output (I/O) interfaces including: an audio-video interface 807 that couples to the video display 808, loudspeakers 1417 and microphone 1480; an I/O interface 1413 that couples to the keyboard 805, mouse 806, scanner 1426, camera 1427 and optionally a joystick or other human interface device (not illustrated); and an interface 809 for the external modem 1416 and printer 1415. In some implementations, the modem 1416 may be incorporated within the computer module 1401, for example within the interface 809. The computer module 1401 also has a local network interface 1411, which permits coupling of the computer system 800 via a connection 1423 to a local-area communications network 1422, known as a Local Area Network (LAN). As illustrated in Fig. 14A, the local communications network 1422 may also couple to the wide network 1420 via a connection 1424, which would typically include a so-called "firewall" device or device of similar functionality. The local network interface 1411 may comprise an Ethernet circuit card, a Bluetooth® wireless arrangement or an IEEE 802.11 wireless arrangement; however, numerous other types of interfaces may be practiced for the interface 1411.

[00033] The I/O interfaces 809 and 1413 may afford either or both of serial and parallel connectivity, the former typically being implemented according to the Universal Serial Bus (USB) standards and having corresponding USB connectors (not illustrated). Storage devices 1409 are provided and typically include a hard disk drive (HDD) 804. Other storage devices such as a floppy disk drive and a magnetic tape drive (not illustrated) may also be used. An optical disk drive 1412 is typically provided to act as a non-volatile source of data. Portable memory devices, such as optical disks (e.g., CD-ROM, DVD, Blu-ray Disc™), USB-RAM, portable external hard drives, and floppy disks, for example, may be used as appropriate sources of data to the system 800.
[00034] The components 801 to 1413 of the computer module 1401 typically communicate via an interconnected bus 1404 and in a manner that results in a conventional mode of operation of the computer system 800 known to those in the relevant art. For example, the processor 801 is coupled to the system bus 1404 using a connection 1418. Likewise, the memory 803 and optical disk drive 1412 are coupled to the system bus 1404 by connections 1419. Examples of computers on which the described arrangements can be practised include IBM-PCs and compatibles, Sun Sparcstations, Apple Mac™ or like computer systems.

[00035] The DBPC methods may be implemented using the computer system 800, wherein the processes of Figs. 6, 7 and 10-13, to be described, may be implemented as one or more software application programs 1433 executable within the computer system 800. In particular, the steps of the DBPC method are effected by instructions 1431 (see Fig. 14B) in the software 1433 that are carried out within the computer system 800. The software instructions 1431 may be formed as one or more code modules, each for performing one or more particular tasks. The software may also be divided into two separate parts, in which a first part and the corresponding code modules perform the DBPC methods and a second part and the corresponding code modules manage a user interface between the first part and the user.

[00036] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 800 from the computer readable medium, and then executed by the computer system 800. A computer readable medium having such software or computer program recorded on the computer readable medium is a computer program product. The use of the computer program product in the computer system 800 preferably effects an advantageous apparatus for performing the DBPC methods.

[00037] The software 1433 is typically stored in the HDD 804 or the memory 803. The software is loaded into the computer system 800 from a computer readable medium, and executed by the computer system 800. Thus, for example, the software 1433 may be stored on an optically readable disk storage medium (e.g., CD-ROM) 1425 that is read by the optical disk drive 1412. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 800 preferably effects an apparatus for performing the DBPC methods.
[00038] In some instances, the application programs 1433 may be supplied to the user encoded on one or more CD-ROMs 1425 and read via the corresponding drive 1412, or alternatively may be read by the user from the networks 1420 or 1422. Still further, the software can also be loaded into the computer system 800 from other computer readable media. Computer readable storage media refers to any non-transitory tangible storage medium that provides recorded instructions and/or data to the computer system 800 for execution and/or processing. Examples of such storage media include floppy disks, magnetic tape, CD-ROM, DVD, Blu-ray™ Disc, a hard disk drive, a ROM or integrated circuit, USB memory, a magneto-optical disk, or a computer readable card such as a PCMCIA card and the like, whether or not such devices are internal or external of the computer module 1401. Examples of transitory or non-tangible computer readable transmission media that may also participate in the provision of software, application programs, instructions and/or data to the computer module 1401 include radio or infra-red transmission channels, as well as a network connection to another computer or networked device, and the Internet or Intranets including e-mail transmissions and information recorded on Websites and the like.

[00039] The second part of the application programs 1433 and the corresponding code modules mentioned above may be executed to implement one or more graphical user interfaces (GUIs) to be rendered or otherwise represented upon the display 808. Through manipulation of typically the keyboard 805 and the mouse 806, a user of the computer system 800 and the application may manipulate the interface in a functionally adaptable manner to provide controlling commands and/or input to the applications associated with the GUI(s). Other forms of functionally adaptable user interfaces may also be implemented, such as an audio interface utilizing speech prompts output via the loudspeakers 1417 and user voice commands input via the microphone 1480.

[00040] Fig. 14B is a detailed schematic block diagram of the processor 801 and a "memory" 1434. The memory 1434 represents a logical aggregation of all the memory modules (including the HDD 1409 and semiconductor memory 803) that can be accessed by the computer module 1401 in Fig. 14A.

[00041] When the computer module 1401 is initially powered up, a power-on self-test (POST) program 1450 executes. The POST program 1450 is typically stored in the ROM 802 of the semiconductor memory 803 of Fig. 14A. A hardware device such as the ROM 802 storing software is sometimes referred to as firmware. The POST program 1450 examines hardware within the computer module 1401 to ensure proper functioning, and typically checks the processor 801, the memory 1434 (1409, 803), and a basic input-output system software (BIOS) module 1451, also typically stored in the ROM 802, for correct operation. Once the POST program 1450 has run successfully, the BIOS 1451 activates the hard disk drive 804 of Fig. 14A. Activation of the hard disk drive 804 causes a bootstrap loader program 1452 that is resident on the hard disk drive 804 to execute via the processor 801. This loads an operating system 1453 into the RAM memory 803, upon which the operating system 1453 commences operation.
The operating system 1453 is a system level application, executable by the processor 801, to fulfil various high level functions, including processor management, memory management, device management, storage management, software application interface, and generic user interface.

[00042] The operating system 1453 manages the memory 1434 (1409, 803) to ensure that each process or application running on the computer module 1401 has sufficient memory in which to execute without colliding with memory allocated to another process. Furthermore, the different types of memory available in the system 800 of Fig. 14A must be used properly so that each process can run effectively. Accordingly, the aggregated memory 1434 is not intended to illustrate how particular segments of memory are allocated (unless otherwise stated), but rather to provide a general view of the memory accessible by the computer system 800 and how such is used.

[00043] As shown in Fig. 14B, the processor 801 includes a number of functional modules including a control unit 1439, an arithmetic logic unit (ALU) 1440, and a local or internal memory 1448, sometimes called a cache memory. The cache memory 1448 typically includes a number of storage registers 1444-1446 in a register section. One or more internal busses 1441 functionally interconnect these functional modules. The processor 801 typically also has one or more interfaces 1442 for communicating with external devices via the system bus 1404, using a connection 1418. The memory 1434 is coupled to the bus 1404 using a connection 1419.

[00044] The application program 1433 includes a sequence of instructions 1431 that may include conditional branch and loop instructions. The program 1433 may also include data 1432 which is used in execution of the program 1433. The instructions 1431 and the data 1432 are stored in memory locations 1428, 1429, 1430 and 1435, 1436, 1437, respectively. Depending upon the relative size of the instructions 1431 and the memory locations 1428-1430, a particular instruction may be stored in a single memory location as depicted by the instruction shown in the memory location 1430. Alternately, an instruction may be segmented into a number of parts each of which is stored in a separate memory location, as depicted by the instruction segments shown in the memory locations 1428 and 1429.

[00045] In general, the processor 801 is given a set of instructions which are executed therein. The processor 801 waits for a subsequent input, to which the processor 801 reacts by executing another set of instructions. Each input may be provided from one or more of a number of sources, including data generated by one or more of the input devices 805, 806, data received from an external source across one of the networks 1420, 1422, data retrieved from one of the storage devices 803, 1409, or data retrieved from a storage medium 1425 inserted into the corresponding reader 1412, all depicted in Fig. 14A. The execution of a set of the instructions may in some cases result in output of data. Execution may also involve storing data or variables to the memory 1434.

[00046] The disclosed DBPC arrangements use input variables 1454, which are stored in the memory 1434 in corresponding memory locations 1455, 1456, 1457. The DBPC arrangements produce output variables 1461, which are stored in the memory 1434 in corresponding memory locations 1462, 1463, 1464. Intermediate variables 1458 may be stored in memory locations 1459, 1460, 1466 and 1467.
[00047] Referring to the processor 801 of Fig. 14B, the registers 1444, 1445, 1446, the arithmetic logic unit (ALU) 1440, and the control unit 1439 work together to perform sequences of micro-operations needed to perform "fetch, decode, and execute" cycles for every instruction in the instruction set making up the program 1433. Each fetch, decode, and execute cycle comprises:

• a fetch operation, which fetches or reads an instruction 1431 from a memory location 1428, 1429, 1430;

• a decode operation in which the control unit 1439 determines which instruction has been fetched; and

• an execute operation in which the control unit 1439 and/or the ALU 1440 execute the instruction.

[00048] Thereafter, a further fetch, decode, and execute cycle for the next instruction may be executed. Similarly, a store cycle may be performed by which the control unit 1439 stores or writes a value to a memory location 1432.

[00049] Each step or sub-process in the DBPC processes of Figs. 6, 7 and 10-13 is associated with one or more segments of the program 1433 and is performed by the register section 1444, 1445, 1446, the ALU 1440, and the control unit 1439 in the processor 801 working together to perform the fetch, decode, and execute cycles for every instruction in the instruction set for the noted segments of the program 1433.

[00050] The DBPC method may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the DBPC functions or sub functions. Such dedicated hardware may include graphic processors, digital signal processors, or one or more microprocessors and associated memories.

Overview of the DBPC arrangement

[00051] Fig. 10 is a flow chart of a process 1000 for modifying code 402 to both generate code for an MPSoC and to generate the MPSoC itself using the disclosed DBPC approach. The method 1000 commences with a step 1001 that determines statement properties of the high level code 402, which is represented as a data flow graph of program points. The step 1001 generates a reduced program dependency graph 408. The high level code 402 is then profiled at a following step 1002 in a system simulation or emulation environment to determine performance properties of each statement. This step uses architecture level information (e.g., ISA type, such as ARM, X86, etc.), a profiler such as 411 (e.g., a system simulator), described hereinafter in more detail in relation to Fig. 4, and flavours (e.g., possible configurations for the processor type), including clock cycles and instructions for each statement on each processor flavour. The step 1002 then collates, by performing simulation or emulation based profiling of statements with a single processor, the performance properties of the statements of the high level code, to output collated performance properties 1007 of the statements. A following step 1003 forms a dependency graph 1008 by combining the collated performance properties 1007 from the step 1002 and the statement properties from the step 1001. At a following step 1004, statements from the formed dependency graph 1008 are mapped using statement-level optimisation techniques to create a partitioned graph 1009. A following step 1005 uses the partitioned graph 1009 to create partitioned code 1010 for the MPSoC.
A hybrid heterogeneous MPSoC 1011 is generated by a following step 1006, by instantiating and connecting processor units based on the mapping output 1009 from the step 1004 to create the MPSoC hardware, and then integrating the generated MPSoC code from the step 1005 into the MPSoC hardware.

[00052] Fig. 11 is a flow chart of a typical method that can be used to implement the step 1001 in Fig. 10. A step 1102 reads the high level code 402 and utilises a compiler front-end such as gcc or clang to generate an Abstract Syntax Tree (AST) 404, a control flow graph (CFG) 405 and a call graph (CG) 406, described hereinafter in more detail with reference to Fig. 4. A directed program dependency graph (PDG) 424 is created by combining the AST, the CFG and the CG in a following step 1103. A rule-based statement-level data flow analysis is performed on the PDG 424 in a following step 1104. In a following step 1105 the reduced program dependency graph (RPDG) 408 is created from the PDG 424; the RPDG is a reduced version of the PDG, containing only the statements and their dependency edges.

[00053] Fig. 12 is a flow chart of a typical method that can be used to implement the step 1004 in Fig. 10, to perform the statement level partitioning of the high level code. A step 1203 receives the reduced program dependency graph 408, which includes the collated performance properties 1007 from Fig. 10 that are generated in the step 1002, and the statement properties from the step 1001. A following step 1204 receives a specification of the optimization method (such as ILP), constraints (such as throughput, which is a requirement of the desired multicore system) and the details of the MPSoC (such as the number of processors, which is a constraint of the desired multicore system) required to perform MPSoC optimization. A following step 1205 performs the performance optimization (using ILP for example) of the performance annotated dependency graph 415 to form a mapping 1208 of the statements to the required processors, to satisfy the given constraints.

[00054] Fig. 13 is a flow chart of a typical method that can be used to implement the step 1005 in Fig. 10 to generate the partitioned code 1010 for the MPSoC. A step 1301 generates an optimized RPDG 1307, where the optimized RPDG 1307 includes the information indicating the statement to core and core to flavour mappings from the partitioned graph 1009. A following step 1302 receives a control flow graph (CFG) 405 of the high level code from the step 1102. A following step 1303 performs a bottom up traversal on the optimized RPDG 1307 to find the dependent statements 1309 of each loop statement. A following step 1304 creates dependency information 1310 for code generation. A following step 1305 performs a top down traversal of the CFG 405 together with the dependency information 1310, and a following step 1306 generates the partitioned code for the MPSoC.
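The flow of Figs. 10-13 can be summarised in code form. The following is a minimal, self-contained sketch of the DBPC pipeline (steps 1001-1006); all helper behaviour here is placeholder logic standing in for the real analyses (the round-robin mapper, for instance, stands in for the ILP optimizer of step 1004), and only the step structure follows the description above.

```python
# Illustrative skeleton of the DBPC flow of Fig. 10 (not the patented method itself).
from dataclasses import dataclass, field

@dataclass
class Rpdg:                              # stand-in for the RPDG 408
    statements: list
    edges: list                          # (src, dst) dependency edges
    perf: dict = field(default_factory=dict)   # statement -> {flavour: cycles}

def analyse_statements(code):            # step 1001: statement properties -> RPDG
    stmts = [ln.strip().rstrip(";") for ln in code.splitlines() if ln.strip()]
    edges = [(stmts[i], stmts[i + 1]) for i in range(len(stmts) - 1)]
    return Rpdg(stmts, edges)

def profile_statements(stmts, flavours): # step 1002: per-flavour cycle estimates
    return {s: {f: 1 for f in flavours} for s in stmts}   # placeholder: 1 cycle

def map_statements(rpdg, cores):         # step 1004: stand-in for the ILP mapper
    return {s: cores[i % len(cores)] for i, s in enumerate(rpdg.statements)}

def generate_code(mapping):              # step 1005: per-core partitioned code
    per_core = {}
    for stmt, core in mapping.items():
        per_core.setdefault(core, []).append(stmt)
    return per_core

rpdg = analyse_statements("h = 25;\nj = i + h;\nadd(i);")
rpdg.perf = profile_statements(rpdg.statements, ["base", "mul"])   # step 1003
print(generate_code(map_statements(rpdg, ["p1", "p2"])))
```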
[00055] As previously noted, in order to create an MPSoC system from legacy code, three steps are necessary, namely (a) statement-level analysis (see 1001 in Fig. 10) of code, (b) statement-level optimization (see 1301 in Fig. 13) for the synthesis of the MPSoC, and (c) synthesis (see 1006 in Fig. 10) of the heterogeneous optimized MPSoC. The DBPC arrangement performs statement-level analysis (see 1001 in Fig. 10) using a rule based dataflow analysis to parallelize control loops. In the second step (see 1301 in Fig. 13), a novel ILP based optimizer is used to optimally design a system for the given constraints, such as latency, throughput, area, overall communication data size and code size. Finally, the DBPC arrangement MPSoC generation step (see 1006 in Fig. 10) performs a traversal based code distribution. The DBPC arrangement code distribution moves code segments from the partitioned code 1010 to the specific corresponding cores which utilize them based on the dependencies, as described hereinafter in more detail with reference to an example in Figs. 2A and 2B.

Problem Formulation

[00056] A streaming application can be formulated as a directed graph G = (U, E), where U is a set of u vertices (also known as program points, where each program point refers to a unit of execution in the code and each statement can have multiple vertices) and E is a set of m edges between (i.e., linking) the vertices. G is commonly known as a Program Dependency Graph (such as 424). Fig. 3 shows in 310 an example of G, where a vertex 319 is also a single statement 308, whereas a statement 309 includes vertices 321, 325, 326 and 327. We term Gc ⊆ G the set of control loops of G, which are the targets for parallelization. For example, in Fig. 3 the vertices under the control loop vertex 316 are included in the Gc of 316. A 3-tuple (tp, sz, kn) is attributed to each communication edge Ej of a program dependency graph, the 3-tuple parameters being type, size and kind respectively. The edge type tp can be either control (tp = c) or data (tp = d), whereas the edge size sz is given in bytes and the edge kind kn is a primitive property such as char, int, long, etc. For example, in Fig. 3, an edge 322 is of type control and an edge 324 is of type data. The edge 324 is of kind integer, since it represents the statement 304, and its size is based on the intended architecture (i.e., sz = 4 bytes if the intended architecture is 32-bit).

[00057] Fig. 3 depicts an example 300 of code 301 and its statement-level decomposition 323, from the example code 301 to a program dependency graph G (ie 310) which is composed of multiple vertices/program points. Each vertex in the figure represents a statement or part of a statement of the code 301. For example, a vertex 319 represents all of the statement 308, whereas a vertex 321 is a part of the statement 309. The terms vertices, program points and nodes are used interchangeably in this specification, unless specifically noted to the contrary. It is worth noting that a statement such as the statement 309 of the example code 301 can contain multiple vertices, such as the vertices 321, 325, 326 and 327. The multiple vertices can, for example, include add(), 10, $res and k = i + h + $res vertices belonging to the same statement 309. Data dependency links (such as 324, shown in dotted lines) are created based on the dependencies in variables across the statements 320-321. Each vertex 311-321 in the dependency graph G (ie 310) is based on the statements 302-309 in the example code 301. For example, the j = i + h statement 308 depends on the h = 25 statement 304. Not all data dependency links are shown in Fig. 3, to avoid complexity. The size and type of the edge 322 connecting the main statement node 311 and the h = 25 statement node 320 are 4 bytes and int respectively. It is worth noting that the size property depends on the type of target processor.
[00058] Program statements related to the control loop portion of the code are categorized into two categories, namely (1) loop statements, which are within a control loop Gci; and (2) dependent (also referred to as dependency) statements, which are outside the control loop (including the start and end statements of the control loop, for example statement 305 in Fig. 3) and affect the loop statements. For example, the h = 25 statement 304 is a dependency statement, whereas the add(i) statement 306 is a loop statement in Fig. 3.

[00059] Each statement Sj ∈ S of a program dependency graph is associated with a 5-tuple (D, R, or, ld, U), where:

a) D is a set of declarators for the statement Sj (e.g., the j = i + h statement 308 in Fig. 3 will have declarators for variables j, i and h);

b) R is a set of statements which affect statement Sj data-wise (e.g., statement 304, h = 25, in Fig. 3 affects the data flow of statement 308, j = i + h).

[00060] Each variable is passed from a statement Rk to Sj via a respective dependency data edge of kind kn, Ek ∈ E.

c) or is the order of the statement Sj in the original execution flow (which is the sequence order in which each statement is executed in the code);

d) ld is the execution load of the statement Sj (either in cycles or instructions); and

e) kn is the kind of statement Sj, and is uniquely identified by combining the variable ID and the Rk statement ID.

[00061] The entire set of data dependent variables within the loop statements in G are placed into a set R. The order of Sj in the original execution flow is defined as or. The execution load of Sj, which is the amount of time a statement takes to execute in a given processor (either in cycles or instructions), is defined as ld. Vertices that are attached to a statement Sj are formulated as a set U. Each vertex (i.e., program point) Uz is associated with a "kind" kn.

[00062] For example, the j = i + h statement 308 is a statement of kind expression and statement 306, add(i), is a statement of kind call-site. A 3-tuple (tp, sz, sc) is attributed to each declarator Dz of the program dependency graph, where tp is the type, sz is the size and sc is the scope. Each data dependent variable Rk is assigned a size.

[00063] An MPSoC M is characterized as a 4-tuple (C, lo, lr, P), where C is a set of cores, lo is the local latency within a core, and lr is the remote latency outside the core. Each core c ∈ C can be configured from a set of configurations F (referred to as flavours, where each flavour f ∈ F has a unique hardware property for a core, such as the size of the cache, the inclusion of a multiplier and divider, the base processor requiring software implementations, etc.). Finally, P includes the rest of the properties of the MPSoC, such as hop count, additional latencies, etc., which are useful to compute performance.
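To make the formalism concrete, the following sketch encodes the tuples of paragraphs [00056]-[00063] as data structures. The field names follow the text; the class layout itself is an illustrative assumption, not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:                  # communication edge E_j with its 3-tuple (tp, sz, kn)
    tp: str                  # type: "control" (c) or "data" (d)
    sz: int                  # size in bytes (architecture dependent)
    kn: str                  # kind: primitive type, e.g. "char", "int", "long"

@dataclass
class Declarator:            # declarator D_z with its 3-tuple (tp, sz, sc)
    tp: str                  # type
    sz: int                  # size
    sc: str                  # scope

@dataclass
class Statement:             # statement S_j with its 5-tuple (D, R, or, ld, U)
    D: list                  # declarators
    R: list                  # statements affecting S_j data-wise
    order: int               # or: position in the original execution flow
    ld: int                  # execution load (cycles or instructions)
    U: list                  # vertices (program points), each with a kind kn

@dataclass
class MPSoC:                 # MPSoC M = (C, lo, lr, P)
    C: list                  # cores, each configurable with a flavour f in F
    lo: int                  # local latency within a core
    lr: int                  # remote latency outside the core
    P: dict = field(default_factory=dict)   # other properties (hop count, ...)
```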
[00064] The problem statement is as follows. Given source code for a sequential application (the source code being referred to as legacy code), a range for the number of cores (MINPROCS to MAXPROCS) and the list of flavours F, find an MPSoC M and a mapping from statements to cores, ensuring that the appropriate constraints are optimized and that the number of cores is between MINPROCS and MAXPROCS.

[00065] There are five appropriate constraints considered in the DBPC arrangements. They are: (1) area (AR), which is the total area in logic gates of the MPSoC, not including any memory; (2) latency (LA), which is the entire execution latency of the application running on the MPSoC; (3) throughput (T), which is the overall throughput of the dependency graph G; (4) communication (CO), which is the amount of communicated information/data between processors; and (5) code size (CS), which is the set of instructions used in each core, affecting the instruction memory size.

DBPC arrangement No. 1

Methodology

[00066] Embedded systems typically execute an application or a class of applications repeatedly. They are often designed specifically for these applications, with careful consideration given to constraints such as speed, area and cost. In particular, design is performed and optimized for a given set of constraints. Thus the application parallelization for such application specific systems can be performed with greater freedom than the parallelization required in larger general purpose machines. For example, in embedded systems it is possible to have a custom memory architecture in which the memory does not have to be shared. An interconnect, which is a communication architecture to connect processors and other peripherals such as the shared memory, can be specialized, and the processors themselves may be different from one another. Furthermore, customized architectures can support statement level parallelism in addition to iteration level parallelism.

[00067] Fig. 1 depicts how code 101 can be parallelized to execute on either a homogeneous MPSoC with shared memory or, using the disclosed DBPC method, on a heterogeneous MPSoC with distributed memory. For example, in Fig. 1(a) an example of sequential code 101 is given (this being a code segment from mpeg2 encoding) which contains a control loop 108 (i.e., a for loop) with multiple statements S3-S7. The sequential code 101 processes a sequence of four frames of video. There are no data dependencies across iterations in the statements S3 to S7, and therefore each iteration (i = 0, 1, 2, 3) can be executed in parallel. A (prior art) OpenMP execution of the for loop will schedule, as depicted by a dashed arrow 106, each iteration (from i = 0 to i = 3) to a thread (which will be mapped to a corresponding core such as 107 in a homogeneous MPSoC 102 with shared memory 109, as shown in Fig. 1(b)), hence parallelizing all the iterations to significantly reduce the latency of the entire for loop 108. A shared memory architecture is utilised to communicate data between threads/processors. The terms "core" and "processor" are used interchangeably in this specification unless noted to the contrary.

[00068] If, however, each statement S is scheduled to a heterogeneous MPSoC 103 instead of the homogeneous MPSoC 102, where the heterogeneous MPSoC has distributed memory such as 110 and the MPSoC 103 operates in a pipelined fashion (as shown in Fig. 1(c)), the performance (i.e., throughput) will improve overall.
Since each of the processors 111 is now customized (with custom instructions, smaller memories, point to point links, and with the right caches), the size of each processor (also referred to as a core) may be smaller as well. The system 103 can be further improved if performance is a critical constraint. To improve performance further, four copies 112 (i = 0, ..., 3) of the pipeline of the heterogeneous MPSoC 103 with distributed memory can be created as shown in Fig. 1(c) (with a total of 20 small yet highly customized processors 111). Other configurations (say with two processor pipelines) are also possible. Note that since the code is targeted with specific statements for specific processors (a main goal of this DBPC arrangement), the total memory footprint will also reduce, as well as minimizing memory contention.

[00069] At a rudimentary level, the throughput TP for the various architectures can be determined in the following manner for the for loop. Equation 1 is used for the TP calculations, where #frames refers to the output number of frames and L(Pc) defines the latency (in cycles or time units such as seconds) of the critical processor. This equation is used to find the average time (in cycles or time units such as seconds) an MPSoC will take to produce an output frame after receiving an input frame.

TP = #frames / L(Pc)    (1)

where TP is the throughput in frames per cycle or frames per time unit (where the time unit can be seconds), #frames is the number of output frames produced after the execution, and L(Pc) is the latency L, in clock cycles or time units, of the critical processor Pc in the MPSoC. A critical processor is defined as the processor in the MPSoC which incurs the highest execution time. TP is computed by dividing the number of output frames by the latency of the critical processor.

[00070] If there is only one processor 111 (and each statement takes 1 ms), then the throughput of the entire control loop 108 is 0.2 frames per millisecond (as shown in the table in Fig. 1 in the row 113 in the 2nd column). The architecture of the prior art homogeneous MPSoC 102 with shared memory shown in Fig. 1(b) has a throughput of 0.8 frames/ms (each statement takes 1 ms), as shown in the table in Fig. 1 in the row 114 in the 2nd column. If on the other hand the processing of the frames is pipelined in the heterogeneous MPSoC 103 with distributed memory, as shown in Fig. 1(c), with a single pipeline, and if each of the processors takes 1 ms per statement, then the throughput of the system is 1 frame/ms, as shown in the table in Fig. 1 in the row 115 in the 2nd column. If multiple pipelines are used, then the throughput may be doubled or even quadrupled, assuming that the communication times are absorbed into the processing times. Note that these throughputs are calculated ignoring statements S1 and S2, as these statements can change the timings significantly. If the processors are optimized using custom instructions and/or caches and the processing takes less time, as annotated in Fig. 1(c), the critical stage of the pipeline takes 0.5 ms. As a result the throughput is 2 frames/ms with one pipeline, as shown in the table in Fig. 1 in the row 116 in the 2nd column. If the number of pipelines is increased, the throughput increases accordingly.
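The throughput figures quoted for the table in Fig. 1 can be reproduced directly from Equation 1. The following is a small worked check; the per-stage latencies (1 ms and 0.5 ms) and frame counts are the illustrative values assumed in the text.

```python
# Worked check of Equation (1): TP = #frames / L(Pc).
def throughput(frames: int, critical_latency_ms: float) -> float:
    """Frames per millisecond, limited by the critical stage/processor."""
    return frames / critical_latency_ms

# Single processor: 5 statements (S3-S7) x 1 ms x 4 iterations = 20 ms for 4 frames.
print(throughput(4, 20))    # 0.2 frames/ms (row 113)
# Homogeneous shared-memory MPSoC: 4 iterations in parallel, 5 ms each.
print(throughput(4, 5))     # 0.8 frames/ms (row 114)
# Heterogeneous pipeline: critical stage 1 ms, one frame emitted per stage time.
print(throughput(1, 1))     # 1 frame/ms (row 115)
# Optimized pipeline: critical stage reduced to 0.5 ms.
print(throughput(1, 0.5))   # 2 frames/ms (row 116)
```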
[00071] Code and communication data size can be minimized by carefully mapping statements to the processors which need them.

[00072] Fig. 4 is a block diagram of a system 400 implementing a hybrid heterogeneous MPSoC system generation methodology according to one DBPC arrangement. The system 400 has three major sections, namely an analysis process 401, an optimization process 409 and a synthesis process 420. A rule-based statement level data flow analysis 407 is performed to create a Reduced Program Dependency Graph (RPDG) 408 while maintaining standard Program Dependency Graph (PDG) characteristics. The ILP-based statement level optimization process 416 identifies efficient partitioned code and an MPSoC configuration, subject to certain constraints such as code size, throughput, area, latency and communication. The multicore code generation process 421 generates the partitioned code 1010. The processor optimisation process 422 uses the partitioned code 1010 to synthesise the desired hybrid heterogeneous MPSoC 423.

Program Dependency Graph

[00073] A Program Dependency Graph (PDG), also known as a System Dependency Graph (SDG), combines both the control flow and the data flow between the program points/vertices of a program. A program point is a fragment of source code which is captured at an operational granularity, typically lower than statement granularity.

[00074] Fig. 9 depicts an example of a PDG, where each function is enclosed as a procedure with all program points inside the function. For example, main 901 and add 902 are procedures. An edge in the graph can indicate either a data dependency (depicted by a dotted arrow line, such as a link 903 for a variable e) or a control dependency (depicted by a straight arrow line, such as a control dependency 904 for a variable a). Dependency links outside a procedure are referred to as "inter-" links, such as a link 905 for a variable h in Fig. 9. Dependency links within a procedure are "intra-" links, such as the link 903. Each program point, i.e., a node in the graph such as the node 906 for the verify vertex, will have a kind. For example, the verify function call in the node 906 is of kind call-site, while the node "e=$res" 907 is of kind expression. A pointer, such as &h in a node 908 for the add function call node 909, accessed in a function is handled using "global-" vertices, such as a node 910 in Fig. 9 for a variable h (globals, such as extern and static type variables, will also create "global-" vertices similar to pointers; these global cases are not shown in Fig. 9 to avoid complexity). A global-actual-in vertex, such as 910, indicates a global variable at 908 for a variable h being passed within a function at the call-site, whereas a global-actual-out, such as 911 for a variable a, is a vertex within a procedure where a pointer variable is modified. Backward vertices (i.e., sources) of a vertex within a procedure, such as the add vertex node 909 for the h vertex node 910 within the main procedure 912 in Fig. 9, are referred to as intra-source vertices. For example, the add vertex node 909 is the intra-source vertex of the h vertex node 910 in Fig. 9.

Statement-level Data Flow Analysis 407

[00075] Returning to Fig. 4, the analysis section 401 reads the legacy sequential code 402 and then utilizes a compiler front-end 403, such as gcc or clang, to convert the code to an Abstract Syntax Tree (AST) 404, a Control Flow Graph (CFG) 405 and a Call Graph (CG) 406, the AST, the CFG and the CG constituting the dependencies between the program points of the code 402. The directed graph G (such as 310, i.e., a PDG) is constructed by combining these three graphs 404, 405, 406.
A rule-based statement level data flow analysis 407 analyses the combined graph of a control loop Gc, which is a subset of the combined graph of 404, 405, 406 containing only the nodes related to the control loop, as described hereinafter in more detail with reference to Fig. 7, to create an RPDG 408. The RPDG 408, described hereinafter in more detail with reference to Fig. 5, is a reduced version of a PDG, revealing only the loop statements and their dependent statements and the data/control dependencies between them in terms of variables.

[00076] Fig. 5 depicts a partial RPDG 500 of the code snippet 101 shown in Fig. 1. As shown, the RPDG 500 contains only the valid dependencies of variables between the loop statements 502-508, as well as properties 516-527 of the statements 502-509, as discussed below. For example, the vertex S3 (ie 504) has a one way data dependency from the node S2 (ie 503) via a variable fr (ie 511), which has a type 517 of character pointer and a size 518 of 100 bytes, with a kind 516 of d.

[00077] Fig. 7 depicts a process 700, represented by a code fragment, for generating the RPDG 408 in Fig. 4. The process 700 performs a rule-based traversal of the set of control loops, each defined as Gci (an element of the set Gc) constituting the PDG 310, to identify dependency links between statements in terms of variables. The code shown in Fig. 7 defines the rule-based traversal, where each loop statement is analysed for dependency on other statements based on the types of the vertices it is composed of, each type category being considered as a rule. After the control code segment Gci is identified in line 2 (e.g., the for statement which has to be parallelised), loop statements are traversed using the control edges of the vertices in the order of the control flow. It is worth noting that there may be more than one vertex for each statement. Initially, each vertex in a loop statement is identified as either a call-site (Rule 1) or an expression (Rule 2). Any other types of statement, such as a control statement, are captured as special cases. The actuals are extracted if the vertex is a call-site, and separated into actual-in (Rule 1.1) or global-actual-in (Rule 1.2). Intra-source vertices are then traversed for each actual vertex. If the traversal reaches either an expression (Rule 1.1.1) or a global-actual-out (Rule 1.1.2) vertex, a link is established between r and s in the RPDG. It is worth noting that the actual-out is included as global-actual-out as well, to reduce complexity. In the case of vertices in the loop statements being expressions (Rule 2), the vertices are first analysed for non-killed variables (i.e., variables which are only used and not killed, or variables which are used first and then killed). For each non-killed variable, the intra-source vertex is analysed for its type. If the intra-source vertex is either an expression (Rule 2.1) or a global-actual-out (Rule 2.2), a link is established in the RPDG.
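The rule structure of Fig. 7 can be summarised as follows. This sketch assumes a simple dict-based vertex representation (the patent does not prescribe one); only the rule numbering and the link-creation logic follow the description in [00077].

```python
# Condensed sketch of the rule-based RPDG link derivation of Fig. 7.
def rpdg_links(loop_statements):
    links = set()
    for stmt in loop_statements:
        for v in stmt["vertices"]:
            if v["kind"] == "call-site":                        # Rule 1
                for a in v["actuals"]:                          # Rules 1.1, 1.2
                    for src in a["sources"]:                    # intra-source walk
                        if src["kind"] in ("expression",        # Rule 1.1.1
                                           "global-actual-out"):  # Rule 1.1.2
                            links.add((src["stmt"], stmt["id"], a["var"]))
            elif v["kind"] == "expression":                     # Rule 2
                for var in v["non_killed"]:                     # used, not killed
                    src = var["source"]
                    if src["kind"] in ("expression",            # Rule 2.1
                                       "global-actual-out"):    # Rule 2.2
                        links.add((src["stmt"], stmt["id"], var["name"]))
    return links
```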
Statement-level Optimisation 416

[00078] Returning to Fig. 4, the optimization section 409 receives architecture information 410 (e.g., the ISA type, such as ARM, X86, etc.), the profiler 411 (e.g., a system simulator, which enables functional simulation of the hardware components in the system with given test vectors) and flavours 412 (e.g., possible configurations for the processor type). A statement-level profiling process 413 is performed on the legacy code 402, and the output of the statement-level profiling process 413 is used to create a profile annotated RPDG 415 using a profile annotation process 414. The profile information of the annotated RPDG 415, produced by the profile annotation process 414 and referred to as the performance properties of the statements, includes clock cycle and instruction estimates (either a worst case or an average case, based on the goal) for each statement on each processor flavour set out in the flavours 412. Next, a statement-level optimization process 416 processes the profile annotated RPDG 415, using an ILP approach described hereinafter in more detail with reference to sections (A)-(G) below, to produce the partitioned code 1010. The statement-level optimization process 416 takes as input the optimization method constraints 417 and the details of the MPSoC 419.

A. ILP formulation

[00079] The ILP formulation process in the statement-level optimization process 416 in Fig. 4 maps statements to processors/cores, where each processor will be of one flavour based on the statements mapped to it. Dependencies between variables are considered across statements to perform an optimal/efficient mapping based on various synthesis constraints such as throughput, latency, area, code size and communication. Equation 2 depicts a decision variable x_{s,p} used to map statements to processors, indicating which statements map to which processors.
x_{s,p} = 1 if statement s is mapped to processor p, 0 otherwise    (2)

where x_{s,p} is a binary variable which will be 1 when statement s is mapped to processor p, and 0 if statement s is not mapped to processor p.

[00080] A constraint in Equation 3 enforces the condition that every statement is mapped to a processor. The sum of the x_{s,p} variables for each statement over all processors has to be equal to 1, which makes sure that a statement is mapped to one and only one processor.

∀s ∈ S : Σ_{p∈C} x_{s,p} = 1    (3)

where s is a statement in the set S of all loop statements of a control loop and p is a processor in a set C of processors.

[00081] To minimize the communication cost, the dependency between processors is formulated as shown in Equation 4, where the binary variable d_{l,m} is set to 1 when processor m directly depends on processor l via dependencies between the statements which are mapped onto those processors. d_{l,m} will be set to 0 if processor m does not have a direct dependency on processor l.

d_{l,m} = 1 if processor m directly depends on processor l, 0 otherwise    (4)

where d_{l,m} is a binary variable and l and m are processors. The formulation used for Equation 4 changes dynamically, based on which loop statements map to which processors, since it is the statements which reveal the actual dependencies and not the processors. As a result the dependencies may be re-formulated as shown in Equation 5. For example, suppose there are loop statements s1, s2, s3 and s4 which have data dependencies from s1 to s2 and from s3 to s4, and processors p1 and p2. Mapping statements s1 and s3 to p1, and s2 and s4 to p2, will make d_{p1,p2} = 1, since statements s2 and s4 depend on statements s1 and s3 respectively. Mapping s1 and s2 to processor p1 and statements s3 and s4 to processor p2 will make d_{p1,p2} = 0, since the statement pairs s1, s2 and s3, s4 do not depend on statements on the other processor.

∀l ∈ C : ∀m ∈ C : l ≠ m : ∀n ∈ S : ∀o ∈ S : n ≠ o : d_{l,m} ≥ E_{n,o}^{kn=d} · (x_{n,l} ∧ x_{o,m})    (5)

where the ∧ operator is linearized as formulated in Equation 6 to comply with the constraints of integer linear programming (ILP); such a condition is not directly supported by the ILP formulation, hence an extended formulation is used. Here l and m are processors in the processor set C, where l should not be equal to m. Similarly, n and o are statements in the statement set S, and n should not be equal to o. E_{n,o}^{kn=d} indicates whether there is a data dependency edge from statement n to statement o. x_{n,l} and x_{o,m} reveal whether statement n is mapped to processor l and statement o is mapped to processor m, respectively.

z = (x ∧ y) ∈ {0, 1} : z ≥ x + y − 1, z ≤ x, z ≤ y    (6)

where x and y are conditions and (x ∧ y) produces a binary output z indicating whether conditions x and y are both valid (z = 1) or not both valid (z = 0).

[00082] Each core l ∈ C can be configured differently to create a flavour, which allows optimization based on area, code size, communication latency and throughput. Different flavours of processors may be achieved by changing hardware properties of a processor, such as caches and accelerators. Each statement n ∈ S will incur a distinctive performance penalty when n is mapped to a core l ∈ C which is configured with the flavour f ∈ F. As a result it is possible to identify the execution cost of a statement based on the flavour of the processor the statement is mapped to. COST_{s,f,e(cc)} and COST_{s,f,e(inst)}, representing the execution cost of each statement in cycles and in instructions respectively, are two-dimensional vectors which are populated by profiling each statement of the code with the available processor flavours. A significantly large value (BIGCONST) is set in the vectors for cases when statements are not supported by certain configurations (e.g., a multiplication statement will not be supported by a flavour which does not include a multiplier). This enables the ILP process to avoid invalid cases by heavily penalising any invalid cases in the cost function.
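Equations 2-6 can be exercised with an off-the-shelf ILP solver. The following is a minimal sketch using the open-source PuLP package; the statement set, dependency edges and communication-only objective are illustrative assumptions (the full formulation would add the cost, throughput and area terms of Equations 7 onward, without which the solver may trivially place every statement on one core).

```python
import pulp

S = ["s1", "s2", "s3", "s4"]        # loop statements
C = ["p1", "p2"]                    # processors/cores
dep = [("s1", "s2"), ("s3", "s4")]  # data dependency edges (kn = d)

prob = pulp.LpProblem("dbpc_mapping", pulp.LpMinimize)

# Equation (2): binary mapping variables x[s][p].
x = pulp.LpVariable.dicts("x", (S, C), cat="Binary")
# Equation (4): binary inter-processor dependency variables d[l][m].
d = pulp.LpVariable.dicts("d", (C, C), cat="Binary")

# Objective: minimise inter-processor communication links (the CO
# constraint); latency/area/code-size terms would be added here.
prob += pulp.lpSum(d[l][m] for l in C for m in C if l != m)

# Equation (3): every statement maps to exactly one processor.
for s in S:
    prob += pulp.lpSum(x[s][p] for p in C) == 1

# Equations (5)+(6): d[l][m] >= x[n][l] AND x[o][m] for each data edge
# (n, o), with the AND linearised as z >= x + y - 1; Equation 6's upper
# bounds are unnecessary here because d only needs to cover the product.
for n, o in dep:
    for l in C:
        for m in C:
            if l != m:
                prob += d[l][m] >= x[n][l] + x[o][m] - 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for s in S:
    print(s, "->", [p for p in C if x[s][p].value() == 1][0])
```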
$COST_{s,f,e(cc)}$ and $COST_{s,f,e(inst)}$, representing the execution cost of each statement in cycles and in instructions respectively, are two-dimensional vectors which are populated by profiling each statement of the code on the available processor flavours. A significantly large value (BIGCONST) is set in the vectors for cases where a statement is not supported by a certain configuration (e.g., a multiplication statement will not be supported by a flavour which does not include a multiplier). This enables the ILP process to avoid invalid cases by heavily penalising them in the cost function.

[00083] The execution cost of each processor is constrained in Equation 7, which computes the execution cost of each processor based on the statement-to-processor and processor-to-flavour mappings. The execution cost of processor $p$, $execcost_p$, is computed as the sum of the execution costs (in cycles) of statements based on their mapped flavours. Clock cycle information of each statement, such as the execution cost $COST_{s,f,e(cc)}$, is extracted from profiling as a performance property. The formulation $x_{s,p} \land y_{p,f}$ constrains the mapping of statement $s$ to processor $p$, and of processor $p$ to flavour $f$, where $y_{p,f}$ is a decision variable, as shown in Equation 8, mapping a processor to a flavour; $y_{p,f}=1$ indicates that processor $p$ is implemented as flavour $f$.

$$\forall p \in C : execcost_p \geq \sum_{s \in S} \sum_{f \in F} (x_{s,p} \land y_{p,f}) \cdot COST_{s,f,e(cc)} \qquad (7)$$

$$y_{p,f} = \begin{cases} 1, & \text{if processor } p \text{ is mapped to flavour } f,\\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

where $p$ is a processor in a processor set $C$, $s$ is a statement in a statement set $S$, $f$ is a flavour in a flavour set $F$, and $COST_{s,f,e(cc)}$ is the execution cost in clock cycles when statement $s$ is mapped to flavour $f$. This information is obtained through profiling in 414.
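Continuing the sketch from section A (the names prob, x, S, C and lp_and are assumed from there), Equations 7 and 8 add flavour decision variables and a per-processor execution-cost bound. The flavour list and cycle counts below are hypothetical.

```python
from pulp import LpVariable, lpSum, LpBinary, LpContinuous

F = ["base", "with_mul"]                                # flavours (hypothetical)
CYCLES = {s: {"base": 40, "with_mul": 30} for s in S}   # COST_{s,f,e(cc)}

# Equation 8: y[p, f] = 1 iff processor p is implemented as flavour f;
# each processor takes exactly one flavour.
y = LpVariable.dicts("y", [(p, f) for p in C for f in F], cat=LpBinary)
for p in C:
    prob += lpSum(y[(p, f)] for f in F) == 1

# Equation 7: execution cost of each processor, using the AND helper
# of Equation 6 to linearize x[s, p] AND y[p, f].
exec_cost = LpVariable.dicts("exec_cost", C, lowBound=0, cat=LpContinuous)
for p in C:
    terms = []
    for s in S:
        for f in F:
            z = lp_and(prob, x[(s, p)], y[(p, f)], f"xy_{s}_{p}_{f}")
            terms.append(z * CYCLES[s][f])   # cycles from the profile table
    prob += exec_cost[p] >= lpSum(terms)
```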
B. Throughput Formulation

[00084] The latency of each core is constrained to optimize the entire MPSoC for throughput. The throughput of a streaming pipeline (i.e., a pipeline processor) is limited by the processor which has the largest execution load. It is worth noting that the execution load of a processor, where execution load refers to the amount of time in cycles a processor takes to execute the mapped statements, includes both reading input data from FIFOs and writing output data to FIFOs, as well as the execution time of the statements. However, the latency of each core has to be constructed using the behaviour of the statements, in order to make sure that the throughput is maximized for a pipeline structure. The extensive throughput formulation of the ILP process in the statement-level optimization process 416 supports multi-path optimizations for pipelines, allowing creation of hybrid merges, where a hybrid merge refers to having a different number of merged statements across paths as shown at 2A05 in Fig. 2A, and supporting a highly customized MPSoC synthesis, as described with reference to Figs. 2A and 2B. The formulation of the ILP determines a maximal latency, i.e., the latest time a statement could finish its execution, and a minimal latency, i.e., the earliest time a statement could start its execution, of each statement. The maximum and minimum latencies of each statement are then optimized to find the minimal latency, i.e., maximal throughput, of the entire pipeline.

[00085] An accumulated latency component for each statement, $accumcost_{T,n}$, is an aggregation of statement latencies via the communication links. Equations 9, 10 and 12 depict three different cases for the accumulated latency $accumcost_{T,n}$ used to find its maximum possible value: a lower bound from the statement's own execution cost; the case when two statements are mapped to different cores; and the case when two statements are mapped to the same core. Equation 9, which allows the ILP formulation to estimate the throughput of the MPSoC by providing a lower bound for each statement as its execution cost in clock cycles without the communication cost, forces the accumulated latency $accumcost_{T,s}$ of statement $s$ to be at least the execution cost of $s$. Equation 10 computes the accumulated latency when two statements are mapped to separate cores, considering the read cost. As shown in Equation 10, if the communicating statements $n$ and $o$ (data flow from $n$ to $o$, where $n \neq o$) are mapped to two different processors $l$ and $m$ ($l \neq m$) respectively, the accumulated latency $accumcost_{T,o}$ of the receiving statement $o$ also includes the read cost between $n$ and $o$. The cost $COST_{n,o,r(cc)}$ is a static vector holding the read communication cost between statements $n$ and $o$ in clock cycles; it is defined in Equation 11 and assists in finding the accumulated cost $accumcost_{T,o}$.

$$\forall s \in S,\ \forall f \in F,\ \forall p \in C: \quad accumcost_{T,s} \geq (x_{s,p} \land y_{p,f}) \cdot COST_{s,f,e(cc)} \qquad (9)$$

$$\forall n \in S,\ \forall o \in S,\ n \neq o,\ \forall f \in F,\ \forall l \in C,\ \forall m \in C,\ l \neq m:$$
$$accumcost_{T,o} \geq (x_{o,m} \land y_{m,f}) \cdot COST_{o,f,e(cc)} + accumcost_{T,n} \cdot E_{kn=d,\{n,o\}} + COST_{n,o,r(cc)} \cdot (x_{n,l} \land x_{o,m}) \qquad (10)$$

where $s$ is a statement in statement set $S$, $f$ is a flavour within a set of flavours $F$, and $p$ is a processor within a set of processors $C$. $x_{s,p}$ is a binary variable: $x_{s,p}=1$ means that statement $s$ is mapped to processor $p$ and $x_{s,p}=0$ means that it is not. $y_{p,f}$ is a binary variable: $y_{p,f}=1$ indicates that processor $p$ is mapped to flavour $f$ and $y_{p,f}=0$ indicates that it is not. $COST_{s,f,e(cc)}$ is the execution cost in clock cycles of statement $s$ mapped on flavour $f$. The $\land$ symbol denotes the and condition (e.g., $x_{s,p} \land y_{p,f}$ formulates the condition of statement $s$ mapped to processor $p$ and processor $p$ mapped to flavour $f$). $E_{kn=d,\{n,o\}}$ is a binary variable with a value of 1 showing the existence of a data edge ($kn = d$, kind = data) between statements $n$ and $o$. $accumcost_{T,n}$ denotes the accumulated latency of statement $n$.

[00086] All the variables passed to $o$ from the processors of the dependent statements are accumulated with their cost, $COST_v$, in words to be transferred. For example, a 32-bit (4-byte) communication bus for FIFO transfers will use one word to send a variable of type char or int, so both char and int have $COST_v = 1$; likewise, a heap variable of 9 bytes has $COST_v = 3$. A read latency value, READ_LATENCY, is assumed to be constant since the target architecture uses point-to-point FIFO based communication. Merging of statements on a core will affect the variables being passed between the statements. Duplication of variables in a processor should be avoided to maintain practicality. For example, if statement $n$ passes a variable $v$ to statements $o$ and $q$, then mapping $n$, $o$ and $q$ to separate cores $l$, $m$ and $r$ respectively will result in core $l$ passing $v$ to cores $m$ and $r$ via separate FIFOs. However, if statements $o$ and $q$ are mapped to a single core $m$ and statement $n$ is mapped to core $l$, then $l$ should send $v$ only once to core $m$ via a single FIFO.
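The word-based cost $COST_v$ described above is a straightforward rounding-up of the variable size to bus words. A small, self-contained sketch, assuming the 32-bit FIFO bus of the example:

```python
import math

WORD_BYTES = 4  # 32-bit FIFO communication bus

def cost_in_words(size_bytes: int) -> int:
    """Size of a variable in bus words (COST_v)."""
    return math.ceil(size_bytes / WORD_BYTES)

assert cost_in_words(1) == 1   # char -> one word
assert cost_in_words(4) == 1   # int  -> one word
assert cost_in_words(9) == 3   # 9-byte heap variable -> three words
```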
In order to perform this de-duplication optimization, a variable is uniquely identified by an associated variable ID together with the statement ID of the statement which the variable is sent from (the src statement, which is $n$ in this example). The set of such variables is formulated as $V$. Two binary vectors, $VARDST_{v,o}$ and $VARSRC_{v,n}$, represent passing a variable to a destination and from a source respectively, where $v \in V$ and $n, o \in S$. When $VARDST_{v,o} = 1$, variable $v$ is passed to statement $o$, whereas when $VARSRC_{v,n} = 1$, variable $v$ is passed from statement $n$; that is, $n$ is the src statement and $o$ is the dst statement for variable $v$.

[00087] Equation 11 finds the read cost, which is the number of clock cycles needed to read the communicating variables from the First-In-First-Out buffers (FIFOs) between statements $n$ and $o$. The variable cost $COST_v$ is added for every dependency link between $n$ and $o$ by multiplying $VARDST_{v,o}$ and $VARSRC_{v,n}$. READ_LATENCY is a fixed constant giving the time to read a word from a FIFO into the processor.

$$COST_{n,o,r(cc)} = \sum_{v \in V} E_{kn=d,\{n,o\}} \cdot VARDST_{v,o} \cdot VARSRC_{v,n} \cdot COST_v \cdot READ\_LATENCY \qquad (11)$$

where $n$ and $o$ are statements within the statement set $S$, and $E_{kn=d,\{n,o\}}$ refers to a binary variable with a value of 1 showing the existence of a data edge ($kn = d$, kind = data) between statements $n$ and $o$. $VARDST_{v,o}$ and $VARSRC_{v,n}$ are binary variables: $VARDST_{v,o}=1$ denotes that statement $o$ receives variable $v$ from other statements, and $VARSRC_{v,n}=1$ denotes that statement $n$ sends variable $v$ to other statements. $COST_v$ denotes the size of variable $v$ in words, where the word size is determined by the width of the FIFO communication bus (for example, a 32-bit or 4-byte FIFO communication bus will use one word to send a variable of type char or integer). READ_LATENCY is a constant, referring to the time in clock cycles for the processor to read a single word from a FIFO.

[00088] Equation 12 defines the accumulated cost of statement $o$, $accumcost_{T,o}$, for the case where statements $n$ and $o$ are mapped to the same processor $p$. The constant BIGCONST is used to deactivate the constraint when statements $n$ and $o$ are mapped to two different processors. It is worth noting that $n$ and $o$ do not necessarily have to communicate, but could simply be mapped to the same processor.

$$\forall n \in S,\ \forall o \in S,\ \forall p \in C,\ \forall f \in F,\ n \neq o:$$
$$accumcost_{T,o} \geq (x_{o,p} \land y_{p,f}) \cdot COST_{o,f,e(cc)} + accumcost_{T,n} - BIGCONST + BIGCONST \cdot (x_{o,p} \land x_{n,p}) \qquad (12)$$

where $n$ and $o$ are different statements ($n \neq o$) within a statement set $S$, $p$ is a processor within a set of processors $C$, and $f$ is a flavour in a set of flavours $F$. $(x_{o,p} \land y_{p,f})$ is an and condition (operator $\land$) covering the case where statement $o$ is mapped to processor $p$ and processor $p$ is mapped to flavour $f$; similarly, $x_{o,p} \land x_{n,p}$ is an and operation forcing statements $o$ and $n$ to be mapped to the same processor $p$. $accumcost_{T,n}$ denotes the accumulated latency of statement $n$. BIGCONST is a significantly large constant: when $(x_{o,p} \land x_{n,p})=1$ the two BIGCONST terms cancel and the constraint is active, but when $(x_{o,p} \land x_{n,p})=0$ the right-hand side becomes a significantly large negative value, so the constraint places no effective bound on $accumcost_{T,o}$. The above statement-based accumulated latency cost may then be converted to the effective latency of each processor, considering the statements mapped to each processor, in order to find the optimal throughput of the pipeline.
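The BIGCONST device in Equation 12 is the classic big-M trick: the constraint only binds when the $\land$ term equals 1. A sketch of that single constraint, assuming the names (prob, x, y, S, C, F, CYCLES, lp_and) from the earlier sketches; the companion constraints of Equations 9-11 would be added analogously.

```python
from pulp import LpVariable, LpContinuous

BIGCONST = 10**6
accum = LpVariable.dicts("accum", S, lowBound=0, cat=LpContinuous)

# Equation 12: when n and o share a processor p, o's accumulated latency
# follows n's without any FIFO read cost. The BIGCONST terms deactivate
# the constraint whenever x[o,p] AND x[n,p] is 0 (big-M construction).
for n in S:
    for o in S:
        if n == o:
            continue
        for p in C:
            same = lp_and(prob, x[(o, p)], x[(n, p)], f"same_{n}_{o}_{p}")
            for f in F:
                o_on_f = lp_and(prob, x[(o, p)], y[(p, f)],
                                f"of_{n}_{o}_{p}_{f}")
                prob += (accum[o] >= o_on_f * CYCLES[o][f] + accum[n]
                         - BIGCONST + BIGCONST * same)
```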
As mentioned above, a maximal latency and a minimal latency are formulated for each processor, as shown in Equations 13 and 14 respectively. Equation 13 computes the latest possible time a processor can finish (the latest time is defined as the moment after writing all the output data to other processors), and Equation 14 finds the earliest time a processor can start (the earliest time is defined as the moment before the processor starts reading the input data from FIFOs). The maximum latency, $max\_lat_p$, of each processor captures the latest possible time the processor will finish its execution, whereas the minimum latency, $min\_lat_p$, captures the earliest possible starting time of the processor's execution. Since the accumulated latency cost $accumcost_{T,n}$ in Equation 10 already includes the read cost for each statement, the write cost of each statement is added to the maximum latency, as shown in Equation 13.

$$\forall p \in C,\ \forall n \in S: \quad max\_lat_p \geq accumcost_{T,n} + writecost_n - BIGCONST + BIGCONST \cdot x_{n,p} \qquad (13)$$

where $max\_lat_p$ will contain the maximal possible finish latency of processor $p$, which is an element of the processor set $C$, $accumcost_{T,n}$ is the accumulated latency in clock cycles of statement $n$, which is an element of statement set $S$, and $writecost_n$ is the writing cost of statement $n$ in cycles. BIGCONST is a significantly large constant which negates the effect on the maximal latency when statement $n$ is not mapped to processor $p$.

[00089] The write cost, however, must ensure that variables are not duplicated, as mentioned above. Hence the write cost of each statement is formulated as depicted in Equation 15, which finds the writing communication cost for sending to FIFOs the variables which are transferred outside the statement. A write latency, WRITE_LATENCY, is assumed to be constant, similar to the read latency READ_LATENCY. The minimum latency formulation in Equation 14 deducts the read cost of statement $n$ (as formulated in Equation 16) and the execution cost of statement $n$, $COST_{n,f,e(cc)}$, to find the earliest starting time of statement $n$.

$$\forall p \in C,\ \forall n \in S,\ \forall f \in F: \quad min\_lat_p \leq accumcost_{T,n} - readcost_n - (x_{n,p} \land y_{p,f}) \cdot COST_{n,f,e(cc)} + BIGCONST - BIGCONST \cdot x_{n,p} \qquad (14)$$

where $min\_lat_p$ is the earliest accumulated latency (in clock cycles) at which the processor could start, $accumcost_{T,n}$ is the accumulated latency (in clock cycles) of statement $n$, $readcost_n$ is the reading cost in clock cycles to read the input data from FIFOs, and $COST_{n,f,e(cc)}$ is the execution time in clock cycles when statement $n$ is mapped to flavour $f$. BIGCONST is a significantly large value which ensures $min\_lat_p$ is unaffected when statement $n$ is not mapped to processor $p$.

$$\forall n \in S,\ \forall o \in S,\ n \neq o,\ \forall l \in C,\ \forall m \in C,\ l \neq m:$$
$$writecost_n \geq \sum_{v \in V} (x_{n,l} \land x_{o,m}) \cdot VARDST_{v,o} \cdot VARSRC_{v,n} \cdot COST_v \cdot WRITE\_LATENCY \qquad (15)$$

where $writecost_n$ is the writing cost (in cycles) of statement $n$. $VARDST_{v,o}$ and $VARSRC_{v,n}$ are binary variables: $VARDST_{v,o}=1$ denotes that statement $o$ receives variable $v$ from other statements, and $VARSRC_{v,n}=1$ denotes that statement $n$ sends variable $v$ to other statements. $COST_v$ denotes the size of variable $v$ in words, where the word size is determined by the width of the FIFO communication bus (for example, a 32-bit or 4-byte FIFO communication bus will use one word to send a variable of type char or integer). WRITE_LATENCY is a constant, referring to the time in clock cycles to write a single word from the processor to an output FIFO.
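Equations 13 and 14 use the same big-M pattern to tie each processor's latest finish and earliest start to the statements mapped onto it. A sketch under the same assumptions as the earlier fragments (prob, x, y, S, C, F, CYCLES, BIGCONST, accum, lp_and); for brevity, the per-statement FIFO write/read costs of Equations 15 and 16 are stood in by fixed placeholder values here.

```python
from pulp import LpVariable, LpContinuous

write_cost = {n: 2 for n in S}    # placeholder for writecost_n (Eq. 15)
read_cost = {n: 2 for n in S}     # placeholder for readcost_n (Eq. 16)

max_lat = LpVariable.dicts("max_lat", C, cat=LpContinuous)
min_lat = LpVariable.dicts("min_lat", C, cat=LpContinuous)

for p in C:
    for n in S:
        # Equation 13: latest finish of p; active only when n is on p.
        prob += (max_lat[p] >= accum[n] + write_cost[n]
                 - BIGCONST + BIGCONST * x[(n, p)])
        for f in F:
            nf = lp_and(prob, x[(n, p)], y[(p, f)], f"nf_{n}_{p}_{f}")
            # Equation 14: earliest start of p (upper-bounded).
            prob += (min_lat[p] <= accum[n] - read_cost[n]
                     - nf * CYCLES[n][f]
                     + BIGCONST - BIGCONST * x[(n, p)])
```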
$$\forall n \in S,\ \forall o \in S,\ n \neq o,\ \forall l \in C,\ \forall m \in C,\ l \neq m:$$
$$readcost_o \geq \sum_{v \in V} (x_{n,l} \land x_{o,m}) \cdot VARDST_{v,o} \cdot VARSRC_{v,n} \cdot COST_v \cdot READ\_LATENCY \qquad (16)$$

where $readcost_o$ denotes the reading cost (in clock cycles) of statement $o$ to read all its input data from FIFOs. $VARDST_{v,o}$ and $VARSRC_{v,n}$ are binary variables: $VARDST_{v,o}=1$ denotes that statement $o$ receives variable $v$ from other statements, and $VARSRC_{v,n}=1$ denotes that statement $n$ sends variable $v$ to other statements. $COST_v$ denotes the size of variable $v$ in words, where the word size is determined by the width of the FIFO communication bus (for example, a 32-bit or 4-byte FIFO communication bus will use one word to send a variable of type char or integer). READ_LATENCY is a constant, referring to the time in clock cycles for the processor to read a single word from a FIFO. $(x_{n,l} \land x_{o,m})$ is an and condition which is true (=1) when statement $n$ is mapped to processor $l$ ($x_{n,l}=1$) and statement $o$ is mapped to processor $m$.

[00090] Once the maximum and minimum latencies, $max\_lat_p$ and $min\_lat_p$, are constrained, the overall throughput of the pipeline is formulated as shown in Equation 17, which calculates the final throughput of the MPSoC as the maximal difference between the maximum and minimum latencies across processors. This allows the ILP to optimize throughput across multiple paths in the pipeline.

$$\forall p \in C : \quad overall\_tp \geq max\_lat_p - min\_lat_p \qquad (17)$$

where $overall\_tp$ will contain the maximum of the latencies of all the processors, and $max\_lat_p$ and $min\_lat_p$ are the maximum latency and minimum latency of processor $p$ respectively.

C. Latency formulation

[00091] The overall latency of the control loop, $G_c$, is formulated using the accumulated latencies $accumcost_{T,o}$ of each statement $o$. A maximal possible accumulated latency of the entire control loop ($overall\_la$) may be extracted, as shown in Equation 18, since the last statement in the control loop will not always have the highest accumulated latency. Equation 18 allows optimisation of the pipelined MPSoC for its entire latency, where the entire latency is defined as the entire execution time of the MPSoC to execute the application.

$$\forall n \in S : \quad overall\_la \geq accumcost_{T,n} \qquad (18)$$

where $overall\_la$ is the entire latency of the MPSoC and $accumcost_{T,n}$ is the accumulated latency cost of statement $n$, which is an element of statement set $S$.

[00092] However, the overall latency formulation in Equation 18 is not by itself sufficient to minimize latency. If that equation alone were used, the ILP would place all the statements into a single processor to improve the overall latency by minimizing the communication cost. To avoid such a scenario, a threshold on the latency of each core (TH) is used, so that the ILP can allocate the statements to multiple cores. This approach aligns with practice, where latency constraints are typically provided by the application designers, who will carefully analyse an application to deduce the latency constraint. Equation 19, which forces a threshold on the latency of each processor so that the ILP does not place all the statements on one processor, depicts the formulation for the latency threshold, where $total\_latency$ is the worst-case total latency, i.e., the total latency when the entire set of statements is mapped to a single processor. The threshold TH is chosen between the upper bound, $total\_latency$, and the lower bound, $total\_latency$ divided by the number of processors offset by a variable $\gamma$.
The $\gamma$ component, which is an integer, may be varied by the user if the lower bound of TH needs to be restricted. Increasing $\gamma$ raises the lower bound of TH, enabling more statements to be mapped to a processor.

$$\forall p \in C : \quad latency_p \geq max\_lat_p - min\_lat_p, \quad latency_p \leq TH,$$
$$total\_latency/(num\_procs - \gamma) \leq TH \leq total\_latency \qquad (19)$$

where $latency_p$ is the effective latency of processor $p$ in clock cycles, $max\_lat_p$ and $min\_lat_p$ are the maximum and minimum latencies of processor $p$ in clock cycles respectively, TH is the threshold in clock cycles, $total\_latency$ is the sum of the latencies of all the statements in clock cycles, $num\_procs$ is the number of processors in the MPSoC and $\gamma$ is a parameterised variable which can be adjusted by the user to provide a required TH.

D. Code Size Formulation

[00093] There may be statements in the code which share a common code segment. For example, two function call-site statements to the same function will use the same code segment; hence the code segment should not be duplicated if both call-sites are mapped to the same core. To handle this, the entire control loop may be categorized into code segments $g$, forming a set $G$, in terms of instruction/memory consumption. Two static vectors are used: $COST_{g,f}$, the cost in bytes of each code segment mapped on each flavour, and $STMTSEG_{s,g}$, a binary two-dimensional vector indicating the mapping between statements and code segments. When $STMTSEG_{s,g} = 1$, statement $s$ has code segment $g$. Equation 20 determines whether a processor is mapped to a code segment.

$$\forall p \in C,\ \forall g \in G,\ \forall s \in S: \quad z_{p,g} \geq x_{s,p} \cdot STMTSEG_{s,g} \qquad (20)$$

where $z_{p,g}$ is a binary variable revealing the processor-to-code-segment mapping, which is set to 1 when statement $s$ is mapped to processor $p$ (i.e., $x_{s,p} = 1$) and the binary variable $STMTSEG_{s,g}$ is 1 for the statement $s$ to code segment $g$ mapping.
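Equation 20 can be expressed with one inequality per (processor, segment, statement) triple, so a shared segment such as a common function body is counted at most once per processor. A sketch with hypothetical segment data, assuming prob, x, S and C from the earlier sketches:

```python
from pulp import LpVariable, LpBinary, lpSum

G = ["seg_main", "seg_func"]
STMTSEG = {s: {g: 0 for g in G} for s in S}
STMTSEG["s1"]["seg_main"] = 1
STMTSEG["s2"]["seg_func"] = 1
STMTSEG["s3"]["seg_func"] = 1        # s2 and s3 call the same function
SEG_SIZE = {"seg_main": 120, "seg_func": 80}   # instructions (hypothetical)

# Equation 20: z_seg[p, g] = 1 if any statement using segment g is mapped
# to processor p, so a shared segment is counted once per processor.
z_seg = LpVariable.dicts("z_seg", [(p, g) for p in C for g in G],
                         cat=LpBinary)
for p in C:
    for g in G:
        for s in S:
            prob += z_seg[(p, g)] >= STMTSEG[s][g] * x[(s, p)]

# Per-processor code size cost cscost_p built from the mapped segments.
cscost = {p: lpSum(z_seg[(p, g)] * SEG_SIZE[g] for g in G) for p in C}
```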
[00094] Equation 21, which determines the processor with the maximal code size, depicts the formulation for the code size cost $cscost_p$ of each core. The cost is determined by adding the code size cost (in instructions) of each segment $g$ mapped to a statement $s$ when $s$ is mapped to processor $p$, where processor $p$ is mapped to a flavour $f$. The maximum code size over all processors is then bounded as shown in Equation 21.

$$\forall p \in C : \quad cscost_{max} \geq cscost_p \qquad (21)$$

where $cscost_{max}$ depicts the maximum code size cost in number of instructions over all the processors and $cscost_p$ is the code size cost in instructions of each processor.

E. Area Formulation

[00095] The area, i.e., hardware area in logic gates, is directly related to the processor flavour. It is worth noting that memory area is proportional to the code size. Equation 22, which is used to optimise the total area of the MPSoC, depicts how the total area of the entire MPSoC is determined. The cost $COST_f$ is a static vector listing the area cost (in logic gates) of each flavour; $COST_f$ is extracted from synthesis of each flavour of a processor.

$$total\_area \geq \sum_{p \in C} \sum_{f \in F} y_{p,f} \cdot COST_f \qquad (22)$$

where $total\_area$ is the total area of the MPSoC in logic gates, $y_{p,f}$ is a binary variable denoting that processor $p$ is mapped to flavour $f$, and $COST_f$ is a static vector of flavour areas in logic gates.
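Equation 22 is a single constraint once the $y_{p,f}$ variables exist. A sketch with hypothetical per-flavour gate counts, assuming prob, y, C and F from the earlier sketches:

```python
from pulp import LpVariable, lpSum, LpContinuous

AREA = {"base": 25000, "with_mul": 31000}   # logic gates per flavour

# Equation 22: the total area follows from each processor's chosen flavour.
total_area = LpVariable("total_area", lowBound=0, cat=LpContinuous)
prob += total_area >= lpSum(y[(p, f)] * AREA[f] for p in C for f in F)
```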
F. Communication formulation

[00096] The communication cost $commcost_{l,m}$ is formulated by considering the variables used in statements, as shown in Equation 23. It requires the variable cost decision $varcost_{v,l,m}$ and the communication cost $COST_v$ of each variable $v$.

$$\forall l \in C,\ \forall m \in C,\ l \neq m: \quad commcost_{l,m} \geq \sum_{v \in V} varcost_{v,l,m} \cdot COST_v \qquad (23)$$

where $commcost_{l,m}$ is the communication cost in clock cycles between processors $l$ and $m$, $varcost_{v,l,m}$ is a binary decision vector for variable $v$, set to 1 when variable $v$ is passed from processor $l$ to processor $m$, and $COST_v$ denotes the size of variable $v$ in words, where the word size is determined by the width of the FIFO communication bus (for example, a 32-bit or 4-byte FIFO communication bus will use one word to send a variable of type char or integer).

[00097] Merging of statements on a core will affect which variables are passed between cores, and duplication of variables between cores should be avoided. For example, if statement $n$ passes a variable $v$ to statements $o$ and $q$, mapping statements $n$, $o$ and $q$ to separate cores $l$, $m$ and $r$ respectively will result in core $l$ passing variable $v$ to cores $m$ and $r$ via separate FIFOs. If statements $o$ and $q$ are mapped to a single core $m$ and statement $n$ to core $l$, core $l$ should send variable $v$ only once to core $m$ via a single FIFO. In order to perform this optimization, a variable is identified by the ID of the variable and the ID of the statement from which the variable is sent; in the present example this is the src statement $n$ that sends variable $v$. The set of variables sent from statements is formulated as $V$. Two binary vectors are created, mapping a variable to a destination and to a source, $VARDST_{v,o}$ and $VARSRC_{v,n}$ respectively, where $v \in V$ and $n, o \in S$. A value $VARDST_{v,o} = 1$ indicates that variable $v$ is passed to statement $o$, whereas $VARSRC_{v,n} = 1$ indicates that variable $v$ is passed from statement $n$; in other words, $n$ is the source statement and $o$ is the destination statement for variable $v$. Using these two binary vectors, a decision variable $varcost_{v,l,m}$ is defined in Equation 24, which records whether a variable $v$ is passed from processor $l$ to processor $m$.

$$varcost_{v,l,m} = \begin{cases} 1, & \text{if } v \text{ is passed from core } l \text{ to core } m,\\ 0, & \text{otherwise} \end{cases} \qquad (24)$$

where $varcost_{v,l,m}$ is a binary variable, set to 1 when a particular variable $v$ is transferred from core $l$ to core $m$.

[00098] The variable dependency formulation is shown in Equation 25, which determines whether a variable $v$ is passed from processor $l$ to processor $m$.
$$\forall l \in C,\ \forall m \in C,\ l \neq m,\ \forall n \in S,\ \forall o \in S,\ n \neq o,\ \forall v \in V:$$
$$varcost_{v,l,m} \geq E_{kn=d,\{n,o\}} \cdot (x_{n,l} \land x_{o,m}) \cdot VARDST_{v,o} \cdot VARSRC_{v,n} \qquad (25)$$

where $varcost_{v,l,m}$ is a binary variable set to 1 when a particular variable $v$ is transferred from core $l$ to core $m$, and $E_{kn=d,\{n,o\}}$ refers to a binary variable with a value of 1 showing the existence of a data edge ($kn = d$, kind = data) between statements $n$ and $o$. $VARDST_{v,o}$ and $VARSRC_{v,n}$ are binary variables: $VARDST_{v,o}=1$ denotes that statement $o$ receives variable $v$ from other statements, and $VARSRC_{v,n}=1$ denotes that statement $n$ sends variable $v$ to other statements. $x_{n,l}$ and $x_{o,m}$ reveal whether statement $n$ is mapped to processor $l$ and statement $o$ is mapped to processor $m$, respectively, via the combined and operation $\land$.

G. Objective functions

[00099] Minimizing the overall throughput variable, as shown in Equation 26, optimizes all the paths and reduces the possible number of processors by merging statements. Overall latency, overall code size and the total area may likewise be minimized. These components can also be combined to create a multi-objective function, as depicted in Equation 27.

$$\text{minimize } overall\_tp \qquad (26)$$

MPSoC synthesis 422

[000100] Returning to Fig. 4, the statement-level optimization 416 iteratively applies the ILP optimization for each number of cores, where the number of cores is varied between a minimum number of cores and a maximum number of cores according to constraints 418. Such an iterative step, described hereinafter in more detail, uses a maximum allowable number of processors for the MPSoC, which is decided a priori.

[000101] Fig. 6 depicts a process 600, represented by a code fragment, for producing an optimal MPSoC according to the disclosed DBPC approach. The process 600 produces an optimal MPSoC where the ILP formulation process, in the statement-level optimization process 416, is iterated from MINPROCS to MAXPROCS. A control loop in the form of $G_c$, for example 108 in Fig. 1, is provided as an input. It is possible to parallelize all the control loops of $G$ or to parallelize a specific control loop which is determined to be a major bottleneck in the application.
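The iteration of process 600 can be pictured as a simple loop over candidate core counts, keeping the best feasible design. The sketch below is illustrative only: solve_ilp is a hypothetical stand-in for one run of the statement-level optimization 416, assumed to return a (design, cost) pair or None when no feasible mapping exists.

```python
# Minimal sketch of the MINPROCS-to-MAXPROCS iteration of process 600.
MINPROCS, MAXPROCS = 1, 8

def synthesize_mpsoc(control_loop):
    best_design, best_cost = None, float("inf")
    for num_procs in range(MINPROCS, MAXPROCS + 1):
        result = solve_ilp(control_loop, num_procs)   # hypothetical solver
        if result is None:
            continue                      # infeasible for this core count
        design, cost = result
        if cost < best_cost:              # keep the best design seen so far
            best_design, best_cost = design, cost
    return best_design
```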
Multicore Code Generation

[000102] The ILP optimization procedure in the statement-level optimization 416 described above only performs optimization on the loop statements within the main loop being parallelized. In order to complete the DBPC approach, the dependent statements that form the rest of the code must also be executed, and any necessary outputs must be provided to each partitioned processor that requires them. In the example in Fig. 1(a), the statements in question are S1 and S2. There are two methods that can be used to handle such statements, depending on the workload of these statements compared to the other optimized processors. The traditional approach is to map the code to an additional processor and transmit the results to each processor that needs them. Code showing this approach can be seen in Fig. 2B at (e).

[000103] In Fig. 2B at (e), the system has been optimized and code for an additional core, core_init.c, has been added alongside the partitioned and reduced code for one of the processors, core_putpict.c. This additional core is only utilized at the start of operation of the MPSoC and transmits data only once, unlike the other cores, which have code contained within the loop and hence communicate frequently. Thus, the core is often unutilized.

[000104] The alternative approach is to distribute the dependency statements into the cores that require them, so that each core performs the computation in parallel and no data needs to be transferred. This has the benefit of reducing the area and communication costs of the MPSoC without having an impact on the ILP optimization of the statement-level optimization 416 performed on the partitioning of the code, as the distributed code is only executed once. Code size will increase by a small amount, as there is some duplication of code when two different processors using the same variable initialize it in the dependency statements, but the size increase will be minor compared to the area and communication savings for most applications. Resultant source code transcripts for two processors created in this manner are shown in Fig. 2B at (f), and the method to create this code is described as follows.

[000105] Utilizing the same procedures defined earlier for dependency detection between loop statements, a bottom-up traversal of the RPDG is performed from each critical statement mapped to a processor, to determine which nodes in the RPDG for the dependency statements are needed by that processor. Such nodes are tagged, along with any nodes that a tagged node depends on. Once all loop statements have been analyzed, a top-down traversal of the RPDG is performed to generate the necessary statements into the modified source code, including only the code for the necessary dependency statements and any loop statements mapped to the processor.
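The two traversals of [000105] can be sketched as a transitive tagging pass followed by an ordered emission pass. The sketch below is not the patent's implementation; the RPDG accessors (deps, order, source) are hypothetical names for illustration.

```python
def tag_needed_nodes(rpdg, critical_statements):
    """Bottom-up traversal: tag every dependency node that the
    processor's mapped (critical) statements transitively need."""
    tagged = set()
    stack = list(critical_statements)
    while stack:
        node = stack.pop()
        if node in tagged:
            continue
        tagged.add(node)
        stack.extend(rpdg.deps[node])     # follow dependencies upwards
    return tagged

def generate_core_source(rpdg, tagged):
    """Top-down traversal: emit only the tagged dependency statements and
    mapped loop statements, preserving the original statement order."""
    return [rpdg.source[n] for n in rpdg.order if n in tagged]
```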
DBPC arrangement No. 2

[000106] DBPC arrangement No. 1 describes an ILP-based optimization method; however, the DBPC arrangement is not limited to ILP-based optimization and can use evolutionary algorithms, simulated annealing and so on.

DBPC arrangement No. 3

[000107] As described above, the throughput, latency, area and communication costs are optimised separately. However, it is also possible in the DBPC arrangements to combine the costs, as the ILP system is not limited to separate optimization of each cost aspect. As a result, it is possible to optimize every cost aspect simultaneously with a ratio, as shown in the following equation, where weights (such as $\alpha$, $\beta$, $\gamma$) are applied to each cost based on its priority and its effect on the design goal.

$$\text{minimize } \alpha \cdot accumcost_{L,m} + \beta \cdot \sum_{p \in C} cscost_p + \gamma \cdot cscost_{max} + \delta \cdot total\_area + \epsilon \cdot accumcost_{T,m} \qquad (27)$$

where $accumcost_{L,m}$ is the accumulated latency cost of processor $m$, $cscost_p$ is the code size cost in instructions for processor $p$, $cscost_{max}$ is the maximum possible code size over all the processors, $total\_area$ is the entire area of the MPSoC, and $accumcost_{T,m}$ is the accumulated throughput cost of processor $m$. $\alpha$, $\beta$, $\gamma$, $\delta$, $\epsilon$ are constants which are used to normalise these five separate design objectives: latency, throughput, maximum code size, code size per processor and the total area of the MPSoC.
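A minimal sketch of Equation 27 as a weighted single objective, assuming the cost variables built in the earlier sketches (Equations 17-22); the latency and throughput terms are represented here by the overall_la and overall_tp variables, and the weight values are arbitrary placeholders.

```python
from pulp import lpSum

# Hypothetical normalisation weights for the five design objectives.
ALPHA, BETA, GAMMA, DELTA, EPSILON = 1.0, 0.1, 0.2, 0.001, 1.0

# Adding a bare expression (no inequality) to a PuLP problem sets the
# objective; prob was declared with sense LpMinimize.
prob += (ALPHA * overall_la                     # latency term
         + BETA * lpSum(cscost[p] for p in C)   # code size per processor
         + GAMMA * cscost_max                   # maximum code size
         + DELTA * total_area                   # hardware area
         + EPSILON * overall_tp)                # throughput term
```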
Contributions

[000108] The contributions of this DBPC arrangement are as follows:

a) A design flow is used in the DBPC arrangements which, given a program loop, produces a pipelined heterogeneous MPSoC. Each processor in the MPSoC is chosen from among a number of different available processor configurations (referred to as processor "flavours"). The DBPC arrangement produces a pipelined heterogeneous MPSoC that can have multiple paths, where the number of processors in each path need not be equal.

b) The DBPC arrangements consider many constraints critical to embedded systems, such as throughput, latency, area, and code size.

c) The DBPC arrangements provide an improved slicing method, where slicing refers to extracting a certain portion of the code by traversing a program dependency graph, for extracting reusable functions from ill-structured programs. The MPSoC can be synthesized using existing optimization techniques once the code is generated per core.

d) The DBPC arrangement preserves the original code structure and distributes this structure across multiple cores. The DBPC arrangement uses the program slicing approach, where slicing refers to a procedure of extracting a portion of the code by traversing the program dependency graph, to generate code in a manner that supports global/shared variables.

e) As shown in Fig. 2A at (b) at 202, the DBPC arrangement supports paths with a different number of processors in each path, referred to as "hybrid" paths, and the creation of heterogeneous MPSoCs (referred to as hybrid & heterogeneous MPSoC synthesis 202). As indicated in Fig. 2A at (a) at 201, a prior art arrangement places statements S3 and S4 into a single processor, since the number of processors in each path is restricted to be the same. However, it is possible that pipelining S3 and S4 onto two processors (i.e., the hybrid & heterogeneous MPSoC synthesis 202 marked in Fig. 2A at (b)) can improve the overall throughput, while keeping S5 in a single processor. The DBPC approach supports such finer optimizations to further improve throughput, while enabling hardware optimizations to reduce hardware costs such as area. Instead of mapping iterations to heterogeneous processors as done in current arrangements, which will not improve the throughput of each iteration inside the control loop, the formulation in this DBPC arrangement maps statements to heterogeneous processors, allowing finer performance control and further improvement. This DBPC arrangement analyses the legacy code as it is, without any conversions to accommodate global variables (as pointed out in Fig. 2A at (c) at 203), pointers and static variables, and can also constrain the system for area, communication, throughput, latency and code size. Fig. 2A at (c) at 203 shows the global variable support and constraints structure 203. As shown in Fig. 2A at (d) at 204 for code distribution, the analysis and code generation makes decisions to move dependent statements (S1 and S2 in this example) to the statements that need them, to reduce communication or code size. This DBPC arrangement refers to this moving of dependent statements as code distribution. Code examples 204 and 205 are provided in Fig. 2B at (e) and Fig. 2B at (f) to demonstrate the two code generation techniques this DBPC arrangement supports (without and with code distribution respectively).

Industrial Applicability

[000109] The arrangements described are applicable to the computer and data processing industries and particularly to the application-specific MPSoC design and manufacturing industry.

[000110] The foregoing describes only some embodiments of the present invention, and modifications and/or changes can be made thereto without departing from the scope and spirit of the invention, the embodiments being illustrative and not restrictive.

[000111] In the context of this specification, the word "comprising" means "including principally but not necessarily solely" or "having" or "including", and not "consisting only of". Variations of the word "comprising", such as "comprise" and "comprises", have correspondingly varied meanings.

Claims (7)

1. A method of modifying high level code to generate partitioned code for a heterogeneous multicore system, the method comprising the steps of: determining statement properties of statements in the high level code; constructing from said statement properties a dependency graph comprising statements in a control loop in the high level code, said statements in the dependency graph being linked by edges representing dependencies between the statements; determining performance properties of the statements in the control loop; mapping the statements in the dependency graph to cores of the multicore system according to the dependency graph and the performance properties to form a partitioned graph; and generating from the partitioned graph the partitioned code for the heterogeneous multicore system.
2. A method according to claim 1, comprising an additional step of generating the heterogeneous multicore system by moving code segments from the partitioned code to corresponding cores.
3. A method according to claim 1, wherein the constructing step comprises performing a rule-based traversal of the dependency graph to identify the dependency links depending upon the statement properties.
4. A method according to claim 1, wherein the mapping step comprises the step of: optimising the dependency graph, subject to the dependencies between the statements, to form the partitioned graph.
5. An apparatus for modifying high level code to generate partitioned code for a heterogeneous multicore system, the apparatus comprising: a processor; and a memory storing a computer executable program for directing the processor to perform a method comprising the steps of: determining statement properties of statements in the high level code; constructing from said statement properties a dependency graph comprising statements in a control loop in the high level code, said statements in the dependency graph being linked by edges representing dependencies between the statements; determining performance properties of the statements in the control loop; mapping the statements in the dependency graph to cores of the multicore system according to the dependency graph and the performance properties to form a partitioned graph; and generating from the partitioned graph the partitioned code for the heterogeneous multicore system.
6. A computer readable non-transitory storage medium storing a computer executable program for directing a processor to perform a method of modifying high level code to generate partitioned code for a heterogeneous multicore system, the method comprising the steps of: determining statement properties of statements in the high level code; constructing from said statement properties a dependency graph comprising statements in a control loop in the high level code, said statements in the dependency graph being linked by edges representing dependencies between the statements; determining performance properties of the statements in the control loop; mapping the statements in the dependency graph to cores of the multicore system according to the dependency graph and the performance properties to form a partitioned graph; and generating from the partitioned graph the partitioned code for the heterogeneous multicore system.
7. A heterogeneous multicore system generated using a method of modifying high level code to generate partitioned code for said heterogeneous multicore system, the method comprising the steps of: determining statement properties of statements in the high level code; constructing from said statement properties a dependency graph comprising statements in a control loop in the high level code, said statements in the dependency graph being linked by edges representing dependencies between the statements; determining performance properties of the statements in the control loop; mapping the statements in the dependency graph to cores of the multicore system according to the dependency graph and the performance properties to form a partitioned graph; and generating from the partitioned graph the partitioned code for the heterogeneous multicore system. CANON KABUSHIKI KAISHA Patent Attorneys for the Applicant/Nominated Person SPRUSON & FERGUSON
AU2013266988A 2013-12-03 2013-12-03 Application specific MPSoC synthesis using optimized code partitioning Abandoned AU2013266988A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2013266988A AU2013266988A1 (en) 2013-12-03 2013-12-03 Application specific MPSoC synthesis using optimized code partitioning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2013266988A AU2013266988A1 (en) 2013-12-03 2013-12-03 Application specific MPSoC synthesis using optimized code partitioning

Publications (1)

Publication Number Publication Date
AU2013266988A1 true AU2013266988A1 (en) 2015-06-18

Family

ID=53370303

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2013266988A Abandoned AU2013266988A1 (en) 2013-12-03 2013-12-03 Application specific MPSoC synthesis using optimized code partitioning

Country Status (1)

Country Link
AU (1) AU2013266988A1 (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3045870A1 (en) * 2015-12-21 2017-06-23 Valeo Equip Electr Moteur OFF-LINE ALLOCATION METHOD OF REAL-TIME ONBOARD SOFTWARE ON MULTICONTROLLER MULTICORE ARCHITECTURE, AND USE THEREOF FOR ON-BOARD APPLICATIONS IN A MOTOR VEHICLE
WO2017109386A1 (en) * 2015-12-21 2017-06-29 Valeo Equipements Electriques Moteur Off-line method for allocation of a real-time on-board software in a multi-core multi-controller architecture, and use thereof for on-board applications in a motor vehicle

Similar Documents

Publication Publication Date Title
EP3262503B1 (en) Hardware instruction generation unit for specialized processors
US9122523B2 (en) Automatic pipelining framework for heterogeneous parallel computing systems
Lee et al. Convergence and scalarization for data-parallel architectures
US10776115B2 (en) Debug support for block-based processor
US20170083320A1 (en) Predicated read instructions
Özkan et al. FPGA-based accelerator design from a domain-specific language
Wipliez et al. Classification and transformation of dynamic dataflow programs
Sheng et al. A compiler infrastructure for embedded heterogeneous MPSoCs
US10318261B2 (en) Execution of complex recursive algorithms
Boutellier et al. PRUNE: Dynamic and decidable dataflow for signal processing on heterogeneous platforms
Stulova et al. Throughput driven transformations of synchronous data flows for mapping to heterogeneous MPSoCs
Munk et al. Acotes project: Advanced compiler technologies for embedded streaming
Su et al. Automatic generation of fast BLAS3-GEMM: A portable compiler approach
Reiche et al. Loop parallelization techniques for fpga accelerator synthesis
Krashinsky Vector-thread architecture and implementation
Nadezhkin et al. Translating affine nested-loop programs with dynamic loop bounds into polyhedral process networks
Goossens Dataflow management, dynamic load balancing, and concurrent processing for real‐time embedded vision applications using Quasar
Rafique et al. Generating efficient parallel code from the rvc-cal dataflow language
AU2013266988A1 (en) Application specific MPSoC synthesis using optimized code partitioning
Bloch et al. Methodologies for synthesizing and analyzing dynamic dataflow programs in heterogeneous systems for edge computing
Nadezhkin et al. Automated generation of polyhedral process networks from affine nested-loop programs with dynamic loop bounds
Corvino et al. Design space exploration for efficient data intensive computing on socs
Roloff et al. Towards actor-oriented programming on PGAS-based multicore architectures
Ather et al. Transformation of Sequential Program to KPN- An Overview
US20230385125A1 (en) Graph partitioning and implementation of large models on tensor streaming processors

Legal Events

Date Code Title Description
MK4 Application lapsed section 142(2)(d) - no continuation fee paid for the application