WO2016110316A1 - Technique for loading instructions in a processor - Google Patents

Technique for loading instructions in a processor Download PDF

Info

Publication number
WO2016110316A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
block
instructions
instruction
memory bank
Prior art date
Application number
PCT/EP2015/050045
Other languages
French (fr)
Inventor
Zoltán NOVÁK
Sándor BALAJTHY
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/EP2015/050045
Publication of WO2016110316A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/40 - Transformation of program code
    • G06F8/41 - Compilation
    • G06F8/44 - Encoding
    • G06F8/443 - Optimisation
    • G06F8/4441 - Reducing the execution time required by the program code
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32 - Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322 - Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 - Instruction prefetching
    • G06F9/3814 - Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 - Instruction prefetching
    • G06F9/3816 - Instruction alignment, e.g. cache line crossing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3854 - Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3858 - Result writeback, i.e. updating the architectural state or memory

Definitions

  • the present disclosure generally relates to loading instructions in a processor for performing the instructions. More specifically, and without limitation, a method and a device for loading code into a processor including a pipeline of memory banks for storing instructions of the code are provided.
  • programmable integrated circuits such as integrated circuits employed in telecommunication network routers, comprise instruction memory organized in memory banks with separate processing units including memory controllers, instruction decoders, etc. Parallel operation of the instruction processing units results in an increased sequential reading and processing performance.
  • ASIC Application-Specific Integrated Circuit
  • An exemplary Application-Specific Integrated Circuit (ASIC) comprising a pipelined instruction memory is the EZChip NP4 network processor (described at http://www.ezchip.com/p_np4.htm), which is used in some Ericsson SSR routers (described at http://www.ericsson.com/ourportfolio/products/ssr-8000-family).
  • As the pipelined architecture is optimized for fast sequential reading, it is prevalent in dedicated instruction memories.
  • the pipelined architecture can be employed in programmable chips that differentiate data memory and instruction memory (so- called Harvard architectures), e.g., in ASICs.
  • the pipelined architecture can also be employed in general-purpose Central Processing Units (CPUs, usually implementing von Neumann architectures), e.g., in an instruction cache.
  • CPUs Central Processing Units
  • the pipelined architecture is sensitive to code branching. Upon running into a conditional or unconditional branching (also referred to as jump instruction), the instructions to be executed next are not necessarily the ones following sequentially in the instruction memory or instruction cache. In this case, the instruction processing units in the pipeline may have to be reset with a new address and the time already spent to decode the sequentially stored and unneeded instructions is lost.
  • An additional delay is caused upon performing the jump instruction, if memory banks and controllers are not able to work asynchronously so as to start a new reading operation at any time.
  • the delay depends on which memory bank the destination instruction is stored in. Either some clock cycles have to be waited until the memory controller of the destination memory bank is in the proper phase to start the reading, or the code layout includes additional no-operation instructions so that possible jump destination instructions are stored in the proper memory banks.
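  • The following sketch, which is not part of the patent text, illustrates this synchronization cost with a simple round-robin model: the bank whose memory controller is ready right after the jump is assumed to be the one following the jump's bank, and the phase offset of 1 is an illustrative assumption that varies per architecture.

```python
# Illustrative model of the post-jump delay in an n-banked pipeline (assumed
# round-robin bank phases; the offset of the next ready controller relative to
# the jump's bank is architecture-specific and set to 1 here for illustration).

def wait_cycles(n_banks: int, jump_bank: int, dest_bank: int, offset: int = 1) -> int:
    """Clock cycles wasted until the controller of dest_bank is in phase."""
    ready_bank = (jump_bank + offset) % n_banks
    return (dest_bank - ready_bank) % n_banks

# 4 banks, jump stored in bank 0, destination stored in bank 2 -> 1 wasted cycle
print(wait_cycles(4, 0, 2))  # 1
```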
  • In one aspect, a method of loading code in a processor is provided, wherein the processor includes a pipeline of memory banks for storing instructions of the code to be performed by the processor.
  • the method comprises a step of dividing the code into a plurality of blocks; a step of associating with each of the plurality of blocks a first memory bank of the pipeline for storing a first instruction of the corresponding block and a second memory bank that is in the pipeline subsequent to a memory bank for storing a last instruction of the corresponding block; and a step of loading each of the plurality of blocks starting at the first memory bank associated with the corresponding block according to a block order that at least partly matches the second memory bank associated with the corresponding block and the first memory bank of another block subsequent to the corresponding block according to the block order.
  • At least some embodiments of the technique achieve an economical instruction layout in pipelined memory by virtue of the block order that at least partly matches the first and second memory banks.
  • the technique can be implemented so as to minimize a number of no-operation instructions necessary to fill the code in cases of not matching blocks.
  • the code may be divided based on branching instructions in the code.
  • the branching instructions may include jump instructions and/or destination instructions of the jump instructions.
  • the jump instructions may include at least one of an unconditional jump instruction and/or a conditional jump instruction wherein all conditions cause some jump.
  • the first instruction may include or relate to a start instruction of the code and/or one of the destination instructions.
  • the last instruction may include or relate to a termination instruction of the code and/or one of the jump instructions.
  • the termination instruction may include at least one of a halt instruction and an exit instruction.
  • the at least partly matching may include minimizing deviations between the second memory bank of the corresponding block and the first memory bank of the subsequent block.
  • a sum of the deviations, e.g., in terms of a number of no-operation instructions inserted in the code, may be minimized.
  • the method may further comprise a step of selectively inserting one or more no-operation instructions in the code at the end of the corresponding block so that the second memory bank of the corresponding block matches the first memory bank of the subsequent block.
  • the loading may include statically storing or dynamically caching the instructions in the processor.
  • the block order may be determined by means of a graph, e.g., a directed graph.
  • the graph may include nodes representing the memory banks and directed edges representing the blocks.
  • the first memory bank may determine a start of the directed edge and the second memory bank may determine an end of the directed edge corresponding to the associated block.
  • the block order may correspond to an edge order of an Eulerian trail in the graph.
  • a computer program product comprises program code portions for performing any one of the steps of the method aspect disclosed herein when the computer program product is executed by one or more computing devices.
  • the computer program product may be stored on a computer-readable recording medium.
  • the computer program product may also be provided for download in a data network, e.g., the Internet.
  • the data network may be accessed using the processor.
  • the processor may be comprised in a node of the data network.
  • the technique may be implemented as a static code analysis.
  • the technique may reduce or avoid wasting memory space in certain pipelined architectures, thus increasing available program memory (e.g., in simple ASICs), or improving cache utilization in general-purpose CPUs.
  • As to a hardware aspect, a device for loading code in a processor is provided, wherein the processor includes a pipeline of memory banks for storing instructions of the code to be performed by the processor.
  • the device comprises a dividing unit adapted to divide the code into a plurality of blocks; an associating unit adapted to associate with each of the plurality of blocks a first memory bank of the pipeline for storing a first instruction of the corresponding block and a second memory bank that is in the pipeline subsequent to a memory bank for storing a last instruction of the corresponding block; and a loading unit adapted to load each of the plurality of blocks starting at the first memory bank associated with the corresponding block according to a block order that at least partly matches the second memory bank associated with the corresponding block and the first memory bank of another block subsequent to the corresponding block according to the block order.
  • any one of the units of the device, or a further dedicated unit, may be adapted to perform any one of the steps disclosed in the context of the method aspect.
  • the device may comprise any further feature disclosed in the context of the method aspect.
  • Fig. 1 shows a schematic block diagram of a system including a device for loading code in a processor;
  • FIG. 2 shows a flowchart for a method of loading code in a processor, which is implementable by the device of Fig. 1;
  • Fig. 3 schematically illustrates a first code flow that can be performed by the processor of Fig. 1;
  • Fig. 4 schematically illustrates a second code flow that is equivalent to the first code flow when performed by the processor of Fig. 1;
  • Fig. 5 schematically illustrates a result of conventional code padding that is optionally input to the device of Fig. 1;
  • Fig. 6 schematically illustrates a first exemplary code-padding step compatible with the method of Fig. 2;
  • Fig. 7 shows a graph for representing an exemplary result of an associating step of the method of Fig. 2;
  • Fig. 8 shows a flowchart for an implementation of the method of Fig. 2;
  • Fig. 9 schematically illustrates a second exemplary code-padding step compatible with the method of Fig. 2;
  • Fig. 10 shows a table illustrating code duplication compatible with the method of Fig. 2;
  • Fig. 11 shows a flowchart for an implementation of the associating step in the method of Fig. 2 or 8;
  • Fig. 12 shows a table for adding edges to the graph of Fig. 7.
  • Universal Mobile Telecommunications System (UMTS) networks, LTE-Advanced networks and next-generation antenna array networks
  • Fig. 1 schematically illustrates a system 100 for processing data, e.g., for routing data packets in an Internet Protocol (IP) network.
  • IP Internet Protocol
  • the IP network may be part of a telecommunications network (e.g., a backhaul network of a mobile telecommunications network).
  • the system 100 receives code 102 to be performed for processing the data.
  • the system 100 includes a device 150 for loading the code 102 and a processor 160 for processing the data by performing the loaded code.
  • the device 150 comprises a dividing unit 152 for determining a plurality of blocks that partition the code 102.
  • the device 150 further comprises an associating unit 154 for associating a first memory bank and a second memory bank with each of the blocks, and a loading unit 156 that loads the code 102 into the processor 160 according to a block order determined by the respective first and second memory banks.
  • the processor 160 comprises a plurality of memory banks 162 for storing instructions of the code 102.
  • An executing unit 164 performs the instructions stored in the memory banks 162 according to a pipelined architecture.
  • a memory controller (MC) is associated with each memory bank of the memory banks 162.
  • the processor 160 may be implemented by an application-specific integrated circuit (ASIC) and/or a general-purpose central processing unit (CPU).
  • the pipelined memory banks 162 may store code 102 that is essentially static in the system 100 or may dynamically cache instructions, e.g., retrieved from a mass storage unit.
  • Fig. 2 shows a flowchart for a method 200 of loading code in a processor.
  • the processor comprises a pipeline of memory banks for storing instructions of the code to be performed by the processor.
  • the code is divided into a plurality of self- contained blocks in a step 202 of the method 200.
  • Each of the plurality of blocks is associated with a first memory bank and a second memory bank, e.g., indicated by first and second memory bank numbers, respectively, in a step 204.
  • the first memory bank of the pipeline is associated with storing a beginning or a first instruction of the corresponding block.
  • the second memory bank is in the pipeline subsequent to a memory bank associated with storing an ending or a last instruction of the corresponding block.
  • a block order is determined in the step 204.
  • the plurality of blocks is loaded in the processor according to a block order.
  • One of the blocks is loaded starting at the first memory bank associated with the corresponding block.
  • the block order at least partly matches the second memory bank associated with the corresponding block and the first memory bank of another block subsequent to the corresponding block according to the block order.
  • the device 150 may perform the method 200, e.g., for the pipeline of memory banks 162.
  • the units 152, 154 and 156 may perform the steps 202, 204 and 206, respectively.
  • At least some embodiments of the technique thus obtain an instruction memory layout that optimizes memory controller synchronization in pipelined architectures with minimal instruction memory waste and without introducing additional hardware wait mechanisms.
  • the code 102 is divided in the step 202 into instruction blocks in such a way that the instruction blocks are not sensitive to reordering.
  • the instruction blocks are loaded into the instruction memory 162 in the optimal block order that minimizes a number of required no-operation instructions, which results in a better utilization of the instruction memory or instruction cache.
  • a code layout specific for the processor 160 is calculated, e.g., in the step 204, to ensure that the memory controller of a destination instruction is ready to start its reading operation after a jump instruction leading to the destination instruction has been processed.
  • an ASIC implementing the processor 160 runs statically loaded code 102, e.g., code 102 that usually does not change during operation.
  • the infrequent updates of the code 102 make it possible and economic to introduce complex and advanced code layout constraints and optimizations upon loading the resulting code into the instruction memory 162.
  • the time required for performing the method 200 by the device 150 is thus irrelevant for the performance of the processor 160. In this way, proper memory controller synchronization is achievable without introducing additional delay or waiting mechanisms into the system 100.
  • a general-purpose CPU implementing the processor 160 with an instruction cache as the pipelined memory banks 162 dynamically loads the code 102. There is a trade-off between the better utilization of the memory banks 162 and a delay introduced at the device 150.
  • the technique may be implemented in some existing ASICs without modifying a hardware design of a given ASIC, e.g., if instruction layout is completely controlled by software.
  • the device 150 may be implemented by a software development kit accompanied with the given ASIC.
  • General-purpose CPUs load the instructions 102 into the instruction cache 162 in a partly or entirely hardware-driven manner.
  • the device 150 may be implemented at hardware level, e.g., comprised in the processor 160.
  • the processor 160 may be implemented by one or more chips, e.g., with dedicated instruction memory according to a Harvard architecture. Static code may be loaded based on the code 102. In one embodiment, the processor 160 is not capable of dynamically loading or changing its code during standard operation. Without high performance requirements at the time of code loading, the device 150 can be implemented using software. In another embodiment, the processor 160 includes a general-purpose CPU. Instruction cache loading usually occurs dynamically during standard operation. The code loading is controlled by hardware with very strict performance requirements. In this context, the device 150 is implemented using hardware, e.g., an integrated circuit specialized to implement the method 200.
  • Document US 6920530 describes a caching system to dynamically recognize, load, and reorder basic blocks of a code in an instruction cache independent of an optimality of the block reordering.
  • Each of the documents US 5950009, US 5664191, US 8677336 and US 7747992 discusses optimal basic block ordering based on a profile determined by runtime data collected at previous executions.
  • the present technique may be used together with profile-guided optimizations that increase performance by ordering related code blocks near to each other in order to decrease instruction cache misses.
  • the prior art provides different basic block ordering constraints.
  • the conventional constraints can either be fulfilled simultaneously or, in cases of conflict, as a trade-off between performance (e.g., according to the profile-guided optimization constraints) and an economic cache utilization (e.g., according to the present technique).
  • the step 202 of dividing the code in reorderable blocks is illustrated in Figs. 3 and 4.
  • the code division step 202 determines blocks so that the blocks are not sensitive to their actual order in the memory banks 162, e.g., as compared to an initial order in the code 102, e.g., stored in an external instruction memory.
  • the start of each block is an instruction that is exclusively reached through some kind of jump instruction, and not by fall-through from the previous instruction.
  • the end of the block is determined by an instruction without fall-through option, e.g., an unconditional jump.
  • the code 102 includes a plurality of instructions, generically referred to by reference sign 302. As is manifest from the code flow 300, there are three blocks 304, 306 and 308, so that the instructions 302 within each block are (not necessarily exclusively) connected without a jump (also referred to as a fall-through edge).
  • the block 304 includes the instruction {A},
  • the block 308 includes the instruction flow {B -> C -> D -> E -> F}, and
  • the code block 306 includes the instruction flow {G -> H -> I -> J -> K}. All other edges of the code flow 300 represent jumps. Each of the edges that are not jumps has to be between consecutive positions in the pipelined memory banks 162. The positions are labeled by a program counter (PC) in Figs. 3 and 4.
  • PC program counter
  • the code flow 400 is thus equivalent to the code flow 300.
  • the code division step 202 can be implemented by sequentially reading the code 102 and by the following rules for marking block starts in terms of the first memory bank and block ends in terms of the second memory bank.
  • the start instruction of the code 102 determines a start of a block.
  • an instruction at which fall-through execution is not an option determines an end of a block.
  • the block end is determined by an unconditional jump instruction (e.g., "jmp", "return", etc.), an instruction that completely ends code flow by stopping code execution (e.g., "halt", "exit", etc.), a conditional jump instruction if all possible branches lead to a jump (no fall-through), and a last instruction of the code 102 (e.g., the program end).
  • the first instruction after a block end that is a destination of some jump is the start of a new block.
  • the first instruction includes a labeled destination instruction or an automatically inserted no-operation (abbreviated by NOP or NOOP) instruction before the labeled destination instruction if the code 102 is padded.
  • NOP automatically inserted no-operation
  • the dead code may result from previous code padding (i.e., the dead code includes automatically added NOP instructions). These parts of the code 102 are not included in any one of the blocks.
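  • As a non-authoritative illustration of these division rules, the following sketch marks block starts and ends while skipping dead code; the instruction attributes it uses (is_jump_target, no_fall_through) are hypothetical names, not taken from the patent.

```python
# Minimal sketch of the code-division rules of step 202 (hypothetical data
# model: each instruction carries two flags describing jump targets and
# fall-through behavior).
from dataclasses import dataclass
from typing import List

@dataclass
class Instr:
    text: str
    is_jump_target: bool = False   # labeled destination of some jump
    no_fall_through: bool = False  # e.g., "jmp", "return", "halt", "exit", or a
                                   # conditional jump where every branch jumps

def divide_into_blocks(code: List[Instr]) -> List[List[Instr]]:
    blocks: List[List[Instr]] = []
    current: List[Instr] = []
    skipping_dead_code = False
    for i, instr in enumerate(code):
        if skipping_dead_code:
            # dead code (e.g., previously inserted NOP padding) is not part of
            # any block; the next jump destination starts a new block
            if not instr.is_jump_target:
                continue
            skipping_dead_code = False
        current.append(instr)
        if instr.no_fall_through or i == len(code) - 1:
            blocks.append(current)          # block ends without fall-through
            current, skipping_dead_code = [], True
    return blocks
```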
  • Fig. 5 schematically illustrates conventional code padding.
  • Code padding is a simple way of solving synchronization issues in the pipelined architecture of the processor 160.
  • the destination memory bank either contains the destination instruction or an automatically added NOP instruction preceding the destination instruction.
  • the executing unit 164 of the processor 160 jumps to the nearest instruction before the destination address having a memory controller that is just ready-to-read, and follows on sequentially from there.
  • NOP instructions usually take one clock cycle to run, consume one line in the instruction memory and (as the name suggests) do nothing, so that they do not interfere with the program logic.
  • Conventional code padding 500 is illustrated by means of an exemplary code 502 schematically shown in Fig. 5.
  • the code 502 is conventionally loaded into the processor, e.g., an ASIC.
  • 7 NOP instructions are additionally included in a padded code 504.
  • the processor is able to jump to a label "LABEL1" at any time and from anywhere in the code.
  • the conventional code padding 500 can be improved by the following rules.
  • the code 102 is padded with the maximum number of NOP instructions, i.e., the number of NOP instructions is equal to n-1, wherein the symbol n denotes the number of memory banks 162. This means that the source addresses of the jumps possibly cannot be determined at compile time or code load time. Hence, it is assumed that one can jump to the destination from anywhere and at any time.
  • If an instruction is the (e.g., labeled) destination of direct jumps only, the code flow is not configured for jumping to the destination at any time. Rather, a minimal number of required NOP instructions is calculated, e.g., based on the actual synchronization state of the memory controllers at the time of executing the jump instructions.
  • An example for the improved code-padding step is schematically illustrated for the same code 502 in Fig. 6.
  • the resulting padded code 102 is shown at reference sign 604.
  • the number of the destination memory bank is 3 and the number of the source memory bank is 0, thus satisfying the exemplary synchronization requirement.
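  • A hedged sketch of this improved padding: when all jumps to a label are direct, the label only needs to be reachable from the banks that can actually be ready when those jumps execute. How that set of possibly ready banks is derived from the jump sites is architecture-specific and is assumed to be given here.

```python
# Sketch: minimal padding before a label that is the destination of direct
# jumps only.  ready_banks is the set of banks whose controller may be ready
# when execution enters the label (assumed known; deriving it is
# architecture-specific).  Conventional padding corresponds to
# ready_banks = all banks and yields n_banks - 1 NOPs.

def nops_before_label(label_bank: int, ready_banks: set[int], n_banks: int = 4) -> int:
    for nops in range(n_banks):
        padded_region = {(label_bank - k) % n_banks for k in range(nops + 1)}
        if ready_banks <= padded_region:   # an entry point exists in every ready bank
            return nops
    return n_banks - 1

print(nops_before_label(3, {3}))           # 0: single known entry bank, no padding
print(nops_before_label(3, {0, 1, 2, 3}))  # 3: jump possible from anywhere
```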
  • the (conventionally or improved) padded code may be the code 102 that is input to the device 150. NOP instructions that are superfluous according to the present technique are eliminated as dead code.
  • the present technique thus allows, for at least certain codes 102, avoiding or further reducing the number of inserted NOP instructions by virtue of the block order.
  • the block order is determined, e.g., in the step 204.
  • the block order is determined in the step 206.
  • Fig. 7 shows an exemplary graph 700 for representing the first and second memory banks determined by the step 204.
  • the memory banks are represented by nodes (or vertices) 702 of the graph 700. Starting from the first memory bank illustrated within each node, the corresponding block extends to or directly before the second memory bank.
  • the first and second memory banks determined by the step 204 define directed edges (or arrows) representing the blocks 704 from the first memory bank to the second memory bank in the graph 700.
  • the number of inserted NOP instructions is significantly reduced as compared to the conventional code padding 500.
  • the instruction layout resulting from the method 200 is advanced in that no or fewer NOP instructions are introduced compared to the conventional code padding, thus decreasing or avoiding a waste of instruction memory.
  • the code is optionally padded in advance in a code-padding step 201 of the method 200.
  • the code is divided into blocks according to the step 202.
  • the blocks are (physically or logically) reordered in the step 204 so as to fulfill the matching condition.
  • the step 204 may also be referred to as code block characterization and block order determination.
  • the code is loaded into the instruction memory, or instruction cache, according to the step 206.
  • the optional code-padding step 201 is architecture specific. Alternatively or in addition, adding the minimum number of NOP instructions in the step 204 or 206 fulfills the matching condition.
  • the matching condition can be reformulated and fulfilled as a constraint satisfaction problem (CSP).
  • CSP constraint satisfaction problem
  • the formal definition of a CSP involves variables and their domains, and constraints. For a set of variables x_1, x_2, ..., x_N with domains D_1, D_2, ..., D_N, each variable x_i takes a value in its respective domain D_i. There is also a set of constraints c_1, c_2, ..., c_m, such that a constraint c_t restricts (i.e., imposes a constraint on) the possible values in the domains of some subset of the variables.
  • the domain for all instructions is the one or more memory banks of their memory addresses.
  • the constraint includes the matching condition, i.e., the synchronization requirement relating the source banks of the jump and the destination bank of the corresponding jump, which must be kept for proper synchronization.
  • the blocks determined by the dividing step 202 are reordered so that their first instructions still happen to be in the same memory banks as they were without reordering. Hence, the same synchronization requirements will be fulfilled as without reordering.
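  • A minimal sketch of this CSP view, assuming the synchronization requirement can be written as a per-jump predicate over bank numbers; the concrete predicate below is a stand-in, not the architecture's real rule.

```python
# CSP sketch: variables are instructions, values are the memory banks of their
# addresses, and each jump contributes one constraint relating the bank of the
# jump to the bank of its destination.

N_BANKS = 4

def sync_ok(jump_bank: int, dest_bank: int) -> bool:
    # illustrative stand-in rule: the destination must sit a fixed phase
    # offset after the jump instruction's bank
    return dest_bank == (jump_bank + 3) % N_BANKS

# variables and domains: every instruction may sit in any bank
domains = {"jmp_L1": set(range(N_BANKS)), "L1": set(range(N_BANKS))}

# one constraint per jump; a full assignment is valid if all constraints hold
assignment = {"jmp_L1": 0, "L1": 3}
print(sync_ok(assignment["jmp_L1"], assignment["L1"]))  # True
```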
  • each block is associated with a pair of a first bank number representing the first memory bank and a second bank number representing the second memory bank of the method 200.
  • the first bank number indicates the memory bank storing the first instruction of the block.
  • the first and second bank numbers represent the respective smallest bank number.
  • an 8-banked system 100 with each instruction 302 duplicated in two banks can be described in the same way as a 4-banked system 100 without duplication.
  • the blocks can be characterized with a number in the range from 0 to 3, e.g., as the (smallest) memory bank for storing the starting instruction of the respective block.
  • the second bank number determined for each block in the associating step 204 is the bank number of the free space after the last instruction of the respective block. It is noted that the code 102 may include long instructions, e.g., on some
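  • A short sketch of this characterization, assuming a simple round-robin mapping of consecutive addresses to banks (address modulo the number of banks) and single-slot instructions; long instructions are not modeled.

```python
# Sketch of the (first bank, next free bank) pair of step 204 under a
# round-robin address-to-bank mapping (an illustrative assumption).

def characterize_block(start_addr: int, n_instructions: int, n_banks: int) -> tuple[int, int]:
    first_bank = start_addr % n_banks                         # bank of the first instruction
    next_free_bank = (start_addr + n_instructions) % n_banks  # bank of the free slot after the block
    return first_bank, next_free_bank

# a block of 6 instructions starting at address 13 in a 4-banked system
print(characterize_block(13, 6, 4))  # (1, 3)
```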
  • the associating step 204 may determine for a system 100 including two blocks in 4 memory banks 162 (or in 8 memory banks 162 with code duplication) the pairs (1,3) and (2,1) for the first and second memory banks associated with the respective blocks.
  • the first pair (1,3) for the first block means that the first block has to be started in memory bank 1 and the next free instruction after the first block is in memory bank 3.
  • a block order is considered so that the second block is arranged after the first block. Since the second block has to be started in memory bank 2 according to the second pair (2,1), fulfilling the synchronization constraint requires 3 NOP instructions inserted after or at the end of the first block. The three inserted NOP instructions consume a place in memory bank 3, memory bank 0 and memory bank 1, respectively.
  • the present technique uses a different block order arranging the second block before the first block. As a result, no inserted NOP instructions are needed. After the directed edge (2,1), the directed edge (1,3) matches without any code padding.
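  • The two-block example can be reproduced with a one-line padding formula; the formula is an illustrative model consistent with the counts given above.

```python
# NOPs needed between consecutive blocks: the bank distance from the "next
# free bank" of the preceding block to the "first bank" of the following block.

def padding_between(prev_next_free: int, next_first: int, n_banks: int = 4) -> int:
    return (next_first - prev_next_free) % n_banks

# order (1,3) then (2,1): 3 NOPs (filling banks 3, 0 and 1)
print(padding_between(3, 2))  # 3
# order (2,1) then (1,3): the banks match, no NOPs
print(padding_between(1, 1))  # 0
```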
  • a larger example for the code 102 has blocks 704 associated with the pairs
  • embodiments of the technique may reduce the number of NOP instructions to be inserted by a factor of 10.
  • the directed graph 700 with one vertex 702 representing each memory bank (or pseudo-bank, if there is code duplication) is derived. If some of the memory banks 162 are not represented in the block representation according to the associated pairs of first and second memory banks, they are left out in the graph 700.
  • a directed edge represents each block 704. The graph 700 of the blocks
  • the graph 700 is selectively extended in the step 204 with additional edges representing NOP blocks. For adding an edge of (0,2), a block of two NOP instructions is inserted.
  • Adding one (a,b)-padding block has a cost (in terms of instruction memory consumption) of (b - a) mod n NOP instructions, wherein n denotes the number of memory banks.
  • the blocks are ordered in a way to minimize the additional costs of these padding blocks.
  • Code padding can be used separately or in combination with the present technique.
  • the improved code-padding step 201 for fulfilling the synchronization requirement is applicable to code including multiple jump instructions leading to the same
  • An example 900 resulting from the code-padding step 201 is schematically illustrated in Fig. 9.
  • An exemplary code 902 includes two jump instructions.
  • the padded code 904 is input to the dividing step 202.
  • Code duplication can be used separately or in combination with the present technique.
  • the same code can be stored in all the 8 banks. This means that 8KB of memory is used to store only 1KB of code, but an 8-fold speedup in sequential reading can be expected. In this way, the problem of waiting for the memory controller of the destination address upon a jump to be ready is avoided.
  • the executing unit 164 continues reading the instructions from any bank (as all contain the same code), so the one with the memory controller ready to read can always be chosen.
  • FIG. 10 shows a table 1000 representing 4-fold code duplication.
  • Half of the memory banks 162, i.e., 4 memory banks, contains the odd-numbered instructions of the code.
  • the other half of the banks contains the even-numbered instructions of the code.
  • 2KB of code 102 can be stored in 8KB of memory banks 162.
  • the sequential reading is still expected to be eight times faster, but upon jump, it is possible that the destination of the jump includes an instruction that is not stored in one of the just-ready-to-read memory banks.
  • One additional clock cycle has to be waited after such a jump to read out the instruction from the corresponding memory bank that will be available just then.
  • a two-fold code duplication duplicates each instruction into two memory banks, allowing for storing 4KB code 102 in 8KB of memory banks 162. In the worst case, three clock cycles are necessary until a memory controller that contains the destination instruction becomes available after a jump.
  • the executing unit 164 has to wait up to 7 clock cycles in the worst case to access the memory bank of the destination address when ready to read.
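  • The trade-off described above can be summarized numerically; the sketch below assumes an 8-banked system with 1KB banks and evenly spread copies, reproducing the figures given in the text.

```python
# Duplication factor d: each instruction is stored in d of the 8 banks.
# Storable code shrinks with d, while the worst-case wait after a jump is
# (8 / d) - 1 clock cycles (0, 1, 3 and 7 for d = 8, 4, 2, 1).

N_BANKS = 8
BANK_KB = 1
for d in (8, 4, 2, 1):
    capacity_kb = N_BANKS * BANK_KB // d
    worst_case_wait = N_BANKS // d - 1
    print(f"duplication {d}: {capacity_kb} KB of code, worst-case wait {worst_case_wait} cycles")
```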
  • the step 204 determines an extension set including edges with minimal cost.
  • the extension set makes an Eulerian trail possible in the graph 700.
  • an Eulerian trail is a path in the graph 700 that visits every edge exactly once.
  • the block order is equivalent to the order of edges in the Eulerian trail, as each edge represents a certain block 704.
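  • A sketch of the graph construction and of the standard existence conditions for a directed Eulerian trail (degree balance plus connectivity); this restates textbook conditions rather than anything specific to the patent.

```python
# Nodes are memory banks, one directed edge (first bank -> next free bank) per
# block.  A directed Eulerian trail exists iff the graph is connected and the
# in/out degrees are balanced up to one start node and one end node.

from collections import defaultdict

def eulerian_trail_possible(edges: list[tuple[int, int]]) -> bool:
    out_deg, in_deg, nodes = defaultdict(int), defaultdict(int), set()
    for a, b in edges:
        out_deg[a] += 1
        in_deg[b] += 1
        nodes.update((a, b))
    starts = sum(1 for v in nodes if out_deg[v] - in_deg[v] == 1)
    ends = sum(1 for v in nodes if in_deg[v] - out_deg[v] == 1)
    balanced = all(abs(out_deg[v] - in_deg[v]) <= 1 for v in nodes)
    # connectivity of the underlying undirected graph is checked separately
    return balanced and starts <= 1 and ends <= 1

print(eulerian_trail_possible([(2, 1), (1, 3)]))  # True: trail 2 -> 1 -> 3
```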
  • the block order (e.g., represented by the Eulerian trail) is generated (e.g., as described below) after determining the extension set of edges by means of integer linear programming (ILP).
  • ILP integer linear programming
  • the extension set provided by the ILP, e.g., an ILP description of the Eulerian trail requirements, may be the extension set with absolute minimal cost.
  • The following method embodiment is also referred to as the greedy algorithm.
  • each of the code blocks is associated with memory bank requirements in terms of first and second memory banks indicated in the format (starting bank number, next free bank number).
  • the directed edge points from the first memory bank (holding the first instruction of the corresponding block) to the second memory bank (indicated by the number of the next free space).
  • the underlying undirected graph is connected in substeps 1106 and 1108. At the substeps 1106 and 1108, the directedness of the edges is not taken into account.
  • the substep 1106 includes determining vertex pairs having no path between them (using, for example, the Floyd-Warshall algorithm); the substep 1108 includes adding to the extension set of edges the minimal-cost edge that makes two unconnected vertices connected; and the substeps are repeated from the substep 1106 until the underlying undirected graph is fully connected.
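  • A rough sketch of these connectivity substeps; the cost of a candidate padding edge (a, b) is modeled as (b - a) mod n NOP instructions, matching the cost notion used above.

```python
# Substeps 1106/1108 (sketch): compute undirected reachability in the spirit of
# Floyd-Warshall and repeatedly add the cheapest edge joining two components.

def connect_components(nodes: list[int], edges: list[tuple[int, int]], n_banks: int):
    added: list[tuple[int, int]] = []
    while True:
        undirected = {(a, b) for a, b in edges + added}
        undirected |= {(b, a) for a, b in edges + added}
        reach = {(v, v) for v in nodes} | undirected
        for k in nodes:                      # Floyd-Warshall style closure
            for i in nodes:
                for j in nodes:
                    if (i, k) in reach and (k, j) in reach:
                        reach.add((i, j))
        unconnected = [(i, j) for i in nodes for j in nodes if (i, j) not in reach]
        if not unconnected:
            return added                     # underlying graph fully connected
        # add the minimal-cost padding edge joining two unconnected vertices
        a, b = min(unconnected, key=lambda e: (e[1] - e[0]) % n_banks)
        added.append((a, b))

print(connect_components([0, 1, 2, 3], [(0, 1), (2, 3)], 4))  # [(1, 2)]
```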
  • In substeps 1110 and 1112, the minimal-cost edge is added to the extension set of edges. If there are multiple code blocks with the same minimal cost, one of them is arbitrarily chosen.
  • the substeps 1110 and 1112 bring the balance of in-edges and out-edges closer to the requirements for the Eulerian trail. According to the substep 1110, the substep 1112 is repeated until all requirements of the Eulerian trail are satisfied.
  • the Eulerian trail is generated (for example by using Hierholzer's algorithm) in a substep 1114 of the step 204 of the method 200.
  • the architecture may have additional constraints on the block order. E.g., it is possible that some instructions must not follow each other to avoid various chip specific hazard situations. Such additional constraints are taken into consideration at the substep 1114.
  • a block that satisfies all the architectural requirements is chosen as the next block 704 (i.e., the next edge). If there is no such block at a given step, either backtracking or adding additional padding blocks is performed (as NOP instructions are usually not restricted, so they do not violate any constraints).
  • the code blocks including the selectively added padding blocks are loaded into the memory banks 162 in the block order determined by the edge order of the generated Eulerian trail.
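  • A compact sketch of substep 1114 using Hierholzer's algorithm; the architecture-specific ordering constraints and the backtracking mentioned above are not modeled here.

```python
# Hierholzer's algorithm (sketch): produces the node sequence of an Eulerian
# trail; consecutive node pairs give the edge order, i.e., the block order.

from collections import defaultdict

def hierholzer(edges: list[tuple[int, int]], start: int) -> list[int]:
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
    stack, trail = [start], []
    while stack:
        v = stack[-1]
        if adj[v]:
            stack.append(adj[v].pop())   # follow an unused edge
        else:
            trail.append(stack.pop())    # dead end: emit node
    return trail[::-1]

# blocks characterized as (2,1) and (1,3): the trail visits banks 2 -> 1 -> 3
print(hierholzer([(2, 1), (1, 3)], start=2))  # [2, 1, 3]
```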
  • the substeps 1106 and 1108 are implemented in a variant for better cost reduction and a little more complexity by adding (instead of adding only one edge (a,b)) all the edges of the minimal cost directed path between the nodes (i.e., memory banks) "a" and "b", which may or may not happen to be just the edge between them.
  • the effect will be the same on the global edge balance, but the costs may be lower.
  • For example, instead of adding a direct edge from node 0 to node 3, three edges are added: an edge from 0 to 1, an edge from 1 to 2 and an edge from 2 to 3.
  • the change in the in-out edge balance is the same, because the intermediate nodes on the path (at nodes 1 and 2) get an incoming and also an outgoing edge in each case.
  • the number of outgoing edges of the node 0 will be increased by one compared to its incoming number, and the number of incoming edges of the node 3 will be increased by one compared to its outgoing number. If the path has a lower or equal cost compared to the direct edge from node 0 to node 3, the path shall be used instead.
  • the number of incoming edges and outgoing edges can be represented by a table, as is shown for above example in a table 1200 of Fig. 12.
  • the left column indicates the node, i.e., the memory bank 702.
  • the columns "in” and “out” indicate the respective number of edges, i.e., the number of blocks 704.
  • the condition for generating the Euler path is satisfiable by adding a single edge with a cost of 1 (as pointed out above), so that in this case the greedy algorithm determines the optimal block order.
  • The resulting code size is compared to the conventionally padded code without the at least partly matching block order. More specifically, for the 4-banked system 100, the conventionally padded code includes 6061 instructions and the optimally ordered code includes 5705 instructions. For the 8-banked system 100, the number of instructions is reduced from 8537 to 7664.
  • the technique reduces development costs that are usually very high for platforms with limited resources (for example ASICs with small instruction memory), if the performance requirements are high.
  • chips can be used near to their limits, and the technique may be critical, e.g., for implementing further features in the code.
  • the technique provides a way to push the limits further by automatically freeing up code space, e.g., in network routers.
  • At least some embodiments can free up instruction space, e.g., in certain ASIC designs, or utilize the instruction cache better in general-purpose CPUs. Same or further embodiments can be used with existing technologies, e.g., on already existing hardware. The technique can be operated automatically without human intervention.
  • the technique has no performance drawbacks during operation, e.g., since all steps of the method can be performed statically before or during code load, without any impact on runtime.
  • cache loading happens during operation, so that there is a trade-off between a possibly negative performance impact and a less-frequent cache loading due to the better utilization.
  • the technique is scalable. For example, the method may be implemented so that its computational complexity scales linearly with the number of instructions in the code.

Abstract

A technique for loading code (102) in a processor (160) is provided. The processor (160) includes a pipeline of memory banks (162) for storing instructions of the code (102) to be performed by the processor (160). As to a method aspect of the technique, the code (102) is divided by a dividing unit (152) into a plurality of blocks. Each of the plurality of blocks is associated with a first memory bank of the pipeline for storing a first instruction of the corresponding block and a second memory bank that is in the pipeline subsequent to a memory bank for storing a last instruction of the corresponding block. Each of the plurality of blocks is loaded starting at the first memory bank associated with the corresponding block according to a block order that at least partly matches the second memory bank associated with the corresponding block and the first memory bank of another block that is subsequent to the corresponding block according to the block order.

Description

Technique for Loading Instructions in a Processor Technical Field
The present disclosure generally relates to loading instructions in a processor for performing the instructions. More specifically, and without limitation, a method and a device for loading code into a processor including a pipeline of memory banks for storing instructions of the code are provided.
Background
Many programmable integrated circuits, such as integrated circuits employed in telecommunication network routers, comprise instruction memory organized in memory banks with separate processing units including memory controllers, instruction decoders, etc. Parallel operation of the instruction processing units results in an increased sequential reading and processing performance.
By way of example, if processing one instruction from the instruction memory takes 4 clock cycles, by operating in parallel 4 memory banks with 4 separate and properly synchronized memory controllers, a sequential reading rate of 1 instruction per clock cycle is achievable. An exemplary Application-Specific Integrated Circuit (ASIC) comprising a pipelined instruction memory is the EZChip NP4 network processor (described at http://www.ezchip.com/p_np4.htm), which is used in some Ericsson SSR routers (described at http://www.ericsson.com/ourportfolio/products/ssr-8000-family).
As this pipelined architecture is optimized for fast sequential reading, it is prevalent in dedicated instruction memories. The pipelined architecture can be employed in programmable chips that differentiate data memory and instruction memory (so- called Harvard architectures), e.g., in ASICs. The pipelined architecture can also be employed in general-purpose Central Processing Units (CPUs, usually implementing von Neumann architectures), e.g., in an instruction cache.
The pipelined architecture is sensitive to code branching. Upon running into a conditional or unconditional branching (also referred to as jump instruction), the instructions to be executed next are not necessarily the ones following sequentially in the instruction memory or instruction cache. In this case, the instruction processing units in the pipeline may have to be reset with a new address and the time already spent to decode the sequentially stored and unneeded instructions is lost.
An additional delay is caused upon performing the jump instruction, if memory banks and controllers are not able to work asynchronously so as to start a new reading operation at any time. The delay depends on which memory bank the destination instruction is stored in. Either some clock cycles have to be waited until the memory controller of the destination memory bank is in the proper phase to start the reading, or the code layout includes additional no-operation instructions so that possible jump destination instructions are stored in the proper memory banks.
However, fulfilling synchronization requirements by means of padding instructions consumes valuable instruction memory. In ASICs and general-purpose CPUs, fast instruction memory is a limited resource.
Summary
Accordingly, there is a need for a technique that loads code in a processor so as to avoid or reduce both waiting clock cycles and wasting instruction memory by no-operation instructions at least in some situations.
In one aspect, a method of loading code in a processor is provided. The processor includes a pipeline of memory banks for storing instructions of the code to be performed by the processor. The method comprises a step of dividing the code into a plurality of blocks; a step of associating with each of the plurality of blocks a first memory bank of the pipeline for storing a first instruction of the corresponding block and a second memory bank that is in the pipeline subsequent to a memory bank for storing a last instruction of the corresponding block; and a step of loading each of the plurality of blocks starting at the first memory bank associated with the corresponding block according to a block order that at least partly matches the second memory bank associated with the corresponding block and the first memory bank of another block subsequent to the corresponding block according to the block order.
At least some embodiments of the technique achieve an economical instruction layout in pipelined memory by virtue of the block order that at least partly matches the first and second memory banks. The technique can be implemented so as to minimize a number of no-operation instructions necessary to fill the code in cases of not matching blocks.
The code may be divided based on branching instructions in the code. The branching instructions may include jump instructions and/or destination instructions of the jump instructions. The jump instructions may include at least one of an unconditional jump instruction and/or a conditional jump instruction wherein all conditions cause some jump.
The first instruction may include or relate to a start instruction of the code and/or one of the destination instructions. The last instruction may include or relate to a termination instruction of the code and/or one of the jump instructions. The termination instruction may include at least one of a halt instruction and an exit instruction.
The at least partly matching may include minimizing deviations between the second memory bank of the corresponding block and the first memory bank of the subsequent block. A sum of the deviations, e.g., in terms of a number of no-operation instructions inserted in the code, may be minimized.
The method may further comprise a step of selectively inserting one or more no-operation instructions in the code at the end of the corresponding block so that the second memory bank of the corresponding block matches the first memory bank of the subsequent block.
The loading may include statically storing or dynamically caching the instructions in the processor.
The block order may be determined by means of a graph, e.g., a directed graph. The graph may include nodes representing the memory banks and directed edges representing the blocks. The first memory bank may determine a start of the directed edge and the second memory bank may determine an end of the directed edge corresponding to the associated block. The block order may correspond to an edge order of an Eulerian trail in the graph.
As to another aspect, a computer program product is provided. The computer program product comprises program code portions for performing any one of the steps of the method aspect disclosed herein when the computer program product is executed by one or more computing devices. The computer program product may be stored on a computer-readable recording medium. The computer program product may also be provided for download in a data network, e.g., the Internet. The data network may be accessed using the processor. Alternatively or in addition, the processor may be comprised in a node of the data network.
The technique may be implemented as a static code analysis. The technique may reduce or avoid wasting memory space in certain pipelined architectures, thus increasing available program memory (e.g., in simple ASICs), or improving cache utilization in general-purpose CPUs.
As to a hardware aspect, a device for loading code in a processor is provided. The processor includes a pipeline of memory banks for storing instructions of the code to be performed by the processor. The device comprises a dividing unit adapted to divide the code into a plurality of blocks; an associating unit adapted to associate with each of the plurality of blocks a first memory bank of the pipeline for storing a first instruction of the corresponding block and a second memory bank that is in the pipeline subsequent to a memory bank for storing a last instruction of the
corresponding block; and a loading unit adapted to load each of the plurality of blocks starting at the first memory bank associated with the corresponding block according to a block order that at least partly matches the second memory bank associated with the corresponding block and the first memory bank of another block subsequent to the corresponding block according to the block order.
Any one of the units of the device, or a further dedicated unit, may be adapted to perform any one of the steps disclosed in the context of the method aspect. Furthermore, the device may comprise any further feature disclosed in the context of the method aspect.
Brief Description of the Drawings
In the following, the present disclosure is described in more detail with reference to exemplary embodiments illustrated in the drawings, wherein:
Fig. 1 shows a schematic block diagram of a system including a device for
loading code in a processor; Fig. 2 shows a flowchart for a method of loading code in a processor, which is implementable by the device of Fig. 1;
Fig. 3 schematically illustrates a first code flow that can be performed by the processor of Fig. 1;
Fig. 4 schematically illustrates a second code flow that is equivalent to the first code flow when performed by the processor of Fig. 1;
Fig. 5 schematically illustrates a result of conventional code padding that is
optionally input to the device of Fig. 1;
Fig. 6 schematically illustrates a first exemplary code-padding step compatible with the method of Fig. 2;
Fig. 7 shows a graph for representing an exemplary result of an associating step of the method of Fig. 2;
Fig. 8 shows a flowchart for an implementation of the method of Fig. 2;
Fig. 9 schematically illustrates a second exemplary code-padding step compatible with the method of Fig. 2;
Fig. 10 shows a table illustrating code duplication compatible with the method of
Fig. 2
Fig. 11 shows a flowchart for an implementation of the associating step in the method of Fig. 2 or 8; and
Fig. 12 shows a table for adding edges to the graph of Fig. 7.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as a specific system environment and specific application scenarios in order to provide a thorough understanding of the technique disclosed herein. It will be apparent to one skilled in the art that the technique may be practiced in other embodiments that depart from these specific details. Moreover, while the following embodiments are primarily described for a router operating according to Long Term Evolution (LTE), it will be readily apparent that the technique described herein may also be implemented in other mobile and stationary communication networks, including Wireless Local Area Networks (WLANs), Global System for Mobile Communications (GSM) networks, Universal Mobile
Telecommunications System (UMTS) networks, LTE-Advanced networks and next-generation antenna array networks.
Moreover, those skilled in the art will appreciate that the functions, steps and units explained herein may be implemented using software functioning in conjunction with a programmed microprocessor, an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP) or a general purpose computer, e.g., including an Advanced RISC Machine (ARM). It will also be appreciated that, while the following embodiments are primarily described in context with methods and devices, the invention may also be embodied in a computer program product as well as in a system comprising a computer processor and memory coupled to the processor, wherein the memory is encoded with one or more programs that may perform the functions, steps and implement the units disclosed herein.
Fig. 1 schematically illustrates a system 100 for processing data, e.g., for routing data packets in an Internet Protocol (IP) network. The IP network may be part of a telecommunications network (e.g., a backhaul network of a mobile
telecommunications network).
The system 100 receives code 102 to be performed for processing the data. The system 100 includes a device 150 for loading the code 102 and a processor 160 for processing the data by performing the loaded code.
The device 150 comprises a dividing unit 152 for determining a plurality of blocks that partition the code 102. The device 150 further comprises an associating unit 154 for associating a first memory bank and a second memory bank with each of the blocks, and a loading unit 156 that loads the code 102 into the processor 160 according to a block order determined by the respective first and second memory banks.
The processor 160 comprises a plurality of memory banks 162 for storing instructions of the code 102. An executing unit 164 performs the instructions stored in the memory banks 162 according to a pipelined architecture. A memory controller (MC) is associated with each memory bank of the memory banks 162.
The processor 160 may be implemented by an application-specific integrated circuit (ASIC) and/or a general-purpose central processing unit (CPU). The pipelined memory banks 162 may store code 102 that is essentially static in the system 100 or may dynamically cache instructions, e.g., retrieved from a mass storage unit.
Fig. 2 shows a flowchart for a method 200 of loading code in a processor. The processor comprises a pipeline of memory banks for storing instructions of the code to be performed by the processor. The code is divided into a plurality of self- contained blocks in a step 202 of the method 200.
Each of the plurality of blocks is associated with a first memory bank and a second memory bank, e.g., indicated by first and second memory bank numbers,
respectively, in a step 204. The first memory bank of the pipeline is associated with storing a beginning or a first instruction of the corresponding block. The second memory bank is in the pipeline subsequent to a memory bank associated with storing an ending or a last instruction of the corresponding block.
In an exemplary implementation of the method 200, a block order is determined in the step 204.
In a step 206, the plurality of blocks is loaded in the processor according to a block order. One of the blocks is loaded starting at the first memory bank associated with the corresponding block. The block order at least partly matches the second memory bank associated with the corresponding block and the first memory bank of another block subsequent to the corresponding block according to the block order.
The device 150 may perform the method 200, e.g., for the pipeline of memory banks 162. For example, the units 152, 154 and 156 may perform the steps 202, 204 and 206, respectively.
At least some embodiments of the technique thus obtain an instruction memory layout that optimizes memory controller synchronization in pipelined architectures with minimal instruction memory waste and without introducing additional hardware wait mechanisms. The code 102 is divided in the step 202 into instruction blocks in such a way that the instruction blocks are not sensitive to reordering. The instruction blocks are loaded into the instruction memory 162 in the optimal block order that minimizes a number of required no-operation instructions, which results in a better utilization of the instruction memory or instruction cache.
A code layout specific for the processor 160 is calculated, e.g., in the step 204, to ensure that the memory controller of a destination instruction is ready to start its reading operation after a jump instruction leading to the destination instruction has been processed.
In most cases, an ASIC implementing the processor 160 runs statically loaded code 102, e.g., code 102 that usually does not change during operation. The infrequent updates of the code 102 make it possible and economic to introduce complex and advanced code layout constraints and optimizations upon loading the resulting code into the instruction memory 162. The time required for performing the method 200 by the device 150 is thus irrelevant for the performance of the processor 160. In this way, proper memory controller synchronization is achievable without introducing additional delay or waiting mechanisms into the system 100.
A general-purpose CPU implementing the processor 160 with an instruction cache as the pipelined memory banks 162 dynamically loads the code 102. There is a trade-off between the better utilization of the memory banks 162 and a delay introduced at the device 150.
The technique may be implemented in some existing ASICs without modifying a hardware design of a given ASIC, e.g., if instruction layout is completely controlled by software. For example, the device 150 may be implemented by a software development kit accompanied with the given ASIC.
General-purpose CPUs load the instructions 102 into the instruction cache 162 in a partly or entirely hardware-driven manner. The device 150 may be implemented at hardware level, e.g., comprised in the processor 160.
The processor 160 may be implemented by one or more chips, e.g., with dedicated instruction memory according to a Harvard architecture. Static code may be loaded based on the code 102. In one embodiment, the processor 160 is not capable of dynamically loading or changing its code during standard operation. Without high performance requirements at the time of code loading, the device 150 can be implemented using software. In another embodiment, the processor 160 includes a general-purpose CPU. Instruction cache loading usually occurs dynamically during standard operation. The code loading is controlled by hardware with very strict performance requirements. In this context, the device 150 is implemented using hardware, e.g., an integrated circuit specialized to implement the method 200.
Document US 6920530 describes a caching system to dynamically recognize, load, and reorder basic blocks of a code in an instruction cache independent of an optimality of the block reordering. Each of the documents US 5950009, US 5664191, US 8677336 and US 7747992 discusses optimal basic block ordering based on a profile determined by runtime data collected at previous executions. The present technique may be used together with profile-guided optimizations that increase performance by ordering related code blocks near to each other in order to decrease instruction cache misses. The prior art provides different basic block ordering constraints. The conventional constraints can either be fulfilled simultaneously or, in case of conflict, be balanced as a trade-off between performance (e.g., according to the profile-guided optimization constraints) and an economical cache utilization (e.g., according to the present technique).
The step 202 of dividing the code into reorderable blocks is illustrated in Figs. 3 and 4. The code division step 202 determines blocks so that the blocks are not sensitive to their actual order in the memory banks 162, e.g., as compared to an initial order in the code 102, e.g., stored in an external instruction memory.
In one embodiment, the start of each block is an instruction that is exclusively reached through some kind of jump instruction, and not by fall-through from the previous instruction. The end of the block is determined by an instruction without fall-through option, e.g., an unconditional jump.
An exemplary code flow 300 for the code 102 is shown in Fig. 3. The code 102 includes a plurality of instructions, generically referred to by reference sign 302. As is manifest from the code flow 300, there are three blocks 304, 306 and 308, so that the instructions 302 within each block are (not necessarily exclusively) connected without a jump (also referred to as a fall-through edge).
The block 304 includes the instruction {A}, the block 308 includes the instruction flow {B -> C -> D -> E -> F}, and the code block 306 includes the instruction flow {G -> H -> I -> J -> K}. All other edges of the code flow 300 represent jumps. Each of the edges that are not jumps has to be between consecutive positions in the pipelined memory banks 162. The positions are labeled by a program counter (PC) in Figs. 3 and 4.
Since each of the instructions B and G in the code flow 300 is reached only by jump instructions, and the instructions F and K are not sequentially followed by an instruction on the fall-through path, the two blocks 308 including {B -> C -> D -> E -> F} and 306 including {G -> H -> I -> J -> K} can be loaded in a different order into the memory banks 162, while the property that all fall-through arrows are between consecutive memory addresses (e.g., PC=i and PC=i+1) is preserved, as is exemplified in a code flow 400 shown in Fig. 4. The code flow 400 is thus equivalent to the code flow 300.
For most of the architectures of the processor 160, the code division step 202 can be implemented by sequentially reading the code 102 and by applying the following rules for marking block starts, which relate to the first memory bank, and block ends, which relate to the second memory bank.
First, the start instruction of the code 102 determines a start of a block.
Second, an instruction at which fall-through execution (i.e., executing the instruction at the following instruction address) is not an option determines an end of a block. For example, the block end is determined by an unconditional jump instruction (e.g., "jmp", "return", etc.), an instruction that completely ends the code flow by stopping code execution (e.g., "halt", "exit", etc.), a conditional jump instruction if all possible branches lead to a jump (no fall-through), or a last instruction of the code 102 (e.g., the program end).
Third, the first instruction after a block end that is a destination of some jump is the start of a new block. For example, the first instruction includes a labeled destination instruction or an automatically inserted no-operation (abbreviated by NOP or NOOP) instruction before the labeled destination instruction if the code 102 is padded.
Instructions between a block end and the start of the next block, if any, are left out because they are dead code within the code 102. E.g., the dead code may result from previous code padding (i.e., the dead code includes automatically added NOP instructions). These parts of the code 102 are not included in any one of the blocks.
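For illustration only, the three rules above can be applied in a single sequential pass, as in the following Python sketch. The instruction fields "is_jump_target" and "ends_flow" and the function name are hypothetical and merely stand for the information described above (whether the instruction is a destination of some jump, and whether fall-through execution is not an option).

```python
def divide_into_blocks(instructions):
    """Divide a sequence of instructions into reorderable blocks.

    Each instruction is assumed to be a dict with the hypothetical fields
    "is_jump_target" (destination of some jump) and "ends_flow" (no
    fall-through is possible, e.g., unconditional jump, halt, program end).
    Instructions between a block end and the next jump target are dead code
    and are not included in any block.
    """
    blocks, current = [], None
    for index, instr in enumerate(instructions):
        # First and third rule: the start of the code, or the first jump
        # target after a block end, starts a new block.
        if index == 0 or (current is None and instr["is_jump_target"]):
            current = []
            blocks.append(current)
        if current is not None:
            current.append(instr)
        # Second rule: no fall-through is possible here, so the block ends;
        # following instructions are dead code until the next jump target.
        if instr["ends_flow"]:
            current = None
    return blocks
```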
Fig. 5 schematically illustrates conventional code padding. Code padding is a simple way of solving synchronization issues in the pipelined architecture of the processor 160.
The synchronization, e.g., in the EZChip NP4 network processor, can require that the destination memory bank i (with the just-ready-to-read memory controller) of a jump instruction stored in the memory bank j is always i = (j + 3) modulo n, wherein n is the total number of memory banks.
The destination memory bank either contains the destination instruction or an automatically added NOP instruction preceding the destination instruction. Instead of introducing some kind of dedicated hardware waiting mechanism into the processor 160, the executing unit 164 of the processor 160 jumps to the nearest instruction before the destination address having a memory controller that is just ready to read, and follows sequentially from there.
To ensure correct operation, the instruction memory before each possible jump destination address has to be padded with a proper number of NOP instructions. The NOP instructions usually take one clock cycle to run, consume one line in the instruction memory and (as the name suggests) do nothing, so that they do not interfere with the program logic.
Conventional code padding 500 is illustrated by means of an exemplary code 502 schematically shown in Fig. 5. The code 502 is conventionally loaded into the processor, e.g., an ASIC. For a processor with an 8-banked architecture and without code duplication, 7 NOP instructions are additionally included in a padded code 504.
In this way, the processor is able to jump to a label "LABEL1" anytime from
anywhere within the code 504. The code flow jumps to the just-ready-to-read memory bank in the pipeline before the label. Synchronization issues are avoided by running the NOP instructions. The downside of the conventional code padding 500 is that NOP instructions consume valuable instruction space.
The conventional code padding 500 can be improved by the following rules.
First, where there is a possibility of an indirect jump (also referred to as indirect branch), the code 102 is padded with the maximum number of NOP instructions. I.e., the number of NOP instructions is equal to n-1, wherein the symbol n denotes the number of memory banks 162. The reason is that the source addresses of the jumps possibly cannot be determined at compile time or code load time. Hence, it is assumed that one can jump to the destination from anywhere and at any time.
Second, if an instruction is the (e.g., labeled) destination of direct jumps only, the code flow is not configured for jumping to the destination at any time. Rather, a minimal number of required NOP instructions is calculated, e.g., based on the actual synchronization state of the memory controllers at the time of executing the jump instructions.
An example for the improved code-padding step is schematically illustrated for the same code 502 in Fig. 6. The resulting padded code 102 is shown at reference sign 604. In the exemplary padded code 604, the number of the destination memory bank is 3 and the number of the source memory bank is 0, thus satisfying the exemplary synchronization requirement.
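The calculation of the minimal number of NOP instructions for a direct jump can be sketched as follows, assuming the exemplary synchronization requirement i = (j + 3) modulo n quoted above and ignoring knock-on shifts of instructions located behind the insertion point. The function name and its arguments are illustrative and not part of the description.

```python
def nops_before_destination(source_bank, destination_bank, number_of_banks):
    """Minimal NOPs to insert before a labeled destination of a direct jump.

    source_bank: bank j holding the jump instruction.
    destination_bank: bank currently holding the labeled destination.
    Assumes the exemplary rule that bank i = (j + 3) mod n is just ready to
    read when the jump has been processed.
    """
    required_bank = (source_bank + 3) % number_of_banks
    return (required_bank - destination_bank) % number_of_banks

# Hypothetical illustration with 8 banks: a jump stored in bank 0 requires the
# destination to end up in bank 3; a destination currently in bank 1 would need
# (3 - 1) mod 8 = 2 NOP instructions, while a destination already in bank 3
# (as in the padded code 604 of Fig. 6) needs none.
print(nops_before_destination(0, 1, 8))  # 2
print(nops_before_destination(0, 3, 8))  # 0
```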
The padded code, whether resulting from conventional or improved padding, may be the code 102 that is input to the device 150. NOP instructions that are superfluous according to the present technique are eliminated as dead code.
The present technique thus allows, for at least certain codes 102, avoiding or further reducing the number of inserted NOP instructions by virtue of the block order. The block order is determined, e.g., in the step 204. Alternatively, the block order is determined in the step 206.
Fig. 7 shows an exemplary graph 700 for representing the first and second memory banks determined by the step 204. The memory banks 702 are represented by nodes (or vertices) of the graph 700. Starting from the first memory bank illustrated within each node, the corresponding block extends to or directly before the second memory bank. The first and second memory banks determined by the step 204 define directed edges (or arrows) representing the blocks 704 from the first memory bank to the second memory bank in the graph 700.
Hence, even if code padding by inserting NOP instructions (e.g., prior to the step 202 or in the step 204 to fulfill the matching condition) is necessary to match the second memory bank associated with one block and the first memory bank associated with the subsequent block, the number of inserted NOP instructions is significantly reduced as compared to the conventional code padding 500. The instruction layout resulting from the method 200 is advanced in that fewer or no NOP instructions are introduced compared to the conventional code padding, thus decreasing or avoiding a waste of instruction memory.
An implementation of the method 200 is shown by the flowchart in Fig. 8. The code is optionally padded in advance in a code-padding step 201 of the method 200. The code is divided into blocks according to the step 202. The blocks are (physically or logically) reordered in the step 204 so as to fulfill the matching condition. The step 204 may also be referred to as code block characterization and block order determination. The code is loaded into the instruction memory, or instruction cache, according to the step 206.
The optional code-padding step 201 is architecture specific. Alternatively or in addition, adding the minimum number of NOP instructions in the step 204 or 206 fulfills the matching condition.
The matching condition can be reformulated and fulfilled as a constraint satisfaction problem (CSP). The formal definition of a CSP involves variables and their domains, and constraints. For a set of variables x1, x2, ..., xN with domains D1, D2, ..., DN, each variable xi takes a value in its respective domain Di. There is also a set of constraints c1, c2, ..., cm, such that a constraint ci restricts (i.e., imposes a constraint on) the possible values in the domains of some subset of the variables. CSP
implementation is described in the "CSP tutorial" of the Cork Constraint Computation Centre published under http://4c.ucc.ie/web/outreach/tutorial.html by the University College Cork. The domain for all instructions is the one or more memory banks of their memory addresses. The constraint includes the matching condition, i.e., the synchronization requirement relating the source banks of the jump and the destination bank of the corresponding jump, which must be kept for proper synchronization. Given the padded code 102, the blocks determined by the dividing step 202 are reordered so that their first instructions still happen to be in the same memory banks as they were without reordering. Hence, the same synchronization requirements will be fulfilled as without reordering.
In the associating step 204, each block is associated with a pair of a first bank number representing the first memory bank and a second bank number representing the second memory bank of the method 200. The first bank number indicates the memory bank storing the first instruction of the block. If the associating step 204 is performed after the code-padding step 201 and the block-splitting step 202, this property is kept invariant by the reorder. In this way, all the synchronization requirements that have been fulfilled by the code-padding step 201 are preserved.
In case of code duplication in the memory banks 162, the first and second bank numbers represent the respective smallest bank number. By way of example, an 8-banked system 100 with each instruction 302 duplicated in two banks can be described in the same way as a 4-banked system 100 without duplication. In either case, the blocks can be characterized with a number in the range from 0 to 3, e.g., as the (smallest) memory bank for storing the starting instruction of the respective block.
The second bank number determined for each block in the associating step 204 is the bank number of the free space after the last instruction of the respective block. It is noted that, e.g., on some architectures, the code 102 may include long instructions which consume more than one place in the instruction memory.
For explanation, the associating step 204 may determine for a system 100 including two blocks in 4 memory banks 162 (or in 8 memory banks 162 with code duplication) the pairs (1,3) and (2,1) for the first and second memory banks associated with the respective blocks. The first pair (1,3) for the first block means that the first block has to be started in memory bank 1 and the next free instruction after the first block is in memory bank 3.
For reference, a block order is considered so that the second block is arranged after the first block. Since the second block has to be started in memory bank 2 according to the second pair (2,1), fulfilling the synchronization constraint requires 3 NOP instructions inserted after or at the end of the first block. The three inserted NOP instructions consume a place in memory bank 3, memory bank 0 and memory bank 1, respectively.
The present technique uses a different block order arranging the second block before the first block. As a result, no inserted NOP instructions are needed. After the directed edge (2,1), the directed edge (1,3) matches without any code padding.
A larger example for the code 102 has blocks 704 associated with the pairs
{(0,0), (0,2), (1,0), (1,2), (1,3), (2,0), (2,1), (2,2)}.
The optimal layout,
(1,2) → (2,2) → (2,1) → (1,0) → (0,0) → (0,2) → (2,0) → (1,3),
has a cost of 1, i.e., one instruction space is lost due to inserting one NOP instruction.
A worst-case layout,
(0,2) → (1,3) → (2,2) → (1,2) → (1,0) → (2,1) → (0,0) → (2,0),
has a total cost of 19.
If a random order is chosen, the average cost is 10.125. Thus, leaving the blocks in their random original order, around 10 NOP instructions are necessary to maintain the synchronization requirement for the blocks 704, instead of the optimal 1 NOP instruction. Embodiments of the technique may thus reduce the number of NOP instructions to be inserted by a factor of about 10.
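As a cross-check, the cost of a given block order can be computed with a small sketch, assuming 4 memory banks (consistent with the bank numbers 0 to 3 used in this example) and the padding cost relation given further below, i.e., that the number of NOP instructions between consecutive blocks is (first bank of the next block - second bank of the previous block) modulo the number of banks. The helper name is illustrative.

```python
def layout_cost(order, number_of_banks=4):
    """Total NOP padding needed to load the (first, second) bank pairs in order."""
    return sum(
        (nxt[0] - prev[1]) % number_of_banks
        for prev, nxt in zip(order, order[1:])
    )

optimal = [(1, 2), (2, 2), (2, 1), (1, 0), (0, 0), (0, 2), (2, 0), (1, 3)]
worst = [(0, 2), (1, 3), (2, 2), (1, 2), (1, 0), (2, 1), (0, 0), (2, 0)]

print(layout_cost(optimal))  # 1, as for the optimal layout above
print(layout_cost(worst))    # 19, as for the worst-case layout above
```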
An implementation of the associating step 204 for selectively inserting the minimal number of NOP instructions to fulfill the matching condition is described.
First, the directed graph 700 with one vertex 702 representing each memory bank (or pseudo-bank, if there is code duplication) is derived. If some of the memory banks 162 are not represented in the block representation according to the associated pairs of first and second memory banks, they are left out in the graph 700. A directed edge represents each block 704. The graph 700 of the blocks
{(0,0), (0,2), (1,0), (1,2), (1,3), (2,0), (2,1), (2,2)} is depicted in Fig. 7.
The graph 700 is selectively extended in the step 204 with additional edges representing NOP blocks. For adding an edge of (0,2), a block of two NOP
instructions is added to the set of code blocks as determined by the step 202. The one or more additional edges represent the padding. Adding one (a,b)-padding block has a cost (in terms of instruction memory consumption) of:
0 < cost < number of banks,
cost ≡ b + number of banks - a (mod number of banks).
The blocks are ordered in a way to minimize the additional costs of these padding blocks.
Code padding can be used separately or in combination with the present technique. The improved code-padding step 201 for fulfilling the synchronization requirement is applicable to code including multiple jump instructions leading to the same
destination instruction. An example 900 resulting from the code-padding step 201 is schematically illustrated in Fig. 9. An exemplary code 902 includes two jump instructions. The padded code 904 is input to the dividing step 202.
Code duplication can be used separately or in combination with the present technique. For an embodiment of the system 100 including 8 independent instruction memory banks 162, the same code can be stored in all 8 banks. This means that 8KB of memory is needed to store only 1KB of code, but an 8-fold speedup in sequential reading can be expected. In this way, the problem of waiting, upon a jump, for the memory controller of the destination address to become ready is avoided. After the jump, the executing unit 164 continues reading the instructions from any bank (as all banks contain the same code), so the one with the memory controller ready to read can always be chosen.
For the sake of convenience, further consider that instructions are numbered. Fig. 10 shows a table 1000 representing 4-fold code duplication. Half of the memory banks 162 (i.e., 4 memory banks) contain identical copies of only the odd-numbered instructions of the code, and the other half of the banks contains the even-numbered instructions of the code. In this way, 2KB of code 102 can be stored in 8KB of memory banks 162. The sequential reading is still expected to be eight times faster, but upon a jump, it is possible that the destination of the jump is an instruction that is not stored in one of the just-ready-to-read memory banks. In that case, one additional clock cycle of waiting is required after the jump to read out the instruction from the corresponding memory bank, which becomes available just then.
In the same way, a two-fold code duplication duplicates each instruction into two memory banks, allowing for storing 4KB of code 102 in 8KB of memory banks 162. In the worst case, three clock cycles have to be waited after a jump until a memory controller of a bank containing the destination instruction becomes available.
If all the banks without code duplication are used, the expected speed of sequential read stays the same. Upon executing a jump instruction, the executing unit 164 has to wait up to 7 clock cycles in the worst case to access the memory bank of the destination address when ready to read.
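The trade-off described in the preceding paragraphs can be summarized, for the exemplary 8-banked system, with the following sketch; the helper is illustrative and simply restates the figures given above.

```python
def duplication_trade_off(number_of_banks, duplication_factor, code_size_kb):
    """Memory needed for the duplicated code and worst-case wait after a jump."""
    memory_needed_kb = code_size_kb * duplication_factor
    worst_case_wait_clocks = number_of_banks // duplication_factor - 1
    return memory_needed_kb, worst_case_wait_clocks

# 8-fold duplication: 1KB of code needs 8KB, no waiting after a jump.
# 4-fold duplication: 2KB of code needs 8KB, up to 1 extra clock cycle.
# 2-fold duplication: 4KB of code needs 8KB, up to 3 clock cycles.
# No duplication:     8KB of code fits in 8KB, up to 7 clock cycles.
for factor, code_kb in ((8, 1), (4, 2), (2, 4), (1, 8)):
    print(factor, duplication_trade_off(8, factor, code_kb))
```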
An exemplary implementation of the step 204 that minimizes the number of inserted NOP instructions by determining the block order that at least partly fulfills the matching condition is described with reference to the flowchart in Fig. 11. The graph 700 shown in Fig. 7 is merely referenced for the clarity of the explanation and without limitation.
The step 204 determines an extension set including edges with minimal cost. The extension set makes an Eulerian trail possible in the graph 700. An Eulerian trail is a path in the graph 700 that visits every edge exactly once. The block order is equivalent to the order of edges in the Eulerian trail, as each edge represents a certain block 704.
A directed graph has an Eulerian trail if and only if at most one vertex has (out- degree) - (in-degree) = 1, at most one vertex has (in-degree) - (out-degree) = 1, every other vertex has equal in-degree and out-degree, and all of its vertices with nonzero degree belong to a single connected component of the underlying
undirected graph.
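The quoted condition can be checked on the directed multigraph of blocks with a short sketch. The edges are the (first bank, second bank) pairs, the function name is illustrative, and connectivity is tested here with a simple union-find rather than a full reachability computation.

```python
from collections import Counter

def has_eulerian_trail(edges):
    """Check the Eulerian trail condition for a directed multigraph of blocks."""
    out_deg = Counter(a for a, _ in edges)
    in_deg = Counter(b for _, b in edges)
    nodes = set(out_deg) | set(in_deg)

    plus_one = sum(1 for v in nodes if out_deg[v] - in_deg[v] == 1)
    minus_one = sum(1 for v in nodes if in_deg[v] - out_deg[v] == 1)
    balanced = all(abs(out_deg[v] - in_deg[v]) <= 1 for v in nodes)
    if not (balanced and plus_one <= 1 and minus_one <= 1):
        return False

    # All vertices with nonzero degree must lie in one connected component of
    # the underlying undirected graph (checked here with a small union-find).
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in nodes}) <= 1

blocks = [(0, 0), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 2)]
print(has_eulerian_trail(blocks))             # False: bank 1 has 2 more out- than in-edges
print(has_eulerian_trail(blocks + [(0, 1)]))  # True: one added padding edge (0,1) suffices
```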
Without limitation, the block order (e.g., represented by the Eulerian trail) is generated (e.g., as described below) after determining the extension set of edges by means of integer linear programming (ILP). The ILP, e.g., an ILP description of the Eulerian trail requirements, provides the extension edge set. The extension set provided by the ILP description may be the extension set with absolute minimal cost. Alternatively or in addition, the following method embodiment (also referred to as greedy algorithm) gives comparably good results, e.g., as to the extension set, with reduced implementation and/or execution complexity. In a substep 1102, each of the code blocks is associated with memory bank requirements in terms of first and second memory banks indicated in the format (starting bank number, next free bank number).
A directed graph with one vertex representing each memory bank 702 (or pseudo-bank if there is code duplication, and without unmentioned memory banks) and a directed edge representing each block 704 is generated in a substep 1104. The directed edge points from the first memory bank (holding the first instruction of the corresponding block) to the second memory bank (indicated by the number of the next free space).
The underlying undirected graph is connected in substeps 1106 and 1108. In the substeps 1106 and 1108, the directedness of the edges is not taken into consideration. The substep 1106 includes determining vertex pairs having no path between them (using, for example, the Floyd-Warshall algorithm). In the substep 1108, the minimal cost edge that makes two unconnected vertices connected is added to the extension set of edges. The substeps 1106 and 1108 are repeated until the underlying undirected graph is fully connected.
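A possible sketch of these two substeps is given below. It uses a small union-find in place of the Floyd-Warshall reachability computation mentioned above as an example, and the candidate edge cost follows the padding cost relation (b - a) modulo the number of banks. The function and variable names are illustrative.

```python
def connect_components(nodes, edges, number_of_banks):
    """Add cheapest padding edges until the underlying undirected graph is connected."""
    extension = []

    def roots():
        parent = {v: v for v in nodes}
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for a, b in edges + extension:
            parent[find(a)] = find(b)
        return find

    while True:
        find = roots()
        if len({find(v) for v in nodes}) <= 1:
            return extension
        # cheapest directed padding edge joining two still-unconnected banks
        best = min(
            ((a, b) for a in nodes for b in nodes if find(a) != find(b)),
            key=lambda edge: (edge[1] - edge[0]) % number_of_banks,
        )
        extension.append(best)
```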
In substeps 1110 and 1112, the minimal cost edge is added to the extension set of edges. If there are multiple code blocks with the same minimal cost, one of them is arbitrarily chosen. The substeps 1110 and 1112 bring a balance for the in-edges and out-edges closer to the requirements for the Eulerian trail. According to the substep 1110, the substep 1112 is repeated until all requirements of the Eulerian trail are satisfied.
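One possible reading of this greedy step is sketched below: as long as the Eulerian trail requirements are violated, the cheapest padding edge is added from a bank that has more incoming than outgoing edges to a bank that has more outgoing than incoming edges. Applied to the blocks of Fig. 7, the sketch adds the single edge (0,1), matching the cost of 1 noted further below for the table 1200. The names are illustrative.

```python
from collections import Counter

def balance_for_trail(nodes, edges, number_of_banks):
    """Add minimal cost padding edges until the Eulerian trail requirements hold."""
    extension = []
    while True:
        out_deg = Counter(a for a, _ in edges + extension)
        in_deg = Counter(b for _, b in edges + extension)
        diff = {v: out_deg[v] - in_deg[v] for v in nodes}
        trail_ok = (
            all(abs(d) <= 1 for d in diff.values())
            and sum(1 for d in diff.values() if d == 1) <= 1
            and sum(1 for d in diff.values() if d == -1) <= 1
        )
        if trail_ok:
            return extension
        needs_out = [v for v in nodes if diff[v] < 0]  # more incoming than outgoing
        needs_in = [v for v in nodes if diff[v] > 0]   # more outgoing than incoming
        a, b = min(
            ((x, y) for x in needs_out for y in needs_in),
            key=lambda edge: (edge[1] - edge[0]) % number_of_banks,
        )
        extension.append((a, b))

blocks = [(0, 0), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 2)]
print(balance_for_trail([0, 1, 2, 3], blocks, 4))  # [(0, 1)] with a cost of 1
```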
The Eulerian trail is generated (for example by using Hierholzer's algorithm) in a substep 1114 of the step 204 of the method 200.
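A compact Hierholzer sketch for the substep 1114 follows; it returns the trail as a sequence of (first bank, second bank) pairs, i.e., the block order. Where several blocks share the same bank pair, they may be placed in the corresponding positions in any order. The function name is illustrative.

```python
from collections import defaultdict

def eulerian_trail(edges):
    """Hierholzer's algorithm on the extended edge multiset; returns the edge order."""
    remaining = defaultdict(list)
    out_deg, in_deg = defaultdict(int), defaultdict(int)
    for a, b in edges:
        remaining[a].append(b)
        out_deg[a] += 1
        in_deg[b] += 1
    # Start at the vertex with one more outgoing than incoming edge, if any;
    # otherwise (Eulerian circuit) any vertex with an outgoing edge works.
    start = edges[0][0]
    for v in set(out_deg) | set(in_deg):
        if out_deg[v] - in_deg[v] == 1:
            start = v
    stack, vertices = [start], []
    while stack:
        v = stack[-1]
        if remaining[v]:
            stack.append(remaining[v].pop())
        else:
            vertices.append(stack.pop())
    vertices.reverse()
    # Consecutive vertex pairs are the blocks (and padding blocks) in load order.
    return list(zip(vertices, vertices[1:]))

# For the Fig. 7 blocks extended with the padding edge (0, 1), one valid trail
# starts at bank 1, ends at bank 3 and uses every block exactly once.
blocks = [(0, 0), (0, 2), (1, 0), (1, 2), (1, 3), (2, 0), (2, 1), (2, 2)]
print(eulerian_trail(blocks + [(0, 1)]))
```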
The architecture may have additional constraints on the block order. E.g., it is possible that some instructions must not follow each other to avoid various chip-specific hazard situations. Such additional constraints are taken into consideration at the substep 1114. During the Eulerian trail buildup, a block that satisfies all the architectural requirements is chosen as the next block 704 (i.e., the next edge). If there is no such block at a given step, either backtracking or adding additional padding blocks is performed (as NOP instructions are usually not restricted, so they do not violate any constraints).
In the step 206, the code blocks including the selectively added padding blocks are loaded into the memory banks 162 in the block order determined by the edge order of the generated Eulerian trail.
In a variant with better cost reduction and slightly more complexity, the substeps 1106 and 1108 are implemented by adding, instead of only the one edge (a,b), all the edges of the minimal cost directed path between the nodes (i.e., memory banks) "a" and "b", which may or may not happen to be just the direct edge between them. The effect on the global edge balance will be the same, but the costs may be lower.
By way of example, instead of adding an edge from 0 to 3, three edges are added: An edge from 0 to 1, an edge from 1 to 2 and an edge from 2 to 3. The change in the in-out edge balance is the same, because the intermediate nodes on the path (at nodes 1 and 2) get an incoming and also an outgoing edge in each case. The number of outgoing edges of the node 0 will be increased by one compared to its incoming number, and the number of incoming edges of the node 3 will be increased by one compared to its outgoing number. If the path has a lower or equal cost compared to the direct edge from node 0 to node 3, the path shall be used instead.
The number of incoming edges and outgoing edges can be represented by a table, as is shown for above example in a table 1200 of Fig. 12. The left column indicates the node, i.e., the memory bank 702. The columns "in" and "out" indicate the respective number of edges, i.e., the number of blocks 704.
As is clarified by the exemplary table 1200, the condition for generating the Eulerian trail is satisfiable by adding an edge 0 → 1 with a cost of 1 (as pointed out above), so that in this case the greedy algorithm determines the optimal block order.
Practical results for the technique were obtained in an implementation for the EZChip NP4 network processor used in Ericsson Smart Services Routers (SSR). For the SSR code 102, the technique saves around 6% of instruction memory in 4-banked mode, and around 10% of instruction memory in 8-banked mode, compared to a
conventionally padded code without the at least partly matching block order. More specifically, for the 4-banked system 100, the conventionally padded code includes 6061 instructions and the optimally ordered code includes 5705 instructions. For the 8-banked system 100, the number of instructions is reduced from 8537 to 7664.
As has become apparent, at least some embodiments of the technique reduce development costs that are usually very high for platforms with limited resources (for example ASICs with small instruction memory) if the performance requirements are high. In these scenarios, chips can be used close to their limits, and the technique may be critical, e.g., for implementing further features in the code. The technique provides a way to push the limits further by automatically freeing up code space, e.g., in network routers.
At least some embodiments can free up instruction space, e.g., in certain ASIC designs, or utilize the instruction cache better in general-purpose CPUs. Same or further embodiments can be used with existing technologies, e.g., on already existing hardware. The technique can be operated automatically without human intervention.
In simple ASICs, the technique has no performance drawbacks during operation, e.g., since all steps of the method can be performed statically before or during code load, without any impact on runtime. In general-purpose CPUs, cache loading happens during operation, so that there is a trade-off between a possibly negative performance impact and a less-frequent cache loading due to the better utilization.
For complex ASICs operating on a large codebase, the technique is scalable. For example, the method may be implemented such that its computational complexity scales linearly with the number of instructions in the code.
Many advantages of the present invention will be fully understood from the foregoing description, and it will be apparent that various changes may be made in the form, construction and arrangement of the units and devices without departing from the scope of the invention and/or without sacrificing all of its advantages. Since the invention can be varied in many ways, it will be recognized that the invention is limited only by the scope of the following claims.

Claims

1. A method (200) of loading code (102) in a processor (160), the processor (160) including a pipeline of memory banks (162) for storing instructions (302) of the code (102) to be performed by the processor (160), the method comprising:
dividing (202) the code (102) into a plurality of blocks (304, 306, 308; 704); associating (204) with each of the plurality of blocks (304, 306, 308; 704) a first memory bank (702) of the pipeline for storing a first instruction of the
corresponding block (304, 306, 308; 704) and a second memory bank (702) that is in the pipeline subsequent to a memory bank for storing a last instruction of the corresponding block (304, 306, 308; 704); and
loading (206) each of the plurality of blocks (304, 306, 308; 704) starting at the first memory bank (702) associated with the corresponding block (304, 306, 308; 704) according to a block order that at least partly matches the second memory bank (702) associated with the corresponding block (304, 306, 308; 704) and the first memory bank (702) of another block (304, 306, 308; 704) subsequent to the corresponding block (304, 306, 308; 704) according to the block order.
2. The method of claim 1, wherein the code is divided based on branching instructions in the code (102), the branching instructions including jump instructions and destination instructions of the jump instructions.
3. The method of claim 2, wherein the jump instructions include at least one of an unconditional jump instruction and/or a conditional jump instruction wherein all conditions cause a jump.
4. The method of claim 2 or 3, wherein the first instruction includes a start instruction of the code (102) and/or one of the destination instructions.
5. The method of any one of claims 2 to 4, wherein the last instruction includes a termination instruction of the code (102) and/or one of the jump instructions.
6. The method of any one of claims 1 to 5, wherein the at least partly matching includes minimizing a sum of deviation between the second memory bank (702) of the corresponding block and the first memory bank (702) of the subsequent block.
7. The method of claim 6, the method further comprising:
selectively inserting (201; 1108) one or more no-operation instructions in the code (102) at the end of the corresponding block so that the second memory bank (702) of the corresponding block matches the first memory bank (702) of the subsequent block.
8. The method of claim 7, wherein the loading (206) includes statically storing or dynamically caching the instructions (302) in the processor (160).
9. The method of any one of claims 1 to 8, wherein the block order is
determined by means of a directed graph (700) including nodes representing the memory banks (702) and directed edges representing the blocks (704) according to the first and second memory banks.
10. The method of claim 9, wherein the block order corresponds to an edge order of an Eulerian trail in the directed graph (700).
11. A computer program product comprising instructions for performing the method of any one of claims 1 to 10 when executed on a computing device.
12. The computer program product of claim 11, stored on a computer-readable recording medium.
13. A device (150) for loading code (102) in a processor (160), the processor (160) including a pipeline of memory banks (162) for storing instructions (302) of the code (102) to be performed by the processor (160), the device comprising:
a dividing unit (152) adapted to divide the code (102) into a plurality of blocks (304, 306, 308; 704);
an associating unit (154) adapted to associate with each of the plurality of blocks (304, 306, 308; 704) a first memory bank (702) of the pipeline for storing a first instruction of the corresponding block (304, 306, 308; 704) and a second memory bank (702) that is in the pipeline subsequent to a memory bank for storing a last instruction of the corresponding block (304, 306, 308; 704); and
a loading unit (156) adapted to load each of the plurality of blocks (304, 306, 308; 704) starting at the first memory bank (702) associated with the corresponding block (304, 306, 308; 704) according to a block order that at least partly matches the second memory bank (702) associated with the corresponding block (304, 306, 308; 704) and the first memory bank (702) of another block (304, 306, 308; 704) subsequent to the corresponding block (304, 306, 308; 704) according to the block order.
14. The device of claim 13, wherein the dividing unit (152) is adapted to divide the code (102) based on branching instructions in the code (102), the branching instructions including jump instructions and destination instructions of the jump instructions.
15. The device of claim 13 or 14, wherein the associating unit (154) is further adapted to determine the block order by means of a directed graph (700) including nodes representing the memory banks (702) and directed edges representing the blocks (704) according to the first and second memory banks.
