WO2005036384A2 - Instruction encoding for vliw processors - Google Patents

Publication number
WO2005036384A2
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
instruction word
word
processing apparatus
issue
Prior art date
Application number
PCT/IB2004/052047
Other languages
French (fr)
Other versions
WO2005036384A3 (en
Inventor
Marco J. G. Bekooij
Alexander Augusteijn
Paul F. Hoogendijk
Original Assignee
Koninklijke Philips Electronics N.V.
Priority date
Filing date
Publication date
Application filed by Koninklijke Philips Electronics N.V.
Publication of WO2005036384A2
Publication of WO2005036384A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/3012 Organisation of register space, e.g. banked or distributed register file
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30145 Instruction analysis, e.g. decoding, instruction word fields
    • G06F9/30156 Special purpose encoding of instructions, e.g. Gray coding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3818 Decoding for concurrent execution
    • G06F9/3822 Parallel decoding, e.g. parallel decode units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3824 Operand accessing
    • G06F9/3826 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage
    • G06F9/3828 Bypassing or forwarding of data results, e.g. locally between pipeline stages or within a pipeline stage with global bypass, e.g. between pipelines, between clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • G06F9/3889 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
    • G06F9/3891 Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute organised in groups of units sharing resources, e.g. clusters

Definitions

  • TECHNICAL FIELD A processing apparatus, a method for processing data, a compiler program product and a computer program.
  • Computer architectures consist of a fixed data path, which is controlled by a set of control words. Each control word controls parts of the data path and these parts may comprise register addresses and operation codes for arithmetic logic units (ALUs) or other execution units. Each set of instructions generates a new set of control words, usually by means of an instruction decoder that translates the binary format of the instruction into the corresponding control word, or by means of a micro store, i.e. a memory that contains the control words directly.
  • a control word represents a RISC like operation, comprising an operation code, two operand register indices and a result register index. The operand register indices and the result register index refer to registers in a register file.
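  • As a minimal sketch of such a RISC-like control word, the following packs an operation code, two operand register indices and a result register index into 32 bits. The 8-bit field widths, the field order and the function names are assumptions chosen for illustration; the text does not specify a bit layout.

```python
# Hypothetical layout (not from the source): four 8-bit fields in one 32-bit word.
FIELD_BITS = 8
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack_control_word(opcode, src_a, src_b, dst):
    """Pack opcode, two operand register indices and a result register index."""
    for value in (opcode, src_a, src_b, dst):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("field does not fit in 8 bits")
    return (opcode << 24) | (src_a << 16) | (src_b << 8) | dst


def unpack_control_word(word):
    """Recover (opcode, src_a, src_b, dst) from a packed control word."""
    return (
        (word >> 24) & FIELD_MASK,
        (word >> 16) & FIELD_MASK,
        (word >> 8) & FIELD_MASK,
        word & FIELD_MASK,
    )
```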
  • VLIW: Very Large Instruction Word
  • a VLIW processor uses multiple, independent execution units to execute these multiple instructions in parallel.
  • the processor allows exploiting instruction-level parallelism in programs and thus executing more than one instruction at a time.
  • the compiler attempts to minimize the time needed to execute the program by optimizing parallelism.
  • the compiler combines instructions into a VLIW instruction word under the constraint that the instructions assigned to a single VLIW instruction word can be executed in parallel and under data dependency constraints.
  • Encoding of instructions can be done in two different ways, for a data stationary VLIW processor or for a time stationary VLIW processor, respectively.
  • in a data stationary VLIW processor, all information related to a given pipeline of operations to be performed on a given data item is encoded in a single VLIW instruction word.
  • in time stationary VLIW processors, the information related to a pipeline of operations to be performed on a given data item is spread over multiple instructions in different VLIW instruction words, thereby exposing said pipeline of the processor in the program.
  • the execution units will be active all together only rarely. Therefore, in some VLIW processors, fewer instructions are provided in each VLIW instruction word than would be needed for all the execution units together.
  • Each instruction is directed to a selected execution unit that has to be active, for example by using multiplexers. In this way it is possible to save on instruction memory size while hardly compromising performance.
  • instructions are directed to different execution units in different clock cycles.
  • the corresponding control words are issued to a respective issue slot of the VLIW issue register.
  • Each issue slot is associated with one or more execution units.
  • a particular control word is directed to a specific one among the execution units that are associated with the particular issue slot.
  • the encoding of parallel instructions in a VLIW instruction word leads to a severe increase of the code size. Large code size leads to an increase in program memory cost both in terms of required memory size and in terms of required memory bandwidth. In modern VLIW processors different measures are taken to reduce the code size.
  • NOP: no operation
  • the NOP operations can be encoded by single bits in a special header attached to the front of the VLIW instruction, resulting in a compressed VLIW instruction. Instruction bits may still be wasted in each instruction of a VLIW instruction, because some instructions can be encoded in a more compact way than others can.
  • a disadvantage of the compact representation of NOP operations is that the fields representing the instructions must be aligned to the right in the compressed VLIW instruction word.
  • complex and inherently slow variable length decoding logic is required, since the decoding logic requires a shift register in order to shift the instructions to their proper position in the decompressed instruction word. This holds especially for VLIW processors with a large number of issue slots, resulting in a very wide VLIW instruction word.
  • An object of the invention is to provide a processing apparatus, especially a VLIW processing apparatus, which allows an efficient encoding and decoding of instruction words, resulting in an increase of the performance of the processing apparatus.
  • a processing apparatus comprising a register file for storing data, and a plurality of issue slots, wherein each issue slot comprises at least one execution unit.
  • the processing apparatus is conceived for processing data, retrieved from the register file, under control of at least a first instruction word and a second instruction word.
  • the first instruction word is selected from a first instruction set, wherein the first instruction word encodes a plurality of instructions to be executed in parallel by the plurality of issue slots.
  • the second instruction word is selected from a second instruction set, wherein the second instruction word encodes at least one instruction to be executed by a subset of the plurality of issue slots.
  • Two instructions sets are used for encoding of instruction words, the first instruction set for wide instruction words and the second instruction set for small instruction words.
  • these instructions can be encoded in one or more instruction words of the second instruction set.
  • the decoding of the first instruction word becomes faster, since no shifting during decoding is required.
  • the decoding of the second instruction word is achieved with a relatively simple and fast decoder. As a result, an efficient encoding and decoding of instruction words is obtained.
  • EP 1050798 describes a processor supporting three instruction modes. In all modes, each fetch operation initiated to the program memory retrieves an instruction word of 128 bits in length. In a first instruction mode, during each machine cycle two 32 bit instructions are decoded. In a second instruction mode, during each machine cycle a pair of 16 bit instructions is decoded. In a third instruction mode, four 32 bit instructions are decoded during each machine cycle, in which case the processor behaves as a VLIW processor. The current instruction mode is held in a process status register at the decode unit and this instruction mode can be changed. An instruction mode signal is generated using the current value of the instruction mode.
  • a detector is able to detect a change in the instruction length when the processor is in the second instruction mode, which indicates that the subsequent instruction is of the first length, i.e. an instruction corresponding to the first instruction mode. In that case, the state of the instruction mode signal is temporarily altered to allow the first length instructions to be decoded without changing the instruction mode in the register. As a result, 32 bit instructions are allowed to be included in a sequence of 16 bit instructions. However, this document does not disclose the use of two instruction words having a different length, in order to efficiently encode and decode the instruction words.
  • US2002/0004897 describes a processor capable of executing instructions from multiple instruction sets.
  • the processor has a CPU for executing a primary instruction word and a processor status register, which contains an instruction set selector (ISS) for indicating a current instruction set of the instruction sets.
  • the processor also has a pre-decoder for translating instructions from the instruction set to the primary instruction word, and a decoder for decoding the primary instruction word.
  • the ISS will indicate a new instruction set mode. In case the instruction set is not equal to that of the primary instruction, the instruction is first pre-decoded by the pre-decoder and subsequently decoded.
  • An embodiment of the invention is characterized in that the processing apparatus further comprises a first instruction memory for storing the first instruction word, and a second instruction memory for storing the second instruction word.
  • the instruction words corresponding to the second instruction set are smaller than the instruction words corresponding to the first instruction set.
  • the processing apparatus further comprises a first decoder for decoding the first instruction word, and a second decoder for decoding the second instruction word.
  • the second decoder is relatively simple and fast, allowing a fast decoding of the second instruction words.
  • An embodiment of the invention is characterized in that the processing apparatus is a Very Large Instruction Word (VLIW) processor, wherein the first instruction word is a VLIW instruction word, and wherein the second instruction word has a width smaller than the first instruction word.
  • VLIW processor allows executing multiple instructions in parallel, increasing the overall speed of operation, while having relatively simple hardware.
  • the first instruction word is a compressed instruction word, comprising dedicated bits for encoding of NOP operations. The use of dedicated bits for encoding of NOP operations strongly reduces the code size of VLIW instructions, reducing the required memory size and bandwidth. Further embodiments of the invention are described in the dependent claims.
  • a method for processing data using a processing apparatus according to the invention is defined in claim 8.
  • a compiler program product arranged for generating a sequence of instruction words that can be executed by a processing apparatus according to the invention is defined in claim 9.
  • a computer program comprising computer program code means for instructing a processing apparatus to perform the steps of the method according to the invention is defined in claim 10.
  • Fig. 1 shows a schematic block diagram of a VLIW processor according to the invention.
  • Fig. 2 shows an embodiment of the instruction decoder DEC1 for a VLIW processor according to Figure 1.
  • Fig. 3 shows the encoding of the instruction words for a VLIW processor according to the invention.
  • a schematic block diagram illustrates a VLIW processor comprising a plurality of issue slots, including issue slots IS0, IS1, IS2, IS3, IS4 and IS5, and a register file, including register file segments RF0 and RF1.
  • the processor has a controller CTR and a connection network CN for coupling the register file segments RF0 and RF1, and the issue slots IS0, IS1, IS2, IS3, IS4 and IS5.
  • the controller CTR comprises a first decoder DEC1 for decoding of the VLIW instruction words.
  • the register file segments RF0 and RF1 are coupled to a bus, not shown in Fig.
  • Issue slots IS0, IS1, IS2, IS3, IS4 and IS5 represent issue slots with two execution units, requiring one or two operands and producing one result. Examples of execution units are an arithmetic and logic unit, a multiply-accumulate unit, a load/store unit or an application-specific unit. In different embodiments, one or more issue slots may contain a different number of execution units, or more complex functional units, which may require more than two operands and/or may produce more than one result.
  • Connection network CN allows passing of input data and result data between the register file segments RF0 and RF1, and the issue slots IS0, IS1, IS2, IS3, IS4 and IS5.
  • the register file segments RF0 and RF1 are distributed register files, i.e. several register files, each accessible by a limited set of issue slots for retrieving data from the register file. In other embodiments, there is one central register file for all issue slots IS0 - IS5.
  • An advantage of a distributed register file is that it requires less read and write ports per register file segment, resulting in a smaller register file area, decrease in power consumption and increase in speed of operation. Furthermore, it improves the scalability of the processor when compared to a central register file.
  • the connection network CN is a partially connected network, i.e. not each issue slot IS0 - IS5 is coupled to each register file RF0 and RF1 for writing of data to the register file.
  • the use of a partially connected communication network reduces the code size as well as the power consumption, and also allows increasing the performance of the processor. Furthermore, it improves the scalability of the processor when compared to a fully connected communication network.
  • the instruction decoder DEC1 comprises an address translation table ATT, instruction memories IM1 and IM2, a second decoder DEC2, blocks 201 - 221 of 32 AND gates, and blocks 223 - 233 of 32 OR gates.
  • the decoder generates a VLIW instruction word IW3, comprising six control words 235 - 245, each corresponding to a 32-bit wide instruction.
  • control words can be wider or smaller than 32 bits.
  • the VLIW instruction word IW3 is used for controlling a VLIW processor according to Figure 1, where control word 235 is mapped onto issue slot IS0, control word 237 is mapped onto issue slot IS1, and so on. Control word 245 is mapped onto issue slot IS5.
  • Instruction memory IM1 stores VLIW instruction words of a first instruction set, where each instruction word encodes six instructions. Each instruction has a width of 32 bits, meaning that the width of the instruction words stored in instruction memory IM1 is 192 bits.
  • Instruction memory IM2 stores VLIW instruction words of a second instruction set, where each instruction word comprises a single instruction of 32 bits wide.
  • a corresponding number of the issue slot IS0 - IS5 for which the instruction should be issued is stored at the same address.
  • in the address translation table ATT, a dedicated bit value is stored at each address, indicating whether the instruction referred to by the program counter PC should be fetched from instruction memory IM1 or instruction memory IM2.
  • a bit value of '0' indicates that an instruction has to be fetched from instruction memory IM1, whereas a bit value of '1' indicates that an instruction has to be fetched from instruction memory IM2.
  • a corresponding instruction memory address is stored at the same address of the address translation table ATT, indicating at which address of instruction memory IM1 or IM2, respectively, the instruction to be fetched is stored.
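  • The lookup through the address translation table ATT can be sketched as follows. This is an illustrative software model, not the hardware of the invention: the names AttEntry and fetch, and the representation of IM2 entries as (issue slot number, instruction) pairs, are assumptions. The bit convention ('0' selects IM1, '1' selects IM2) follows the description of ATT.

```python
from typing import NamedTuple


class AttEntry(NamedTuple):
    select: int  # 0: fetch from IM1 (wide words), 1: fetch from IM2 (small words)
    addr: int    # address within the selected instruction memory


def fetch(pc, att, im1, im2):
    """Return ("IW1", wide_word) or ("IW2", (slot, instruction)) for program counter pc."""
    entry = att[pc]
    if entry.select == 0:
        return "IW1", im1[entry.addr]
    # Assumed representation: IM2 stores the issue slot number next to the instruction.
    slot, instruction = im2[entry.addr]
    return "IW2", (slot, instruction)
```

  • For example, with att = [AttEntry(1, 0), AttEntry(0, 0)], the first fetch reads a single instruction from IM2 and the second reads a full six-instruction word from IM1.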
  • the controller CTR updates the value of the program counter PC in order to fetch a next instruction word.
  • the program counter PC points to an address of address translation table ATT.
  • the enable instruction signal EI is set to the value of '1', and this signal activates the internal decoder DEC2.
  • a logic converter NOT converts the value '1' of the enable instruction signal EI into a signal having the value '0'.
  • the corresponding value of ADDR, derived from the address translation table ATT, is used for selecting the address of instruction memory IM2 where the instruction word that has to be fetched is stored.
  • the internal decoder DEC2 reads the issue slot number IS from the address ADDR of instruction memory IM2, and the corresponding instruction word IW2 is sent to each block 201 - 209 of 32 AND ports. Within each block, the 32 bit values of the instruction word IW2 are used as input value for the first port of the respective 32 AND ports.
  • the local decoder DEC2 sets the issue slot select signal ISS equal to '1' for the line corresponding to the control word of the issue slot with issue slot number IS, and the remaining issue slot select signals ISS are set equal to '0'.
  • the issue slot select signal ISS for the block 201 of 32 AND gates is set equal to '1'.
  • the input value of the second input port of the 32 AND ports in the selected block is set equal to '1'.
  • the block 201 of 32 AND gates outputs the 32 bit values of instruction word IW2 to the first input ports of the 32 OR ports of the block 223 of 32 OR gates.
  • the input values of the second input port of the AND gates of all the blocks 203 - 209 of 32 AND gates is set to '0', since their corresponding value of the issue slot select signal ISS is equal to '0'.
  • the blocks 203 - 209 of 32 AND gates output a value '0' to the first input ports of the OR gates of all the blocks 225 - 233 of 32 OR gates.
  • the logic converter NOT inputs a '0' on the first input port of the AND gates of all the blocks 211 - 221 of 32 AND gates.
  • the bit values of the instruction word IW1 are all equal to zero, and these bit values are put on the second input ports of the AND gates of all blocks 211 - 221 of 32 AND gates.
  • All blocks 211 - 221 of 32 AND gates output a value of '0' to the second input port of the OR gates of all blocks 223 - 233 of 32 OR gates.
  • the 32 OR gates of the block 223 of OR gates output the bit values of the instruction word IW2, corresponding to control word 235 of VLIW instruction word IW3.
  • the OR gates of all blocks 225 - 233 of 32 OR gates output a value '0', and a NOP operation is encoded for the control words 237 - 245 of the VLIW instruction word IW3.
  • the instruction word IW2 is issued to the first issue slot IS0 of the VLIW processor.
  • the issue slot number IS has to be changed according to that issue slot, resulting in outputting the value of '1' for the issue slot select signal ISS to the second input ports of the AND gates of that block of the blocks 203 - 209 of 32 AND gates corresponding to the control word of that issue slot.
  • the instruction word IW2 is then mapped onto the corresponding control word 237 - 245.
  • if the bit value stored at the address of address translation table ATT referred to by the program counter PC indicates that an instruction word has to be fetched from instruction memory IM1, i.e. the bit value is equal to '0', the enable instruction signal EI is made equal to '0'.
  • the internal decoder DEC2 is not activated in this case.
  • a logic converter NOT converts the value '0' of the enable instruction signal EI into a signal having the value '1', and puts this as an input signal to the first input ports of the AND gates of all blocks 211 - 221 of 32 AND gates.
  • the corresponding value of ADDR, stored in the address translation table ATT, is used for selecting the address of instruction memory IM1 where the instruction word that has to be fetched is stored.
  • the instruction word stored at the address ADDR of IM1 is output as instruction word IW1 to the second input ports of the AND gates of the blocks 211 - 221 of 32 AND gates.
  • the first 32 bits of the instruction word IW1 are sent to the second input ports of the 32 AND gates of the block 211 of AND gates, the next 32 bits of the instruction word IW1 are sent to the second input ports of the 32 AND gates of the block 213 of AND gates, and so on.
  • the last 32 bits of the instruction word IW1 are sent to the second input ports of the 32 AND gates of the block 221 of AND gates.
  • the AND gates of all the blocks 211 - 221 of 32 AND gates output the corresponding bit values of the instruction word IW1 to the first input port of the corresponding OR gates of all the blocks 223 - 233 of OR gates.
  • all six instructions of instruction word IW1 are encoded in the VLIW instruction word IW3 and issued to the corresponding issue slots IS0 - IS5.
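  • The AND/OR gating described above can be modelled in software as a rough sketch. Each block of 32 gates is represented by one bitwise operation on a 32-bit integer; the function name decode_iw3 and the use of Python integers for the signals are assumptions made for illustration, and an all-zero control word stands for a NOP as in the description.

```python
NUM_SLOTS = 6
WORD_MASK = 0xFFFFFFFF  # one 32-bit control word


def decode_iw3(ei, iw1, iw2, slot):
    """Model the gating that produces the six control words of IW3.

    ei   -- enable instruction signal (1: small word from IM2, 0: wide word from IM1)
    iw1  -- list of six 32-bit instructions from IM1
    iw2  -- one 32-bit instruction from IM2
    slot -- issue slot number IS read by DEC2 (only used when ei == 1)
    """
    not_ei = 1 - ei  # output of the logic converter NOT
    iw3 = []
    for s in range(NUM_SLOTS):
        # One-hot issue slot select signal ISS; all zero when DEC2 is inactive.
        iss = 1 if (ei == 1 and s == slot) else 0
        small_path = iw2 & (WORD_MASK if iss else 0)        # AND block fed by IW2
        wide_path = iw1[s] & (WORD_MASK if not_ei else 0)   # AND block fed by IW1
        iw3.append(small_path | wide_path)                  # OR block
    return iw3
```

  • With ei equal to 1, only the selected slot carries IW2 and all other control words are zero; with ei equal to 0, the six instructions of IW1 pass through unchanged. No shifting is involved in either path.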
  • the encoding of the instruction words is shown for a VLIW processor according to the invention, having four issue slots.
  • The encoding of instruction words in a prior art VLIW processor is shown in instruction memory 301. Horizontally the issue slot number IS is shown, and vertically the instruction memory address ADDR.
  • the abbreviation "i" followed by a number refers to a certain instruction.
  • a NOP operation is encoded. On addresses '0' and '1' an instruction word is stored encoding only one instruction, for issue slots 1 and 2, respectively. On address '2' an instruction word is stored encoding three instructions, whereas on address '3' an instruction word is stored where instructions are encoded for all issue slots.
  • step 311 the instruction words having less than three instructions are rescheduled.
  • a different threshold value can be chosen, depending on the required encoding efficiency for the instruction words.
  • the instruction word stored at address 4 of table 301 is rescheduled, and the corresponding instructions i10 and i11 are encoded in two different instruction words.
  • the new instruction words are stored at addresses 4 and 5, respectively, of instruction memory 303.
  • the encoding of the instruction words stored at addresses 0, 1, 2, 3, 5 and 6 of instruction memory 301 remains unchanged.
  • step 313 the instruction words having one instruction are transferred to instruction memory 305, together with the number of the issue slot to which the instruction should be issued.
  • the instruction words having three or four instructions are transferred to instruction memory 307.
  • the address transfer table 309 is generated.
  • a value '0' stored in the left column of table 309 indicates that the next instruction word should be fetched from instruction memory 305.
  • a value stored in the right column of table 309, at the same address ADDR indicates the address of instruction memory 305 where the instruction word is stored.
  • a value '1' stored in the left column of table 309 indicates that the next instruction word should be fetched from instruction memory 307.
  • a value stored in the right column of table 309, at the same address ADDR indicates the address of instruction memory 307 where the instruction word is stored.
  • in instruction memory 305, instruction words are stored which encode one instruction that should be issued to a single issue slot, as in case of instruction memory IM2.
  • in instruction memory 307, instruction words are stored which encode an instruction for each issue slot, as in case of instruction memory IM1.
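  • Steps 311 and 313 can be sketched as a small compiler-side routine. This is a simplified illustration under stated assumptions: the function name split_program is hypothetical, a NOP is represented by None, every input word is assumed to hold at least one instruction, and data dependency checks during rescheduling are omitted. The table convention ('0' selects the single-instruction memory 305, '1' selects the wide memory 307) follows the description of table 309.

```python
def split_program(words, threshold=3):
    """Split VLIW words into a small memory, a wide memory and a translation table.

    words -- list of instruction words, each a list with one entry per issue slot
             (None marks a NOP).
    Returns (small_mem, wide_mem, table), where table holds (select, addr) pairs.
    """
    small_mem, wide_mem, table = [], [], []
    for word in words:
        used = [(slot, ins) for slot, ins in enumerate(word) if ins is not None]
        if not used:
            raise ValueError("sketch assumes at least one instruction per word")
        if len(used) < threshold:
            # Step 311: reschedule into one single-instruction word per instruction,
            # stored together with its issue slot number (step 313).
            for slot, ins in used:
                table.append((0, len(small_mem)))
                small_mem.append((slot, ins))
        else:
            # Words with enough instructions stay wide.
            table.append((1, len(wide_mem)))
            wide_mem.append(list(word))
    return small_mem, wide_mem, table
```

  • The threshold parameter corresponds to the chosen minimum number of instructions per wide word; lowering it keeps more words wide at the cost of storing more NOP operations.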
  • In case only a few instructions are encoded in a VLIW instruction word, as shown by some instruction words stored in instruction memory 301, a substantial amount of the memory size is used for only storing NOP operations. Compressing instruction words by using dedicated bits to encode NOP operations reduces the required memory size, but requires more complex decoding logic to decompress the compressed instruction words.
  • The decoding of the first and second instruction words is efficient as well, since the number of wide instruction words, i.e. the first instruction words, that have to be decoded is reduced, as is the number of NOP operations in the wide instruction words, and the decoding of the second instruction words requires only relatively simple logic.
  • An example of an application where encoding in small and wide instruction words is useful is a so-called folded for-loop. In a folded for-loop, instructions corresponding to different iterations of the for-loop are executed in parallel. An example is given in the table below, for a for-loop with N iterations and comprising instructions A, B and C. (Table columns: preamble, kernel, postamble.)
  • Instructions A, B and C are mapped onto issue slot IS0, IS1 and IS2, respectively.
  • the subscript of the instructions corresponds to the iteration number of the for-loop, where in the column "kernel" a range of iteration numbers is presented.
  • in a first instruction word only instruction A1 is encoded.
  • A2 and B1 are encoded in a second instruction word.
  • instructions A, B and C are encoded.
  • instructions BN and CN-1 are encoded.
  • In the last instruction word only instruction CN is encoded.
  • the first two instruction words are referred to as the preamble of the folded for-loop
  • the next N-2 instruction words are referred to as the kernel of the folded for-loop
  • the last two instruction words are referred to as the postamble of the folded for-loop.
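  • The structure of a folded for-loop can be reproduced with a short sketch that builds the schedule for N iterations of instructions A, B and C and splits it into preamble, kernel and postamble. The function name folded_schedule and the string labels such as "A1" are assumptions for illustration; None marks an issue slot holding a NOP.

```python
def folded_schedule(n, stages=("A", "B", "C")):
    """Return (preamble, kernel, postamble) for a folded for-loop with n iterations.

    Stage k of iteration i is issued in cycle i + k, so different iterations
    overlap in the same instruction word.
    """
    depth = len(stages)
    if n < depth:
        raise ValueError("sketch assumes at least as many iterations as stages")
    schedule = []
    for cycle in range(1, n + depth):
        word = []
        for k, name in enumerate(stages):
            iteration = cycle - k
            word.append(f"{name}{iteration}" if 1 <= iteration <= n else None)
        schedule.append(word)
    preamble = schedule[: depth - 1]    # words that are still filling up
    kernel = schedule[depth - 1 : n]    # words with every stage active
    postamble = schedule[n:]            # words that are draining
    return preamble, kernel, postamble
```

  • For n = 5 the preamble is [A1] and [A2, B1], the kernel runs from [A3, B2, C1] to [A5, B4, C3], and the postamble is [B5, C4] and [C5].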
  • the instructions corresponding to the kernel of the folded for-loop are encoded using a first instruction set, where three instructions A, B and C are encoded in a VLIW instruction word. These instructions A, B and C are executed in parallel by issue slots IS0 - IS2 of a VLIW processor.
  • the instructions A, B and C corresponding to the preamble and the postamble of the folded for-loop are encoded using a second instruction set, where each instruction is separately encoded in a VLIW instruction word.
  • Each VLIW instruction word corresponding to the second instruction set can be mapped onto a specific issue slot, which is required, for example, for VLIW processors with a distributed register file and a partially connected network. In such a processor not every execution unit is coupled to every register file. An output value produced by an instruction corresponding to the kernel of the folded for-loop is stored in a fixed register of a register file, as encoded in that instruction. For an instruction corresponding to the postamble of the folded for-loop, it is required that this instruction is executed by an execution unit that can access the output value produced by an instruction corresponding to the kernel.
  • the second VLIW instruction word IW2 encodes more than one instruction.
  • a second instruction can be encoded, that is issued to the same issue slot each time a second instruction word IW2 is decoded.
  • the other instruction is issued to the issue slot indicated by the issue slot number IS.
  • This embodiment is useful in case the instruction level parallelism in a certain part of the program is equal to two, which is too low for encoding the instructions in a first VLIW instruction word IW1; it reduces the number of second VLIW instruction words IW2 when compared to the case where these were only allowed to encode one instruction.
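  • A second instruction word carrying two instructions could be expanded as in the sketch below. The choice of issue slot 0 as the fixed slot, the function name expand_dual_iw2 and the rejection of a routed slot equal to the fixed slot are assumptions; the text only states that one instruction always goes to the same issue slot and the other to the slot given by the issue slot number IS.

```python
FIXED_SLOT = 0  # hypothetical: slot that always receives the second instruction


def expand_dual_iw2(fixed_instr, routed_instr, slot, num_slots=6):
    """Expand a two-instruction small word into a full word (None marks a NOP)."""
    if not 0 <= slot < num_slots:
        raise ValueError("issue slot number out of range")
    if slot == FIXED_SLOT:
        raise ValueError("routed instruction would collide with the fixed slot")
    word = [None] * num_slots
    word[FIXED_SLOT] = fixed_instr   # always issued to the same slot
    word[slot] = routed_instr        # issued to the slot indicated by IS
    return word
```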
  • a VLIW processor with both issue slots comprising generic execution units, e.g.
  • the instructions encoded in a first, i.e. wide, VLIW instruction word may differ in width, for example five instructions of 32 bits and one instruction of 64 bits are encoded in the first VLIW instruction word.
  • for an instruction requiring three operands instead of two operands, the instruction must have additional bits to encode the source register of the third operand, compared to an instruction encoding two operands.
  • the instruction of the second, i.e. small, VLIW instruction word must always have the same width as the widest instruction of the first VLIW instruction word, since this wide instruction may have to be encoded by a second VLIW instruction word if not sufficient instructions are available in that instruction word.
  • the instructions that require more than a chosen default number of input operands, or produce more than a default number of output operands are expanded during encoding of the second VLIW instruction words, by the compiler, in a chain of instructions with the default number of input or output operands.
  • an instruction requiring three operands and performing an addition operation that is encoded by a second VLIW instruction word, where two operands is chosen as the default value, is expanded into two instructions, each encoded by a second VLIW instruction word.
  • the first instruction has two input operands
  • the second instruction has one input operand.
  • the execution unit that executes the instructions comprises registers in which input operands can be stored.
  • the input operands corresponding to the first instruction are stored in these registers.
  • the input operands are fetched from these registers when the second instruction is executed, and the execution unit produces the result of adding the three operands.
  • the instruction words of a second instruction set must always have the same width as that of the widest instruction of the first instruction word, corresponding to a first instruction set, which may be costly in terms of instruction memory and which may reduce the performance of the processor.
  • the first instruction words are compressed VLIW instruction words. Single bits in a set of dedicated bits, stored in a field of the instruction word, encode the NOP operations.
  • a bit '0' refers to a NOP operation and the position of the bit in the field points to the instruction within the instruction word that holds the NOP operation.
  • a bit '1' refers to an instruction having a non-NOP operation, and the position of the bit in the field points to the instruction within the instruction word.
  • Compressed VLIW instruction words stored in instruction memory IM1 have to be decompressed before the instruction word IW1 can be output, which requires decompression logic, as known to the person skilled in the art.
  • the use of compressed VLIW instruction words reduces the size of the instruction memory IM1, but the decompression step reduces the overall performance of the VLIW processor.
  • a superscalar processor also comprises multiple issue slots that can perform multiple operations in parallel, as in case of a VLIW processor.
  • the processor hardware itself determines at runtime which operation dependencies exist and decides which operations to execute in parallel based on these dependencies, while ensuring that no resource conflicts will occur.
  • a VLIW processor may have more issue slots in comparison to a superscalar processor.
  • the hardware of a VLIW processor is less complicated in comparison to a superscalar processor, which results in a better scalable architecture.
  • the number of issue slots and the complexity of each issue slot will determine the amount of benefit that can be reached using the present invention.

Abstract

Data processing systems, for example VLIW processors, comprise a register file (RF0, RF1) for storing data, and a number of issue slots (IS0 - IS5), wherein each issue slot has at least one execution unit. The data processing system processes the data stored in the register file, under control of instruction words. Especially in case of a large number of issue slots, it is not always possible to issue an instruction to each issue slot. Therefore the instruction words are often compressed to save instruction memory. A disadvantage is that decoding these compressed instruction words requires complex logic. According to the invention, a first instruction word (IW1) and a second instruction word (IW2) are used. The first instruction word corresponds to a first instruction set, wherein the first instruction word encodes a plurality of instructions to be executed in parallel by the plurality of issue slots. The second instruction word corresponds to a second instruction set, wherein the second instruction word encodes at least one instruction to be executed by a single issue slot. As a result, instructions can be encoded more efficiently. The decoding of the first instruction word becomes faster, since less shifting during decoding is required. The decoding of the second instruction word is achieved with a relatively simple and fast decoder.

Description

Instruction encoding for VLIW processors
TECHNICAL FIELD A processing apparatus, a method for processing data, a compiler program product and a computer program.
BACKGROUND ART Computer architectures consist of a fixed data path, which is controlled by a set of control words. Each control word controls parts of the data path and these parts may comprise register addresses and operation codes for arithmetic logic units (ALUs) or other execution units. Each set of instructions generates a new set of control words, usually by means of an instruction decoder that translates the binary format of the instruction into the corresponding control word, or by means of a micro store, i.e. a memory that contains the control words directly. Typically, a control word represents a RISC like operation, comprising an operation code, two operand register indices and a result register index. The operand register indices and the result register index refer to registers in a register file. In case of a Very Large Instruction Word (VLIW) processor, multiple instructions are packaged into one long instruction word, a so-called VLIW instruction word. A VLIW processor uses multiple, independent execution units to execute these multiple instructions in parallel. The processor allows exploiting instruction-level parallelism in programs and thus executing more than one instruction at a time. In order for a software program to run on a VLIW processor, it must be translated into a set of VLIW instruction words. The compiler attempts to minimize the time needed to execute the program by optimizing parallelism. The compiler combines instructions into a VLIW instruction word under the constraint that the instructions assigned to a single VLIW instruction word can be executed in parallel and under data dependency constraints. Encoding of instructions can be done in two different ways, for a data stationary VLIW processor or for a time stationary VLIW processor, respectively. In case of a data stationary VLIW processor all information related to a given pipeline of operations to be performed on a given data item is encoded in a single VLIW instruction word. 
For time stationary VLIW processors, the information related to a pipeline of operations to be performed on a given data item is spread over multiple instructions in different VLIW instructions, thereby exposing said pipeline of the processor in the program. In practical applications, the execution units will be active all together only rarely. Therefore, in some VLIW processors, fewer instructions are provided in each VLIW instruction word than would be needed for all the execution units together. Each instruction is directed to a selected execution unit that has to be active, for example by using multiplexers. In this way it is possible to save on instruction memory size while hardly compromising performance. In this architecture, instructions are directed to different execution units in different clock cycles. The corresponding control words are issued to a respective issue slot of the VLIW issue register. Each issue slot is associated with one or more execution units. A particular control word is directed to a specific one among the execution units that are associated with the particular issue slot. The encoding of parallel instructions in a VLIW instruction word leads to a severe increase of the code size. Large code size leads to an increase in program memory cost both in terms of required memory size and in terms of required memory bandwidth. In modern VLIW processors different measures are taken to reduce the code size. One important example is the compact representation of no operation (NOP) operations in a data stationary VLIW processor, for example the NOP operations can be encoded by single bits in a special header attached to the front of the VLIW instruction, resulting in a compressed VLIW instruction. Instruction bits may still be wasted in each instruction of a VLIW instruction, because some instructions can be encoded in a more compact way than others can. 
A disadvantage of the compact representation of NOP operations is that the fields representing the instructions must be aligned to the right in the compressed VLIW instruction word. When decoding the compressed instruction word, complex and inherently slow variable length decoding logic is required, since the decoding logic requires a shift register in order to shift the instructions to their proper position in the decompressed instruction word. This holds especially for VLIW processors with a large number of issue slots, resulting in a very wide VLIW instruction word.
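The shifting cost described above can be illustrated with a minimal Python sketch. It is not taken from the patent: it assumes one header bit per issue slot ('1' for a real instruction, '0' for a NOP) and an all-zero field as the NOP encoding, and the names `compress` and `decompress` are hypothetical.

```python
NOP = 0  # assumption: an all-zero field stands for a NOP


def compress(slots):
    """Pack a full VLIW word (one field per issue slot) into (header, packed)."""
    header = [1 if ins != NOP else 0 for ins in slots]
    packed = [ins for ins in slots if ins != NOP]  # NOP fields are dropped
    return header, packed


def decompress(header, packed):
    """Expand a compressed word back to one field per issue slot.

    The target slot of each packed field depends on the header bits,
    which is what forces variable-distance shifting in hardware.
    """
    slots, next_field = [], 0
    for bit in header:
        if bit:
            slots.append(packed[next_field])
            next_field += 1
        else:
            slots.append(NOP)
    return slots


word = [NOP, 0x11, NOP, NOP, 0x22, NOP]
header, packed = compress(word)
restored = decompress(header, packed)
```

In this sketch the second packed field travels from position 1 to slot 4, while the first moves from position 0 to slot 1; the distance differs per word, which is the variable-length behaviour the text identifies as slow for wide instruction words.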
DISCLOSURE OF INVENTION An object of the invention is to provide a processing apparatus, especially a VLIW processing apparatus, which allows an efficient encoding and decoding of instruction words, resulting in an increase of the performance of the processing apparatus. This object is achieved with a processing apparatus, comprising a register file for storing data, and a plurality of issue slots, wherein each issue slot comprises at least one execution unit. The processing apparatus is conceived for processing data, retrieved from the register file, under control of at least a first instruction word and a second instruction word. The first instruction word is selected from a first instruction set, wherein the first instruction word encodes a plurality of instructions to be executed in parallel by the plurality of issue slots. The second instruction word is selected from a second instruction set, wherein the second instruction word encodes at least one instruction to be executed by a subset of the plurality of issue slots. Two instruction sets are used for encoding of instruction words, the first instruction set for wide instruction words and the second instruction set for small instruction words. If, during encoding of an instruction word, too few instructions are available that can be executed in parallel to encode a first instruction word efficiently, these instructions can be encoded in one or more instruction words of the second instruction set. The decoding of the first instruction word becomes faster, since no shifting during decoding is required. The decoding of the second instruction word is achieved with a relatively simple and fast decoder. As a result, an efficient encoding and decoding of instruction words is obtained. EP 1050798 describes a processor supporting three instruction modes. In all modes, each fetch operation initiated to the program memory retrieves an instruction word of 128 bits in length. 
In a first instruction mode, during each machine cycle two 32 bit instructions are decoded. In a second instruction mode, during each machine cycle a pair of 16 bit instructions is decoded. In a third instruction mode, four 32 bit instructions are decoded during each machine cycle, in which case the processor behaves as a VLIW processor. The current instruction mode is held in a process status register at the decode unit and this instruction mode can be changed. An instruction mode signal is generated using the current value of the instruction mode. A detector is able to detect a change in the instruction length when the processor is in the second instruction mode, which indicates that the subsequent instruction is of the first length, i.e. an instruction corresponding to the first instruction mode. In that case, the state of the instruction mode signal is temporarily altered to allow the first length instructions to be decoded without changing the instruction mode in the register. As a result, 32 bit instructions are allowed to be included in a sequence of 16 bit instructions. However, this document does not disclose the use of two instruction words having a different length, in order to efficiently encode and decode the instruction words. US2002/0004897 describes a processor capable of executing instructions from multiple instruction sets. The processor has a CPU for executing a primary instruction word and a processor status register, which contains an instruction set selector (ISS) for indicating a current instruction set of the instruction sets. The processor also has a pre-decoder for translating instructions from the instruction set to the primary instruction word, and a decoder for decoding the primary instruction word. When an instruction set switch occurs, the ISS will indicate a new instruction set mode. 
In case the instruction set is not equal to that of the primary instruction, the instruction is first pre-decoded by the pre-decoder and subsequently decoded. However, this document does not disclose the use of two instruction words having a different length that both are decoded in a single decoding step. An embodiment of the invention is characterized in that the processing apparatus further comprises a first instruction memory for storing the first instruction word, and a second instruction memory for storing the second instruction word. The instruction words corresponding to the second instruction set are smaller than the instruction words corresponding to the first instruction set. By storing the second instruction word in a separate memory, the total memory size of the instruction memory is reduced. An embodiment of the invention is characterized in that the processing apparatus further comprises a first decoder for decoding the first instruction word, and a second decoder for decoding the second instruction word. The second decoder is relatively simple and fast, allowing a fast decoding of the second instruction words. An embodiment of the invention is characterized in that the processing apparatus is a Very Large Instruction Word (VLIW) processor, wherein the first instruction word is a VLIW instruction word, and wherein the second instruction word has a width smaller than the first instruction word. A VLIW processor allows executing multiple instructions in parallel, increasing the overall speed of operation, while having relatively simple hardware. An embodiment of the invention is characterized in that the first instruction word is a compressed instruction word, comprising dedicated bits for encoding of NOP operations. The use of dedicated bits for encoding of NOP operations strongly reduces the code size of VLIW instructions, reducing the required memory size and bandwidth. 
Further embodiments of the invention are described in the dependent claims. A method for processing data using a processing apparatus according to the invention is defined in claim 8. A compiler program product arranged for generating a sequence of instruction words that can be executed by a processing apparatus according to the invention is defined in claim 9. A computer program comprising computer program code means for instructing a processing apparatus to perform the steps of the method according to the invention is defined in claim 10.
BRIEF DESCRIPTION OF THE DRAWINGS Fig. 1 shows a schematic block diagram of a VLIW processor according to the invention. Fig. 2 shows an embodiment of the instruction decoder DEC1 for a VLIW processor according to Figure 1. Fig. 3 shows the encoding of the instruction words for a VLIW processor according to the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS Referring to Fig. 1, a schematic block diagram illustrates a VLIW processor comprising a plurality of issue slots, including issue slots IS0, IS1, IS2, IS3, IS4 and IS5, and a register file, including register file segments RF0 and RF1. The processor has a controller CTR and a connection network CN for coupling the register file segments RF0 and RF1 and the issue slots IS0, IS1, IS2, IS3, IS4 and IS5. The controller CTR comprises a first decoder DEC1 for decoding of the VLIW instruction words. The register file segments RF0 and RF1 are coupled to a bus, not shown in Fig. 1, and via this bus the register file segments receive input data. Issue slots IS0, IS1, IS2, IS3, IS4 and IS5 represent issue slots with two execution units, requiring one or two operands and producing one result. Examples of execution units are an arithmetic and logic unit, a multiply-accumulate unit, a load/store unit or an application-specific unit. In different embodiments, one or more issue slots may contain a different number of execution units, or more complex functional units, which may require more than two operands and/or may produce more than one result. Connection network CN allows passing of input data and result data between the register file segments RF0 and RF1 and the issue slots IS0, IS1, IS2, IS3, IS4 and IS5. In some embodiments, the register file segments RF0 and RF1 are distributed register files, i.e. several register files, each accessible by a limited set of issue slots for retrieving data from the register file. In other embodiments, there is one central register file for all issue slots IS0 - IS5. An advantage of a distributed register file is that it requires fewer read and write ports per register file segment, resulting in a smaller register file area, a decrease in power consumption and an increase in speed of operation. Furthermore, it improves the scalability of the processor when compared to a central register file. 
In some embodiments, the connection network CN is a partially connected network, i.e. not each issue slot IS0 - IS5 is coupled to each register file RF0 and RF1 for writing of data to the register file. The use of a partially connected communication network reduces the code size as well as the power consumption, and also allows increasing the performance of the processor. Furthermore, it improves the scalability of the processor when compared to a fully connected communication network.
Referring to Figure 2, an embodiment of the instruction decoder DEC1 for a VLIW processor according to Figure 1 is shown. The instruction decoder DEC1 comprises an address translation table ATT, instruction memories IM1 and IM2, a second decoder DEC2, blocks 201 - 221 of 32 AND gates, and blocks 223 - 233 of 32 OR gates. In this embodiment, the decoder generates a VLIW instruction word IW3, comprising six control words 235 - 245, each corresponding to a 32-bit wide instruction. In alternative embodiments, the control words can be wider or smaller than 32 bits. The VLIW instruction word IW3 is used for controlling a VLIW processor according to Figure 1, where control word 235 is mapped onto issue slot IS0, control word 237 is mapped onto issue slot IS1, and so on. Control word 245 is mapped onto issue slot IS5. Instruction memory IM1 stores VLIW instruction words of a first instruction set, where each instruction word encodes six instructions. Each instruction has a width of 32 bits, meaning that the width of the instruction words stored in instruction memory IM1 is 192 bits. Instruction memory IM2 stores VLIW instruction words of a second instruction set, where each instruction word comprises a single instruction of 32 bits wide. For each instruction word stored in instruction memory IM2 at a given address, a corresponding number of the issue slot IS0 - IS5 to which the instruction should be issued is stored at the same address. 
In the address translation table ATT a dedicated bit value is stored at each address, indicating whether the instruction referred to by the program counter PC should be fetched from instruction memory IM1 or instruction memory IM2. A bit value of '0' indicates that an instruction has to be fetched from instruction memory IM1, whereas a bit value of '1' indicates that an instruction has to be fetched from instruction memory IM2. Together with the bit value a corresponding instruction memory address is stored at the same address of the address translation table ATT, indicating at which address of instruction memory IM1 or IM2, respectively, the instruction to be fetched is stored. During execution of a program, the controller CTR updates the value of the program counter PC in order to fetch a next instruction word. The program counter PC points to an address of address translation table ATT. In case the bit value stored at the address of address translation table ATT referred to by the program counter PC indicates that an instruction word has to be fetched from instruction memory IM2, i.e. the bit value is equal to '1', the enable instruction signal EI is set to the value of '1', and this signal activates the internal decoder DEC2. A logic converter NOT converts the value '1' of the enable instruction signal EI into a signal having the value '0'. The corresponding value of ADDR, derived from the address translation table ATT, is used for selecting the address of instruction memory IM2 where the instruction word that has to be fetched is stored. The internal decoder DEC2 reads the issue slot number IS from the address ADDR of instruction memory IM2, and the corresponding instruction word IW2 is sent to each block 201 - 209 of 32 AND gates. Within each block, the 32 bit values of the instruction word IW2 are used as input value for the first input port of the respective 32 AND gates. 
The local decoder DEC2 sets the issue slot select signal ISS equal to '1' for the line corresponding to the control word of the issue slot with issue slot number IS, and the remaining issue slot select signals ISS are set equal to '0'. In case the issue slot select signal ISS for the block 201 of 32 AND gates is set equal to '1', the input value of the second input port of the 32 AND gates in the selected block is set equal to '1'. As a result, the block 201 of 32 AND gates outputs the 32 bit values of instruction word IW2 to the first input ports of the 32 OR gates of the block 223 of 32 OR gates. The input values of the second input ports of the AND gates of all the blocks 203 - 209 of 32 AND gates are set to '0', since their corresponding value of the issue slot select signal ISS is equal to '0'. The blocks 203 - 209 of 32 AND gates output a value '0' to the first input ports of the OR gates of all the blocks 225 - 233 of 32 OR gates. The logic converter NOT inputs a '0' on the first input port of the AND gates of all the blocks 211 - 221 of 32 AND gates. The bit values of the instruction word IW1 are all equal to zero, and these bit values are put on the second input ports of the AND gates of all blocks 211 - 221 of 32 AND gates. All blocks 211 - 221 of 32 AND gates output a value of '0' to the second input port of the OR gates of all blocks 223 - 233 of 32 OR gates. The 32 OR gates of the block 223 of OR gates output the bit values of the instruction word IW2, corresponding to control word 235 of VLIW instruction word IW3. The OR gates of all blocks 225 - 233 of 32 OR gates output a value '0', and a NOP operation is encoded for the control words 237 - 245 of the VLIW instruction word IW3. As a result, the instruction word IW2 is issued to the first issue slot IS0 of the VLIW processor. 
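The AND/OR gating can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the circuit of Figure 2: each block of 32 gates is represented by one bitwise operation on a 32-bit integer, the enable signal is passed in as the parameter `enable`, one IW2-side AND block is assumed per issue slot, and the function name `decode_iw3` is hypothetical.

```python
SLOTS = 6
ALL_ONES = 0xFFFFFFFF  # one signal fanned out to 32 gates


def decode_iw3(enable, iw1, iw2, issue_slot):
    """Return the six 32-bit control words of IW3.

    enable     -- 1: a small word IW2 was fetched, 0: a wide word IW1
    iw1        -- list of six 32-bit fields (all zero when enable == 1)
    iw2        -- one 32-bit instruction (zero when enable == 0)
    issue_slot -- issue slot number IS stored alongside IW2
    """
    not_enable = ALL_ONES if enable == 0 else 0  # output of the NOT converter
    control_words = []
    for slot in range(SLOTS):
        # Issue slot select signal: '1' only for the addressed slot.
        select = ALL_ONES if (enable == 1 and slot == issue_slot) else 0
        small_path = iw2 & select           # AND block fed by IW2
        wide_path = iw1[slot] & not_enable  # AND block fed by IW1
        control_words.append(small_path | wide_path)  # OR block
    return control_words


small = decode_iw3(1, [0] * SLOTS, 0x0000ABCD, 0)
wide = decode_iw3(0, [1, 2, 3, 4, 5, 6], 0, 0)
```

With `enable` set to 1, only the selected control word carries IW2 and the others are zero, which the text treats as NOP operations. No field is shifted by a data-dependent amount in either path.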
In case the instruction should be issued to one of the issue slots IS1 - IS5, the issue slot number IS has to be changed according to that issue slot, resulting in outputting the value of '1' for the issue slot select signal ISS to the second input ports of the AND gates of that block of the blocks 203 - 209 of 32 AND gates corresponding to the control word of that issue slot. The instruction word IW2 is then mapped onto the corresponding control word 237 - 245. In case the bit value stored at the address of address translation table ATT referred to by the program counter PC indicates that an instruction word has to be fetched from instruction memory IM1, i.e. the bit value is equal to '0', the enable instruction signal EI is made equal to '0'. The internal decoder DEC2 is not activated in this case. A logic converter NOT converts the value '0' of the enable instruction signal EI into a signal having the value '1', and puts this as an input signal to the first input ports of the AND gates of all blocks 211 - 221 of 32 AND gates. The corresponding value of ADDR, stored in the address translation table ATT, is used for selecting the address of instruction memory IM1 where the instruction word that has to be fetched is stored. The instruction word stored at the address ADDR of IM1 is output as instruction word IW1 to the second input ports of the AND gates of the blocks 211 - 221 of 32 AND gates. The first 32 bits of the instruction word IW1 are sent to the second input ports of the 32 AND gates of the block 211 of AND gates, the next 32 bits of the instruction word IW1 are sent to the second input ports of the 32 AND gates of the block 213 of AND gates, and so on. The last 32 bits of the instruction word IW1 are sent to the second input ports of the 32 AND gates of the block 221 of AND gates. 
The AND gates of all the blocks 211 - 221 of 32 AND gates output the corresponding bit values of the instruction word IW1 to the first input port of the corresponding OR gates of all the blocks 223 - 233 of OR gates. On all input ports of the AND gates of all blocks 201 - 209 of 32 AND gates a value of '0' is present, since the value of the issue slot select signal ISS is equal to '0', and all bits of the instruction word IW2 are equal to '0' as well. The AND gates of all blocks 201 - 209 of 32 AND gates output a value of '0' to the second input port of the OR gates of all blocks 223 - 233 of 32 OR gates. The OR gates of all the blocks 223 - 233 of 32 OR gates output the corresponding bit values of the instruction word IW1. The first 32 bits of the instruction word IW1, output by the OR gates of the block 223 of 32 OR gates, are mapped onto the first control word 235 of the VLIW instruction word IW3, and so on. The last 32 bits of the instruction word IW1, output by the OR gates of the block 233 of 32 OR gates, are mapped onto the last control word 245 of the VLIW instruction word IW3. As a result, all six instructions of instruction word IW1 are encoded in the VLIW instruction word IW3 and issued to the corresponding issue slots IS0 - IS5.
Referring to Figure 3, the encoding of the instruction words is shown for a VLIW processor according to the invention, having four issue slots. The encoding of instruction words in a prior art VLIW processor is shown in instruction memory 301. Horizontally the issue slot number IS is shown, and vertically the instruction memory address ADDR. The abbreviation "i" followed by a number refers to a certain instruction. In fields where no instruction is scheduled, a NOP operation is encoded. At addresses '0' and '1', instruction words are stored that each encode only one instruction, for issue slots 1 and 2, respectively. 
At address '2' an instruction word is stored encoding three instructions, whereas at address '3' an instruction word is stored where instructions are encoded for all issue slots. In step 311, the instruction words having fewer than three instructions are rescheduled. In alternative embodiments a different threshold value can be chosen, depending on the required encoding efficiency for the instruction words. The instruction word stored at address 4 of table 301 is rescheduled, and the corresponding instructions i10 and i11 are encoded in two different instruction words. The new instruction words are stored at addresses 4 and 5, respectively, of instruction memory 303. The encoding of the instruction words stored at addresses 0, 1, 2, 3, 5 and 6 of instruction memory 301 remains unchanged. In step 313 the instruction words having one instruction are transferred to instruction memory 305, together with the number of the issue slot to which the instruction should be issued. In step 315 the instruction words having three or four instructions are transferred to instruction memory 307. In step 317 the address transfer table 309 is generated. For a given address ADDR, a value '0' stored in the left column of table 309 indicates that the next instruction word should be fetched from instruction memory 305. A value stored in the right column of table 309, at the same address ADDR, indicates the address of instruction memory 305 where the instruction word is stored. For a given address ADDR, a value '1' stored in the left column of table 309 indicates that the next instruction word should be fetched from instruction memory 307. A value stored in the right column of table 309, at the same address ADDR, indicates the address of instruction memory 307 where the instruction word is stored. In instruction memory 305 instruction words are stored which encode one instruction that should be issued to a single issue slot, as in case of instruction memory IM2. 
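Steps 311 to 317 can be sketched as one pass over the scheduled program. The Python below is an illustration under assumptions, not the compiler of the patent: a word is a list with `None` for an unscheduled slot, every word holds at least one instruction, issue slots are numbered from 0, and the function name `split_program` and the sample instruction placement are hypothetical. The flag values follow the convention given for table 309 ('0' for the small memory, '1' for the wide memory).

```python
def split_program(words, threshold=3):
    """Split scheduled VLIW words into a small memory, a wide memory and a table.

    Returns (small_mem, wide_mem, table) where
      small_mem -- list of (issue_slot, instruction) pairs
      wide_mem  -- list of full words
      table     -- list of (flag, address): flag 0 = small_mem, 1 = wide_mem
    """
    small_mem, wide_mem, table = [], [], []
    for word in words:
        used = [(slot, ins) for slot, ins in enumerate(word) if ins is not None]
        if len(used) < threshold:
            # Reschedule: one single-instruction word per instruction.
            for slot, ins in used:
                table.append((0, len(small_mem)))
                small_mem.append((slot, ins))
        else:
            table.append((1, len(wide_mem)))
            wide_mem.append(list(word))
    return small_mem, wide_mem, table


program = [
    [None, "i1", None, None],
    [None, None, "i2", None],
    ["i3", "i4", None, "i5"],
    ["i6", "i7", "i8", "i9"],
    ["i10", None, "i11", None],
]
small_mem, wide_mem, table = split_program(program)
```

A word with two instructions becomes two sequential small words, so it costs one extra cycle but stores two narrow fields instead of a mostly empty wide word. Raising or lowering `threshold` trades cycles against memory in the way the text describes for alternative embodiments.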
In instruction memory 307 instruction words are stored which encode an instruction for each issue slot, as in case of instruction memory IM1. In case only a few instructions are encoded in a VLIW instruction word, as shown by some instruction words stored in instruction memory 301, a substantial amount of the memory size is used for only storing NOP operations. Compressing instruction words by using dedicated bits to encode NOP operations reduces the required memory size, but requires more complex decoding logic to decompress the compressed instruction words. By using two instruction words, corresponding to two different instruction sets, a more efficient encoding of operations is achieved, as shown by instruction memories 305 and 307. The decoding of the first and second instruction words is efficient as well, since the number of wide instruction words, i.e. the first instruction words, that have to be decoded is reduced as well as the number of NOP operations in the wide instruction words, and the decoding of the second instruction words only requires relatively simple logic. An example of an application where encoding in small and wide instruction words is useful is a so-called folded for-loop. In a folded for-loop, instructions corresponding to different iterations of the for-loop are executed in parallel. An example is given in the table below, for a for-loop with N iterations and comprising instructions A, B and C.
         preamble     kernel          postamble
IS0      A1  A2       A3 ... AN
IS1          B1       B2 ... BN-1     BN
IS2                   C1 ... CN-2     CN-1  CN
Instructions A, B and C are mapped onto issue slots IS0, IS1 and IS2, respectively. The subscript of the instructions corresponds to the iteration number of the for-loop, where in the column "kernel" a range of iteration numbers is presented. In a first instruction word only instruction A1 is encoded. In a second instruction word instructions A2 and B1 are encoded. In the next N-4 instruction words, instructions A, B and C, corresponding to a different iteration number, are encoded. In the instruction word N-2, instructions BN and CN-1 are encoded. In the last instruction word only instruction CN is encoded. The first two instruction words are referred to as the preamble of the folded for-loop, the next N-4 instruction words are referred to as the kernel of the folded for-loop and the last two instruction words are referred to as the postamble of the folded for-loop. The instructions corresponding to the kernel of the folded for-loop are encoded using a first instruction set, where three instructions A, B and C are encoded in a VLIW instruction word. These instructions A, B and C are executed in parallel by issue slots IS0 - IS2 of a VLIW processor. The instructions A, B and C corresponding to the preamble and the postamble of the folded for-loop are encoded using a second instruction set, where each instruction is separately encoded in a VLIW instruction word. These instructions A, B and C are executed sequentially. Each VLIW instruction word corresponding to the second instruction set can be mapped onto a specific issue slot, which is required, for example, for VLIW processors with a distributed register file and a partially connected network. In such a processor not every execution unit is coupled to every register file. An output value produced by an instruction corresponding to the kernel of the folded for-loop is stored in a fixed register of a register file, as encoded in that instruction. 
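The shape of a folded for-loop schedule can be generated with a short Python sketch. It rests on assumptions that the text does not state explicitly: each of A, B and C occupies exactly one instruction word per iteration, B lags A by one word and C lags A by two words. The names `folded_schedule` and `classify` are hypothetical.

```python
def folded_schedule(n):
    """Return the instruction words of a folded loop with n iterations.

    Word k (1-based) holds A of iteration k, B of iteration k-1 and
    C of iteration k-2, where those iterations exist; otherwise None.
    """
    words = []
    for k in range(1, n + 3):
        a = f"A{k}" if 1 <= k <= n else None
        b = f"B{k - 1}" if 1 <= k - 1 <= n else None
        c = f"C{k - 2}" if 1 <= k - 2 <= n else None
        words.append((a, b, c))
    return words


def classify(word):
    """A word is kernel when all three issue slots are filled."""
    a, b, c = word
    if a and b and c:
        return "kernel"
    return "preamble" if a else "postamble"


words = folded_schedule(5)
phases = [classify(w) for w in words]
```

Under these assumptions the full words form the kernel and are candidates for the wide instruction set, while the partially filled words at both ends would be split into single-instruction words of the second instruction set. The exact word counts depend on how iterations are indexed, so the counts produced here should not be read as the counts given in the text.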
For an instruction corresponding to the postamble of the folded for-loop, it is required that this instruction is executed by an execution unit that can access the output value produced by an instruction corresponding to the kernel.
In alternative embodiments, the second VLIW instruction word IW2 encodes more than one instruction. For example, a second instruction can be encoded that is issued to the same issue slot each time a second instruction word IW2 is decoded. The other instruction is issued to the issue slot indicated by the issue slot number IS. This embodiment is useful when the instruction-level parallelism in a certain part of the program is equal to two: this is too low for encoding the instructions in a first VLIW instruction word IW1, but encoding two instructions per second VLIW instruction word IW2 reduces the number of such words compared to the case where each were allowed to encode only one instruction. As a further example, in the case of a VLIW processor having both issue slots comprising generic execution units, e.g. an arithmetic and logic unit or a multiply-accumulate unit, and issue slots comprising execution units dedicated to a certain function, more than two instructions are encoded in the second VLIW instruction word IW2. A first set of instructions of the second VLIW instruction word IW2 is issued to fixed issue slots comprising generic execution units, and one instruction of the second VLIW instruction word IW2 is issued to a specific one of the issue slots comprising dedicated execution units, as indicated by the issue slot number IS.
In a further alternative embodiment, the instructions encoded in a first, i.e. wide, VLIW instruction word may differ in width; for example, five instructions of 32 bits and one instruction of 64 bits are encoded in the first VLIW instruction word.
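The alternative embodiment in which a second instruction word IW2 carries one instruction for a fixed issue slot and one instruction routed by the issue slot number IS can be sketched as follows. The class `NarrowWord`, the function `dispatch`, the constants `NUM_SLOTS` and `FIXED_SLOT`, and the rejection of a routed instruction that targets the fixed slot are all hypothetical choices for this illustration; the text does not specify them.

```python
from dataclasses import dataclass
from typing import List, Optional

NUM_SLOTS = 6   # assumption: issue slots IS0..IS5
FIXED_SLOT = 0  # assumption: slot that always receives the fixed instruction


@dataclass
class NarrowWord:
    routed_op: str                  # instruction issued to slot `slot_number`
    slot_number: int                # the issue slot number IS
    fixed_op: Optional[str] = None  # optional instruction for FIXED_SLOT


def dispatch(word: NarrowWord) -> List[str]:
    """Expand a narrow word into one operation per issue slot,
    filling every slot that is not addressed with a NOP."""
    if not 0 <= word.slot_number < NUM_SLOTS:
        raise ValueError("issue slot number out of range")
    if word.fixed_op is not None and word.slot_number == FIXED_SLOT:
        raise ValueError("routed instruction collides with the fixed slot")
    issue = ["NOP"] * NUM_SLOTS
    issue[word.slot_number] = word.routed_op
    if word.fixed_op is not None:
        issue[FIXED_SLOT] = word.fixed_op
    return issue
```

For instance, `dispatch(NarrowWord("mul", 3, "add"))` places `add` in the fixed slot and `mul` in slot 3, leaving the remaining slots idle for that cycle.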
An instruction requiring three operands instead of two, for example, must have additional bits to encode the source register of the third operand, compared to an instruction encoding two operands. When the instructions in the first VLIW instruction word have different widths, the instruction of the second, i.e. small, VLIW instruction word must always have the same width as the widest instruction of the first VLIW instruction word, since this wide instruction may have to be encoded by a second VLIW instruction word if too few instructions are available for that instruction word.
To avoid this requirement, the instructions that require more than a chosen default number of input operands, or produce more than a default number of output operands, are expanded by the compiler, during encoding of the second VLIW instruction words, into a chain of instructions with the default number of input or output operands. For example, an instruction requiring three operands and performing an addition operation, that is encoded by a second VLIW instruction word, where two operands is chosen as the default value, is expanded into two instructions, each encoded by a second VLIW instruction word. The first instruction has two input operands, and the second instruction has one input operand. The execution unit that executes the instructions comprises registers in which input operands can be stored. The input operands corresponding to the first instruction are stored in these registers. The input operands are fetched from these registers when the second instruction is executed, and the execution unit produces the result of adding the three operands.
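The expansion of a three-operand addition into a chain of narrower instructions, as described above, can be sketched in Python. The `.latch` opcode suffix, the function `expand`, the class `AddUnit` and the use of a dictionary as register file are assumptions for illustration only; the text states just that the first instruction stores its operands in registers of the execution unit and the second instruction completes the addition.

```python
from typing import Dict, List, Optional, Sequence, Tuple

DEFAULT_OPERANDS = 2  # assumption: default number of input operands

Instr = Tuple[str, Tuple[int, ...]]  # (opcode, source register numbers)


def expand(op: str, sources: Sequence[int]) -> List[Instr]:
    """Compiler side: split an instruction with too many source operands
    into a chain; all but the last instruction only latch operands."""
    chunks = [tuple(sources[i:i + DEFAULT_OPERANDS])
              for i in range(0, len(sources), DEFAULT_OPERANDS)]
    chain: List[Instr] = [(op + ".latch", c) for c in chunks[:-1]]
    chain.append((op, chunks[-1]))
    return chain


class AddUnit:
    """Execution unit with internal registers for latched input operands."""

    def __init__(self) -> None:
        self.latched: List[int] = []

    def execute(self, instr: Instr, regs: Dict[int, int]) -> Optional[int]:
        op, sources = instr
        values = [regs[r] for r in sources]
        if op == "add.latch":
            self.latched.extend(values)  # store operands, produce no result
            return None
        if op == "add":
            result = sum(self.latched) + sum(values)
            self.latched.clear()
            return result
        raise ValueError(f"unsupported opcode: {op}")
```

With this sketch, `expand("add", [1, 2, 3])` yields a latch instruction reading registers 1 and 2 followed by an `add` reading register 3, so each instruction fits the two-operand format of the narrow word.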
In this way, the instruction words of the second instruction set need not always have the same width as the widest instruction of the first instruction word, corresponding to the first instruction set, which would be costly in terms of instruction memory and could reduce the performance of the processor.
In different embodiments, the first instruction words are compressed VLIW instruction words. Single bits in a set of dedicated bits, stored in a field of the instruction word, encode the NOP operations. A bit '0' refers to a NOP operation, and the position of the bit in the field points to the instruction within the instruction word that holds the NOP operation. A bit '1' refers to an instruction having a non-NOP operation, and the position of the bit in the field points to that instruction within the instruction word. Compressed VLIW instruction words stored in instruction memory IM1 have to be decompressed before the instruction word IW1 can be output, which requires decompression logic, as known to the person skilled in the art. The use of compressed VLIW instruction words reduces the size of the instruction memory IM1, but the decompression step reduces the overall performance of the VLIW processor. However, the performance penalty is limited, since the number of NOP operations in the VLIW instruction words stored in instruction memory IM1 is limited as well: instruction words having a relatively large number of NOP operations are rescheduled, and therefore the instruction words stored in instruction memory IM1 require only a relatively small number of shift operations during decompression.
A superscalar processor also comprises multiple issue slots that can perform multiple operations in parallel, as in the case of a VLIW processor.
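The NOP compression with dedicated bits described above can be illustrated with a minimal Python sketch. The function names `compress` and `decompress`, the representation of instructions as strings, and the convention that bit i of the field corresponds to issue slot i are assumptions for this example; the text only fixes that a '0' bit marks a NOP and a '1' bit marks a non-NOP instruction.

```python
from typing import List, Tuple

NOP = "NOP"


def compress(word: List[str]) -> Tuple[int, List[str]]:
    """Return (mask, payload): bit i of mask is 1 when slot i holds a
    non-NOP instruction; payload keeps only the non-NOP instructions."""
    mask = 0
    payload: List[str] = []
    for pos, instr in enumerate(word):
        if instr != NOP:
            mask |= 1 << pos
            payload.append(instr)
    return mask, payload


def decompress(mask: int, payload: List[str], num_slots: int) -> List[str]:
    """Rebuild the full-width instruction word from mask and payload."""
    if mask >> num_slots or bin(mask).count("1") != len(payload):
        raise ValueError("mask and payload do not match")
    remaining = iter(payload)
    return [next(remaining) if (mask >> pos) & 1 else NOP
            for pos in range(num_slots)]
```

A word with few NOPs keeps most of its instructions in place, which corresponds to the observation above that only a small number of shift operations is needed during decompression.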
In a superscalar processor, however, the processor hardware itself determines at runtime which operation dependencies exist and decides which operations to execute in parallel based on these dependencies, while ensuring that no resource conflicts occur. The principles of the embodiments for a VLIW processor, described in this section, also apply to a superscalar processor. In general, a VLIW processor may have more issue slots than a superscalar processor. The hardware of a VLIW processor is less complicated than that of a superscalar processor, which results in a more scalable architecture. The number of issue slots and the complexity of each issue slot, among other things, determine the benefit that can be obtained using the present invention.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. In the device claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

CLAIMS:
1. A processing apparatus, comprising: a register file (RF0, RF1) for storing data; a plurality of issue slots (IS0 - IS5), wherein each issue slot comprises at least one execution unit, wherein the processing apparatus is conceived for processing data, retrieved from the register file, under control of at least a first instruction word (IW1) and a second instruction word (IW2), wherein the first instruction word is selected from a first instruction set, and wherein the first instruction word encodes a plurality of instructions to be executed in parallel by the plurality of issue slots, wherein the second instruction word is selected from a second instruction set, and wherein the second instruction word encodes at least one instruction to be executed by a subset of the plurality of issue slots.
2. A processing apparatus according to claim 1, further comprising: a first instruction memory (IM1) for storing the first instruction word, a second instruction memory (IM2) for storing the second instruction word.
3. A processing apparatus according to claim 1, further comprising: a first decoder (DEC1) for decoding the first instruction word, a second decoder (DEC2) for decoding the second instruction word.
4. A processing apparatus according to claim 1, wherein said processing apparatus is a Very Large Instruction Word (VLIW) processor, wherein the first instruction word is a VLIW instruction word, and wherein the second instruction word has a width smaller than the first instruction word.
5. A processing apparatus according to claim 1, wherein the first instruction word is a compressed instruction word, comprising dedicated bits for encoding of NOP operations.
6. A processing apparatus according to claim 1, which further comprises a connection network (CN) for coupling the register file and the plurality of issue slots.
7. A processing apparatus according to claim 1, wherein the register file is a distributed register file.
8. A method of processing data, using a processing apparatus, wherein the processing apparatus comprises: a register file (RF0, RF1) for storing data; a plurality of issue slots (IS0 - IS5), wherein each issue slot comprises at least one execution unit, wherein the method comprises the following steps: retrieving data from the register file; processing the data under control of a first instruction word (IW1) and a second instruction word (IW2), wherein the first instruction word is selected from a first instruction set, and wherein the first instruction word encodes a plurality of instructions to be executed in parallel by the plurality of issue slots, wherein the second instruction word is selected from a second instruction set, and wherein the second instruction word encodes at least one instruction to be executed by a subset of the plurality of issue slots.
9. A compiler program product arranged for generating a sequence of instruction words that can be executed by a processing apparatus according to claim 1.
10. A computer program comprising computer program code means for instructing a processing apparatus to perform the steps of the method according to claim 8.
PCT/IB2004/052047 2003-10-14 2004-10-11 Instruction encoding for vliw processors WO2005036384A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
EP03103796 2003-10-14
EP03103796.3 2003-10-14

Publications (2)

Publication Number Publication Date
WO2005036384A2 (en) 2005-04-21
WO2005036384A3 (en) 2005-10-20

Family

ID=34429486

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2004/052047 WO2005036384A2 (en) 2003-10-14 2004-10-11 Instruction encoding for vliw processors

Country Status (1)

Country Link
WO (1) WO2005036384A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6026479A (en) * 1998-04-22 2000-02-15 Hewlett-Packard Company Apparatus and method for efficient switching of CPU mode between regions of high instruction level parallism and low instruction level parallism in computer programs
US20020042909A1 (en) * 2000-10-05 2002-04-11 Koninklijke Philips Electronics N.V. Retargetable compiling system and method
WO2003083649A1 (en) * 2002-03-28 2003-10-09 Koninklijke Philips Electronics N.V. Vliw processor

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2430773A (en) * 2005-10-03 2007-04-04 Advanced Risc Mach Ltd Alignment of variable length program instructions
CN100456230C (en) * 2007-03-19 2009-01-28 中国人民解放军国防科学技术大学 Computing group structure for superlong instruction word and instruction flow multidata stream fusion
WO2013179085A1 (en) * 2012-05-29 2013-12-05 Freescale Semiconductor, Inc. Processing system and method of instruction set encoding space utilization
US9672042B2 (en) 2012-05-29 2017-06-06 Nxp Usa, Inc. Processing system and method of instruction set encoding space utilization
CN110007960A (en) * 2017-12-05 2019-07-12 三星电子株式会社 Electronic device and the method for using its process instruction

Also Published As

Publication number Publication date
WO2005036384A3 (en) 2005-10-20

Similar Documents

Publication Publication Date Title
US7313671B2 (en) Processing apparatus, processing method and compiler
US5958048A (en) Architectural support for software pipelining of nested loops
KR100690225B1 (en) Data processor system and instruction system using grouping
US7574583B2 (en) Processing apparatus including dedicated issue slot for loading immediate value, and processing method therefor
US7127593B2 (en) Conditional execution with multiple destination stores
JP3547139B2 (en) Processor
US7581082B2 (en) Software source transfer selects instruction word sizes
KR0178078B1 (en) Data processor capable of simultaneoulsly executing two instructions
US20030037221A1 (en) Processor implementation having unified scalar and SIMD datapath
US20050223195A1 (en) Processor for making more efficient use of idling components and program conversion apparatus for the same
JPH1165844A (en) Data processor with pipeline bypass function
JP3781519B2 (en) Instruction control mechanism of processor
JP5989293B2 (en) Execution time selection of feedback connection in multiple instruction word processor
WO2005036384A2 (en) Instruction encoding for vliw processors
JP5122277B2 (en) Data processing method, processing device, multiple instruction word set generation method, compiler program
KR102560426B1 (en) Encoding of Instructions Identifying First and Second Architecture Register Numbers
JP4828409B2 (en) Support for conditional actions in time stationery processors
JP5068529B2 (en) Zero-overhead branching and looping in time-stationary processors
JP3915019B2 (en) VLIW processor, program generation device, and recording medium
JP2001306321A (en) Processor
WO1998006040A1 (en) Architectural support for software pipelining of nested loops
JP2001195252A (en) Processor for improving branching efficiency by using invalidation by mask technique

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
122 Ep: pct application non-entry in european phase