EP0927391A1 - Input operand control in data processing systems - Google Patents

Input operand control in data processing systems

Info

Publication number
EP0927391A1
EP0927391A1 EP97937703A EP97937703A EP0927391A1 EP 0927391 A1 EP0927391 A1 EP 0927391A1 EP 97937703 A EP97937703 A EP 97937703A EP 97937703 A EP97937703 A EP 97937703A EP 0927391 A1 EP0927391 A1 EP 0927391A1
Authority
EP
European Patent Office
Prior art keywords
bit
register
instruction
piccolo
registers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP97937703A
Other languages
German (de)
English (en)
French (fr)
Inventor
David Vivian Jaggar
Simon James Glass
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ARM Ltd
Original Assignee
ARM Ltd
Advanced Risc Machines Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB9619826A (patent GB2317467B)
Application filed by ARM Ltd, Advanced Risc Machines Ltd
Publication of EP0927391A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30018 Bit or string instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4063 Device-to-bus coupling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G06F9/30105 Register structure

Definitions

  • This invention relates to data processing systems. More particularly, this invention relates to data processing systems having a plurality of registers for storing data words to be manipulated by an arithmetic logic unit operating under control of program instruction words.
  • The present invention provides apparatus for data processing, said apparatus comprising: a plurality of registers for storing data words to be manipulated, each of said registers having at least an N-bit capacity; and an arithmetic logic unit responsive to program instruction words to perform arithmetic logic operations specified by said program instruction words; wherein said arithmetic logic unit is responsive to at least one program instruction word that includes:
  • an input operand size flag specifying whether said input operand data word has an N-bit size or an (N/2)-bit size; and (iii) when said input size flag specifies an (N/2)-bit size, a high/low location flag indicating in which of high order bit positions of said source register and low order bit positions of said source register said input operand data word is located.
  • The present invention recognises that when the data words to be manipulated are smaller than the datapath width, using a full register to store each word wastes the register resources of the device. This is particularly the case in a load/store architecture machine, in which all data to be manipulated must be in a register and for which it is desirable to reduce the number of times data must be fetched from the cache or main memory.
  • The invention recognises the above consideration and provides the solution of using an input operand size flag and a high/low location flag to indicate the input operand size and in which portion of the register it is stored. In this way, a single register can hold more than one input operand, so more efficiently utilising the register resources of the device, and yet those input operands may be separately manipulated.
  • The advantages of the present invention are enhanced further when an N-bit data bus links the data storage device to the registers. In this case, the data bus may be used to transfer two operands at a time, so more efficiently using the bus bandwidth and reducing the possibility of a performance bottleneck.
  • Said arithmetic logic unit is responsive to at least one parallel operation program instruction word that performs separate arithmetic logic operations upon a first (N/2)-bit input operand data word and a second (N/2)-bit input operand data word stored within respective high order bit positions and low order bit positions of a single source register.
  • The arithmetic logic unit has a signal path that functions as a carry chain between bit positions in arithmetic logic operations and, when executing a parallel operation program instruction word, said signal path is broken between said first (N/2)-bit input operand data word and said second (N/2)-bit input operand data word.
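  • The effect of breaking the carry chain can be sketched in software. The following is a minimal illustration assuming N = 32; the function name `parallel_add16` is hypothetical and does not appear in the specification. Each 16-bit half is added separately, so a carry out of bit 15 never reaches bit 16.

```python
MASK16 = 0xFFFF


def parallel_add16(a: int, b: int) -> int:
    """Add the two 16-bit halves of 32-bit words a and b independently."""
    # Low halves: the carry out of bit 15 is discarded (carry chain broken).
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    # High halves: added on their own, unaffected by the low-half carry.
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


if __name__ == "__main__":
    x, y = 0x0001FFFF, 0x00010001
    print(hex(parallel_add16(x, y)))      # halves added separately
    print(hex((x + y) & 0xFFFFFFFF))      # ordinary 32-bit add, for comparison
```

  • With the inputs above, the ordinary 32-bit add lets the low-half carry spill into the high half, while the parallel form keeps the two results separate.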
  • While parallel operation program instructions could take many forms, it is preferred that said parallel operation program instruction word performs the arithmetic logic operation of one of
  • A further refinement of the invention is that when said input size flag specifies an N-bit size, said high/low location flag indicates whether those bits stored in said high order bit positions should be moved to said low order bit positions and those bits stored in said low order bit positions should be moved to said high order bit positions prior to use as an N-bit input operand data word.
  • This feature is particularly useful during transform operations.
  • A particularly effective hardware implementation of this functionality comprises at least one multiplexer responsive to said high/low location flag for selecting for supply to the low order (N/2) bits of said datapath an (N/2)-bit input operand data word stored in one of high order bit positions of said source register and low order bit positions of said source register.
  • The present invention provides a method of processing data, said method comprising the steps of: storing data words to be manipulated in a plurality of registers, each of said registers having at least an N-bit capacity; and in response to program instruction words, performing arithmetic logic operations specified by said program instruction words; wherein at least one program instruction word includes:
  • an input operand size flag specifying whether said input operand data word has an N-bit size or an (N/2)-bit size; and (iii) when said input size flag specifies an (N/2)-bit size, a high/low location flag indicating in which of high order bit positions of said source register and low order bit positions of said source register said input operand data word is located.
  • Figure 1 illustrates the high level configuration of a digital signal processing apparatus.
  • Figure 2 illustrates the input buffer and register configuration of a coprocessor.
  • Figure 3 illustrates the datapath through the coprocessor.
  • Figure 4 illustrates a multiplexing circuit for reading high or low order bits from a register.
  • Figure 5 is a block diagram illustrating register remapping logic used by the coprocessor in preferred embodiments.
  • Figure 6 illustrates in more detail the register remapping logic shown in Figure 5.
  • Figure 7 is a table illustrating a Block Filter Algorithm
  • DSP digital signal processing
  • DSP can take many forms, but may typically be considered to be processing that requires the high speed (real time) processing of large volumes of data. This data typically represents some analogue physical signal.
  • A good example of DSP is that used in digital mobile telephones, in which radio signals are received and transmitted that require decoding and encoding (typically using convolution, transform and correlation operations) to and from an analogue sound signal.
  • DSP is also used in disk drive controllers, in which the signals recovered from the disk heads are processed to yield head tracking control.
  • the interface of the microprocessor and the coprocessor and the coprocessor architecture itself are specifically configured to provide DSP functionality.
  • the microprocessor core will be referred to as the ARM and the coprocessor as the Piccolo.
  • The ARM and the Piccolo will typically be fabricated as a single integrated circuit that will often include other elements (e.g. on-chip DRAM, ROM, D to A and A to D convertors etc.) as part of an ASIC. Piccolo is an ARM coprocessor; it therefore executes part of the ARM instruction set.
  • The ARM coprocessor instructions allow ARM to transfer data between Piccolo and memory (using Load Coprocessor, LDC, and Store Coprocessor, STC, instructions).
  • FIG. 1 illustrates the ARM 2 and Piccolo 4, with the ARM 2 issuing control signals to the Piccolo 4 to control the transfer of data words to and from Piccolo 4.
  • An instruction cache 6 stores the Piccolo program instruction words that are required by Piccolo 4.
  • A single DRAM memory 8 stores all the data and instruction words required by both the ARM 2 and Piccolo 4.
  • The ARM 2 is responsible for addressing the memory 8 and controlling all data transfers. The arrangement with only a single memory 8 and one set of data and address buses is less complex and expensive than the typical DSP approach that requires multiple memories and buses with high bus bandwidths.
  • Piccolo executes a second instruction stream (the digital signal processing program instruction words) from the instruction cache 6, which controls the Piccolo datapath.
  • These instructions include digital signal processing type operations, for example Multiply-Accumulate, and control flow instructions, for example zero overhead loop instructions.
  • These instructions operate on data which is held in Piccolo registers 10 (see Figure 2). This data was earlier transferred from memory 8 by the ARM 2.
  • The instructions are streamed from the instruction cache 6; the instruction cache 6 drives the data bus as a full bus master.
  • A small Piccolo instruction cache 6 will be a 4 line, 16 words per line direct mapped cache (64 instructions). In some implementations, it may be worthwhile to make the instruction cache bigger.
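  • As a rough illustration of the 4 line, 16 words per line direct mapped organisation, the sketch below splits an address into tag, line and word offset. Byte addressing with 4-byte instruction words is an assumption, and `cache_slot` is a hypothetical name used only for this example.

```python
from typing import Tuple

LINES = 4             # lines in the cache (from the text)
WORDS_PER_LINE = 16   # instruction words per line (from the text)
BYTES_PER_WORD = 4    # assumption: 32-bit instructions, byte addressing


def cache_slot(byte_address: int) -> Tuple[int, int, int]:
    """Return (tag, line index, word offset) for a direct mapped lookup."""
    word = byte_address // BYTES_PER_WORD
    offset = word % WORDS_PER_LINE
    line = (word // WORDS_PER_LINE) % LINES
    tag = word // (WORDS_PER_LINE * LINES)
    return tag, line, offset


if __name__ == "__main__":
    for addr in (0x000, 0x040, 0x044, 0x100):
        print(hex(addr), cache_slot(addr))
```

  • Under these assumptions, two addresses 256 bytes apart map to the same line with different tags, which is the conflict case a direct mapped cache cannot hold simultaneously.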
  • This allows sustained single cycle data processing on 16 bit data.
  • Piccolo has a data input mechanism (illustrated in Figure 2) that allows the ARM to prefetch sequential data, loading the data before it is required by Piccolo.
  • Piccolo can access the loaded data in any order, automatically refilling its register as the old data is used for the last time (all instructions have one bit per source operand to indicate that the source register should be refilled).
  • This input mechanism is termed the reorder buffer and comprises an input buffer 12. Every value loaded into Piccolo (via an LDC or MCR, see below) carries with it a tag Rn specifying which register the value is destined for. The tag Rn is stored alongside the data word in the input buffer.
  • The register is marked as empty by asserting a signal E.
  • The register is then automatically refilled by a refill control circuit 16 using the oldest loaded value destined for that register within the input buffer 12.
  • The reorder buffer holds 8 tagged values.
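  • The refill behaviour described above (tagged values, oldest matching entry first) can be modelled in a few lines. This is a simplified sketch, not the hardware design: the class and method names are hypothetical, and a full buffer is modelled by raising an exception where the real interface would busy-wait the ARM.

```python
from collections import deque


class ReorderBuffer:
    """Toy model of the input reorder buffer: values tagged with a register."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries = deque()  # oldest entry at the left

    def load(self, tag: str, value: int) -> None:
        """Append a tagged value, as an LDC or MCR transfer would."""
        if len(self.entries) >= self.capacity:
            raise BufferError("reorder buffer full")
        self.entries.append((tag, value))

    def refill(self, register: str):
        """Remove and return the oldest value tagged for `register`, else None."""
        for i, (tag, value) in enumerate(self.entries):
            if tag == register:
                del self.entries[i]
                return value
        return None


if __name__ == "__main__":
    rob = ReorderBuffer()
    rob.load("X0", 1)
    rob.load("Y0", 2)
    rob.load("X0", 3)
    print(rob.refill("Y0"), rob.refill("X0"), rob.refill("X0"))
```

  • Values for different registers may be consumed out of load order, but values for the same register are always consumed oldest first.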
  • the input buffer 12 has a form similar to a FIFO except that data words
  • The output buffer 18 holds eight 32 bit values. Piccolo connects to ARM via the coprocessor interface (CP Control signals of
  • On execution of an ARM coprocessor instruction, Piccolo can either execute the instruction, cause the ARM to wait until Piccolo is ready before executing the instruction, or refuse to execute the instruction. In the last case ARM will take an undefined instruction exception.
  • The most common coprocessor instructions that Piccolo will execute are LDC and STC.
  • Piccolo fetches its own instructions from memory to control the Piccolo datapath illustrated in Figure 3 and to transfer data from the reorder buffer to registers and from registers to the output buffer 18.
  • The arithmetic logic unit of the Piccolo that executes these instructions has a multiplier/adder circuit 20 that performs multiplies, adds, subtracts, multiply-accumulates, logical operations, shifts and rotates.
  • The Piccolo instructions are initially loaded from memory into the instruction cache 6, where Piccolo can access them without needing access back to the main memory.
  • Piccolo cannot recover from memory aborts. Therefore if Piccolo is used in a virtual memory system, all Piccolo data must be in physical memory throughout the Piccolo task. This is not a significant limitation given the real time nature of Piccolo tasks, e.g. real time DSP. If a memory abort occurs Piccolo will stop and set a flag in a status register S2.
  • Figure 3 shows the overall datapath functionality of Piccolo.
  • The register bank 10 uses 3 read ports and 2 write ports. One write port (the L port) is used to refill registers from the reorder buffer.
  • The output buffer 18 is updated directly from the ALU result bus 26. Output from the output buffer 18 is under ARM program control.
  • The ARM coprocessor interface performs LDC (Load Coprocessor) instructions into the reorder buffer, and STC (Store Coprocessor) instructions from the output buffer 18, as well as MCR and MRC (Move ARM register to/from CP register) on the register bank 10.
  • LDC Load Coprocessor
  • STC Store Coprocessor
  • MCR and MRC Move ARM register to/from CP register
  • The multiplier 20 performs a 16 x 16 signed or unsigned multiply with an optional 48 bit accumulate.
  • The scaler unit 24 can provide a 0 to 31 immediate arithmetic or logical shift right, followed by an optional saturate.
  • The shifter and logical unit 20 can perform either a shift or a logical operation every cycle.
  • Piccolo has 16 general purpose registers named D0-D15 or A0-A3, X0-X3, Y0-Y3, Z0-Z3.
  • the first four registers (A0-A3) are intended as accumulators and are 48 bits wide, the extra 16 bits providing a guard against overflow during many successive calculations.
  • the remaining registers are 32 bits wide.
  • Each of Piccolo's registers can be treated as containing two independent 16 bit values. Bits 0 to 15 contain the low half, bits 16 to 31 contain the high half. Instructions can specify a particular 16 bit half of each register as a source operand, or they may specify the entire 32 bit register.
  • Piccolo also provides for saturated arithmetic. Variants of the multiply, add and subtract instructions provide a saturated result if the result is greater than the size of the destination register. Where the destination register is a 48 bit accumulator, the value is saturated to 32 bits (i.e. there is no way to saturate a 48 bit value). There is no overflow detection on 48 bit registers. This is a reasonable restriction since it would take at least 65536 multiply accumulate instructions to cause an overflow. Each Piccolo register is either marked as "empty" (E flag, see Figure 2) or contains a value (it is not possible to have half of a register empty). Initially, all registers are marked as empty.
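  • Saturation as described above amounts to clamping a result to the signed range of the target width. The helper below is a generic sketch of that clamp, assuming signed two's complement values; `saturate` is a hypothetical name, not a Piccolo instruction.

```python
def saturate(value: int, bits: int) -> int:
    """Clamp a signed integer to the range representable in `bits` bits."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    return max(lowest, min(highest, value))


if __name__ == "__main__":
    # A 48 bit accumulator result saturated to 32 bits, as the text describes.
    print(saturate(1 << 40, 32))
    print(saturate(-(1 << 40), 32))
    print(saturate(12345, 32))
```

  • Values already within range pass through unchanged; only out-of-range results are pinned to the nearest limit.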
  • Piccolo attempts with the refill control circuit 16 to fill one of the empty registers with a value from the input reorder buffer. Alternatively, if the register is written with a value from the ALU it is no longer marked as "empty". If a register is written from the ALU and at the same time there is a value waiting to be placed in the register from the reorder buffer then the result is undefined. Piccolo's execution unit will stall if a read is made to an empty register.
  • The Input Reorder Buffer (ROB) sits between the coprocessor interface and Piccolo's register bank. Data is loaded into the ROB with ARM coprocessor transfers.
  • the ROB contains a number of 32-bit values, each with a tag indicating the Piccolo register that the value is destined for.
  • The tag also indicates whether the data should be transferred to a whole 32-bit register or just to the bottom 16 bits of a 32-bit register. If the data is destined for a whole register, the bottom 16 bits of the entry will be transferred to the bottom half of the target register and the top 16 bits will be transferred to the top half of the register (sign extended if the target register is a 48-bit accumulator). If the data is destined for just the bottom half of a register (a so-called "Half Register"), the bottom 16 bits will be transferred first.
  • The register tag always refers to a physical destination register; no register remapping is performed (see below regarding register remapping).
  • The oldest entry is selected and its data transferred to the register bank.
  • The whole 32 bits are transferred and the entry is marked empty. If the bottom half of the target register is empty and the ROB entry contains data destined for the bottom half of a register, the bottom 16 bits of the ROB entry are transferred to the bottom half of the target register and the bottom half of the ROB entry is marked as empty.
  • the two halves of a register may be refilled independently from the ROB.
  • the data in the ROB is either marked as destined for a whole register or as two 16-bit values destined for the bottom half of a register.
  • The first three are assembled as LDCs, MPR and MRP as MCRs, LDPA is assembled as a CDP instruction.
  • <dest> stands for a Piccolo register (A0-Z3).
  • Rn for an ARM register
  • <size> for a constant number of bytes which must be a non zero multiple of
  • a 'word' is a 32-bit chunk from memory, which may consist of two 16-bit data items or one 32-bit data item.
  • The LDP instruction transfers a number of data items, marking them as destined for a full register.
  • The instruction will load <size>/4 words from address Rn in memory, inserting them into the ROB.
  • The number of words that can be transferred is limited by the following:
  • - <size> must be less than or equal to the size of the ROB for a particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions)
  • The first data item transferred will be tagged as destined for <dest>.
  • The second as destined for <dest>+1 and so on (with wrapping from Z3 to A0). If the '!' is specified then the register Rn is incremented by <size> afterwards. If the LDP16 variant is used, endian specific action is performed on the two
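  • The tagging rule for LDP (consecutive destination registers, wrapping from Z3 to A0) can be sketched as follows. This assumes that <size> is a byte count and a multiple of 4, as implied by the "<size>/4 words" wording, and applies the 8 word limit of the first ROB version; `ldp_tags` is a hypothetical helper name.

```python
# Register order taken from the text: A0-A3, X0-X3, Y0-Y3, Z0-Z3.
REGISTERS = [f"{bank}{i}" for bank in "AXYZ" for i in range(4)]
ROB_WORDS = 8  # ROB size of the first version, per the text


def ldp_tags(dest: str, size_bytes: int) -> list:
    """Return the destination tag for each word an LDP of size_bytes loads."""
    if size_bytes <= 0 or size_bytes % 4:
        raise ValueError("size must be a non zero multiple of 4 (assumed)")
    words = size_bytes // 4
    if words > ROB_WORDS:
        raise ValueError("size exceeds the reorder buffer")
    start = REGISTERS.index(dest)
    return [REGISTERS[(start + n) % len(REGISTERS)] for n in range(words)]


if __name__ == "__main__":
    print(ldp_tags("Z2", 16))
```

  • Starting at Z2 with four words, the tags run Z2, Z3 and then wrap to A0 and A1.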
  • the LDPW instruction transfers a number of data items to a set of registers
  • The first data item transferred is tagged as destined for <dest>, the next for <dest>+1, etc.
  • The next item transferred is tagged as destined for <dest>, and so on.
  • The <wrap> quantity is specified in halfword quantities.
  • - <size> must be less than or equal to the size of the ROB for a particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions);
  • - <dest> may be one of {A0, X0, Y0, Z0};
  • - <wrap> may be one of {2,4,8} halfwords for LDP32W and one of {1,2,4,8} halfwords for LDP16W;
  • All unused encodings of the LDP instruction may be reserved for future expansion
  • The LDP16U instruction is provided to support the efficient transfer of non-word aligned 16-bit data.
  • LDP16U support is provided for registers D4 to D15 (the X, Y and Z banks).
  • The LDP16U instruction will transfer one 32-bit word of data (containing two 16-bit data items) from memory into Piccolo. Piccolo will discard the bottom 16 bits of this data and store the top 16 bits in a holding register. There is a holding register for the X, Y and Z banks.
  • The behaviour of LDP{W} instructions is modified if the data is destined for a register in that bank.
  • the data loaded into the ROB is formed by the concatenation of the holding register and the bottom 16 bits of data being transferred by the LDP instruction
  • the upper 16 bits of data being transferred is put into the holding register.
  • This mode of operation is persistent until it is turned off by an LDPA instruction.
  • the holding register does not record the destination register tag or size.
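  • The unaligned mode described above can be modelled as a small state machine per bank. The sketch below assumes little endian operation and places the holding register contents in the bottom 16 bits of each word passed on to the ROB; the class and method names are hypothetical.

```python
class UnalignedBank:
    """Toy model of one bank's holding register for LDP16U/LDP/LDPA."""

    def __init__(self):
        self.holding = None  # None means unaligned mode is off

    def ldp16u(self, word: int) -> None:
        """Prime unaligned mode: discard the bottom 16 bits, hold the top 16."""
        self.holding = (word >> 16) & 0xFFFF

    def ldp(self, word: int) -> int:
        """Return the 32-bit value that would be placed in the ROB."""
        if self.holding is None:
            return word & 0xFFFFFFFF
        # Held halfword goes in the bottom half, new bottom halfword on top.
        out = ((word & 0xFFFF) << 16) | self.holding
        # The upper 16 bits of the new word become the next held value.
        self.holding = (word >> 16) & 0xFFFF
        return out

    def ldpa(self) -> None:
        """Switch unaligned mode off for this bank."""
        self.holding = None


if __name__ == "__main__":
    bank = UnalignedBank()
    bank.ldp16u(0x11110000)
    print(hex(bank.ldp(0x33332222)), hex(bank.ldp(0x55554444)))
```

  • In this model a stream of halfwords that starts in the top half of a memory word is re-paired so that each ROB entry again holds two consecutive items.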
  • the LDPA instruction is used to switch off the unaligned mode of operation initiated by a LDP16U instruction.
  • The unaligned mode may be turned off independently on banks X, Y, Z.
  • The MPR instruction places the contents of ARM register Rn into the ROB, destined for Piccolo register <dest>.
  • The destination register <dest> may be any full register in the range A0-Z3.
  • The MPRW instruction places the contents of ARM register Rn into the ROB, marking it as two 16-bit data items destined for the 16-bit Piccolo register <dest>.l.
  • The restrictions on <dest> are the same as those for the LDPW instructions (i.e. A0, X0, Y0, Z0).
  • the instruction
  • R3 will transfer the contents of R3 into the ROB, marking the data as two 16-bit quantities destined for X0.l. It should be noted that, as for the LDP16W case with a wrap of 1, only the bottom half of a 32-bit register can be targeted.
  • LDP is encoded as:
  • PICCOLO1 is Piccolo's first coprocessor number (currently 8).
  • the N bit selects between LDP32 (1) and LDP16 (0).
  • LDPW is encoded as:
  • DEST is 0-3 for destination register A0, X0, Y0, Z0 and WRAP is 0-3 for wrap values 1, 2, 4, 8.
  • PICCOLO2 is Piccolo's second coprocessor number (currently 9). The N bit selects between LDP32 (1) and LDP16 (0).
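  • The DEST, WRAP and N field meanings given above can be captured in a small decoder. Only the field value mappings come from the text; the bit positions of the fields within the instruction word are not reproduced here, and `decode_ldpw_fields` is a hypothetical helper.

```python
LDPW_DEST = ("A0", "X0", "Y0", "Z0")  # DEST values 0-3


def decode_ldpw_fields(dest: int, wrap: int, n: int):
    """Map raw DEST/WRAP/N field values to (register, wrap, mnemonic)."""
    if not (0 <= dest <= 3 and 0 <= wrap <= 3 and n in (0, 1)):
        raise ValueError("field out of range")
    wrap_halfwords = 1 << wrap            # WRAP 0-3 encodes 1, 2, 4, 8
    mnemonic = "LDP32W" if n else "LDP16W"
    if mnemonic == "LDP32W" and wrap_halfwords == 1:
        # The text allows only {2,4,8} halfwords for LDP32W.
        raise ValueError("wrap of 1 halfword is only valid for LDP16W")
    return LDPW_DEST[dest], wrap_halfwords, mnemonic


if __name__ == "__main__":
    print(decode_ldpw_fields(1, 3, 1))
    print(decode_ldpw_fields(0, 0, 0))
```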
  • LDP16U is encoded as:
  • DEST is 1-3 for the destination bank X, Y, Z.
  • LDPA is encoded as:
  • BANK[3:0] is used to turn off the unaligned mode on a per bank basis. If BANK[1] is set, unaligned mode on bank X is turned off. BANK[2] and BANK[3] turn off unaligned mode on banks Y and Z if set, respectively. N.B. This is a CDP operation.
  • MPR is encoded as:
  • MPRW is encoded as:
  • DEST is 1-3 for the destination register X0, Y0, Z0.
  • The output FIFO can hold up to eight 32-bit values. These are transferred from Piccolo by using one of the following (ARM) opcodes:
  • The MRP instruction removes one word from the output FIFO and places it in ARM register Rn. As with MPR, no endian specific operations are applied to the data.
  • the ARM encoding for STP is:
  • N selects between STP32 (1) and STP16 (0).
  • W b s refer to an ARM data sheet.
  • the ARM encoding for MRP is:
  • Piccolo assumes little endian operation internally. For example, when accessing a 32-bit register as 16-bit halves, the lower half is assumed to occupy bits 15 to 0. Piccolo may be operating in a system with big endian memory or peripherals and must therefore take care to load 16-bit packed data in the correct manner.
  • Piccolo i.e. the DSP adapted coprocessor
  • the ARM e.g. the ARM7 microprocessors produced by Advanced RISC Machines Limited of Cambridge, United Kingdom
  • Piccolo uses this pin to configure the input reorder buffer and output FIFO.
  • When the ARM loads packed 16-bit data into the reorder buffer it must indicate this by using the 16-bit form of the LDP instruction.
  • This information is combined with the state of the 'BIGEND' configuration input to place data into the holding latches and reorder buffer in the appropriate order.
  • the holding register stores the bottom 16 bits of the loaded word, and is paired up with the top 16 bits of the next load.
  • the holding register contents always end up in the bottom 16 bits of the word transferred into the reorder buffer.
  • the output FIFO may contain either packed 16-bit or 32-bit data.
  • The programmer must use the correct form of the STP instruction so that Piccolo can ensure that the 16-bit data is provided on the correct halves of the data bus. When configured as big endian, the top and bottom 16-bit halves are swapped when the 16-bit forms of STP are used.
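  • The halfword swap applied to packed 16-bit output in a big endian configuration can be sketched as below. The `bigend` parameter stands in for the 'BIGEND' configuration input mentioned earlier, and `stp16_bus_word` is a hypothetical name for illustration only.

```python
def stp16_bus_word(fifo_word: int, bigend: bool) -> int:
    """Return the 32-bit value driven for a 16-bit form of STP."""
    fifo_word &= 0xFFFFFFFF
    if bigend:
        # Swap the top and bottom 16-bit halves for big endian systems.
        return ((fifo_word & 0xFFFF) << 16) | (fifo_word >> 16)
    return fifo_word


if __name__ == "__main__":
    print(hex(stp16_bus_word(0x11112222, bigend=False)))
    print(hex(stp16_bus_word(0x11112222, bigend=True)))
```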
  • Piccolo has 4 private registers which can only be accessed from the ARM. They are called S0-S2. They can only be accessed with MRC and MCR instructions.
  • The opcodes are:
  • These opcodes transfer a 32-bit value between ARM register Rm and private register Sn. They are encoded in ARM as a coprocessor register transfer:
  • Register S0 contains the Piccolo unique ID and revision code.
  • Bits[15:4] contain a 3 digit part number in binary coded decimal format: 0x500 for Piccolo.
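  • Reading the part number out of S0 is a matter of extracting bits 15 to 4 and interpreting the three nibbles as decimal digits. The sketch below covers only that field; the layout of the remaining bits of S0 is not assumed, and `part_number` is a hypothetical helper name.

```python
def part_number(s0: int) -> int:
    """Decode the 3 digit BCD part number held in bits [15:4] of S0."""
    field = (s0 >> 4) & 0xFFF
    digits = [(field >> shift) & 0xF for shift in (8, 4, 0)]
    if any(d > 9 for d in digits):
        raise ValueError("field is not valid binary coded decimal")
    return digits[0] * 100 + digits[1] * 10 + digits[2]


if __name__ == "__main__":
    # 0x500 placed in bits [15:4]; other bits left as zero for the example.
    print(part_number(0x500 << 4))
```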
  • Register S1 is the Piccolo status register.
  • Piccolo encountered a BREAKPOINT and has halted. H bit: Piccolo encountered a HALT instruction and has halted.
  • Register S2 is the Piccolo program counter:
  • the coprocessor interface is busy-waiting, because of insufficient space in the ROB or insufficient items in the output FIFO.
  • Piccolo sets the D-bit in its status register, halts and rejects the ARM coprocessor instruction, causing ARM to take the undefined instruction trap.
  • There are several operations available that may be used to control Piccolo from the ARM; these are provided by CDP instructions. These CDP instructions will only be accepted when the ARM is in a privileged state. If this is not the case Piccolo will reject the CDP instruction, resulting in the ARM taking the undefined instruction trap. The following operations are available:
  • Piccolo may be reset in software by using the PRESET instruction.
  • Executing the PRESET instruction may take several cycles to complete (2-3 for this embodiment). Whilst it is executing, following ARM coprocessor instructions to be executed on Piccolo will be busy waited.
  • Piccolo's state may be saved and restored using STC and LDC instructions (see below regarding accessing Piccolo state from ARM).
  • To enter state access mode, the PSTATE instruction must first be executed:
  • This instruction is encoded as:
  • When executed, the PSTATE instruction will: - Halt Piccolo (if it is not already halted), setting the E bit in Piccolo's Status Register.
  • Executing the PSTATE instruction may take several cycles to complete, as Piccolo's instruction pipeline must drain before it can halt. Whilst it is executing, following ARM coprocessor instructions to be executed on Piccolo will be busy waited.
  • The PENABLE and PDISABLE instructions are used for fast context switching.
  • When Piccolo is disabled, only private registers 0 and 1 (the ID and Status registers) are accessible, and only then from a privileged mode. Access to any other state, or any access from user mode, will cause an ARM undefined instruction exception. Disabling Piccolo causes it to halt execution. When Piccolo has halted execution, it will acknowledge the fact by setting the E bit in the status register.
  • Piccolo is enabled by executing the PENABLE instruction:
  • Piccolo is disabled by executing the PDISABLE instruction:
  • This instruction is encoded as:
  • The Piccolo instruction cache holds the Piccolo instructions which control the Piccolo datapath. If present it is guaranteed to hold at least 64 instructions, starting on a 16 word boundary.
  • The following ARM opcode assembles into an MCR. Its action is to force the cache to fetch a line of (16) instructions starting at the specified address (which must be on a 16-word boundary). This fetch occurs even if the cache already holds data related to this address.
  • The MCR encoding of this opcode is:
  • This section discusses the Piccolo instruction set which controls the Piccolo data path. Each instruction is 32 bits long. The instructions are read from the Piccolo instruction cache.
  • The Source 1 (SRC1) operand has the following 7 bit format:
  • Refill - specifies that the register should be marked as empty after being read and can be refilled from the ROB.
  • The register size is specified in the assembler by adding a suffix to the register number: .l for the low 16 bits, .h for the high 16 bits or .x for 32 bits with the upper and lower sixteen bits interchanged.
  • The general Source 2 (SRC2) has one of the following three 12 bit formats:
  • Figure 4 illustrates a multiplexer arrangement responsive to the Hi/Lo bit and Size bit to switch appropriate halves of the selected register to the Piccolo datapath. If the Size bit indicates 16 bits, then a sign extending circuit pads the high order bits of the datapath with 0s or 1s as appropriate.
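  • The operand selection performed by the multiplexer arrangement can be approximated in software as follows. This sketch assumes signed 16-bit operands (so padding is by sign extension) and a 32-bit datapath; the function and parameter names are hypothetical.

```python
def select_operand(reg: int, size32: bool, hi: bool) -> int:
    """Model the Size and Hi/Lo bits selecting a source operand.

    size32=False: pick the low (.l) or high (.h) half and sign extend it.
    size32=True:  pass the register through, or swap its halves (.x) if hi.
    """
    low = reg & 0xFFFF
    high = (reg >> 16) & 0xFFFF
    if size32:
        return ((low << 16) | high) if hi else (reg & 0xFFFFFFFF)
    half = high if hi else low
    if half & 0x8000:
        return half | 0xFFFF0000  # pad the upper bits with 1s
    return half                   # pad the upper bits with 0s


if __name__ == "__main__":
    r = 0x80017FFF
    print(hex(select_operand(r, size32=False, hi=False)))  # .l
    print(hex(select_operand(r, size32=False, hi=True)))   # .h
    print(hex(select_operand(r, size32=True, hi=True)))    # .x
```

  • Routing the chosen half to the low 16 bits of the datapath means the ALU sees a 16-bit operand in the same position regardless of which half of the register it came from.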
  • The first encoding specifies the source as being a register, the fields having the same encoding as the SRC1 specifier.
  • The SCALE field specifies a scale to be applied to the result of the ALU.
  • The 8-bit immediate with rotate encoding allows the generation of a 32-bit immediate which is expressible by an 8-bit value and 2-bit rotate.
  • The following table shows the immediate values that can be generated from the 8-bit value XY:
  • the 6-bit Immediate encoding allows the use of a 6-bit unsigned immediate (range 0 to 63), together with a scale applied to the output of the ALU.
  • Source 2 encoding is common to most instruction variants. There are some exceptions to this rule which support a limited subset of the Source 2 encoding or modify it slightly:
  • Select instructions only support an operand which is a register or a 6-bit unsigned immediate.
  • the scale is not available as these bits are used by the condition field of the instruction.
  • Shift instructions only support an operand which is a 16-bit register or a 5-bit unsigned immediate between 1 and 31. No scale of the result is available.
  • If the 6-bit immediate is used then it is always duplicated onto both halves of the 32-bit quantity. If the 8-bit immediate is used it is duplicated only if the rotate indicates that the 8-bit immediate should be rotated onto the top half of the 32-bit quantity:
  • the multiply accumulate instructions do not allow an 8-bit rotated immediate to be specified. Bit 10 of the field is used to partly specify which accumulator to use. Source 2 is implied as a 16-bit operand.
  • Multiply double instructions do not allow the use of a constant. Only a 16-bit register can be specified. Bit 10 of the field is used to partly specify which accumulator to use.
  • Some instructions always imply a 32-bit operation (e.g. ADDADD), and in these cases the size bit shall be set to 1, with the Hi/Lo bit used to optionally swap the two 16-bit halves of the 32-bit operand.
  • Some instructions always imply a 16-bit operation (e.g. MUL) and the size bit should be set to 0.
  • The Hi/Lo bit selects which half of the register is used (it is assumed that the missing size bit is clear).
  • Multiply-accumulate instructions allow independent specification of the source accumulator and destination registers. For these instructions the Size bits are used to indicate the source accumulator, and the size bits are implied by the instruction type as 0.
  • The register is marked as empty after use and will be refilled from the ROB by the usual refill mechanism (see the section on the ROB). Piccolo will not stall unless the register is used again as a source operand before the refill has taken place.
  • The minimum number of cycles before the refilled data is valid (best case: the data is waiting at the head of the ROB) will be either 1 or 2. Hence it is advisable not to use the refilled data on the instruction following the refill request. If use of the operand on the next two instructions can be avoided it should be, since this will prevent performance loss on deeper pipeline implementations.
  • The refill bit is specified in the assembler by suffixing the register number with a '^'.
  • The section of the register marked as empty depends on the register operand.
  • The two halves of each register may be marked for refill independently (for example X0.l^ will mark only the bottom half of X0 for refill, X0^ will mark the whole of X0 for refill).
  • XO 1 ⁇ will mark only the bottom half of XO for refill.
  • X0 ⁇ will mark the whole of XO for refill
  • the 4-bit scale field encodes fourteen scale types:
  • REPEAT instruction register re-mapping is supported, allowing a REPEAT to access a moving 'window' of registers without unrolling the loop. This is described in more detail below.
  • Destination operands have the following 7-bit format:
  • the register number (Dx) indicates which of the 16 registers is being addressed.
  • the Hi/Lo bit and the Size bit work together to address each 32-bit register as a pair of 16-bit registers.
  • the Size bit defines how the appropriate flags, as defined in the instruction type, will be set, irrespective of whether a result is written to the register bank and/or output FIFO. This allows the construction of compares and similar instructions.
  • the add with accumulate class of instruction must write back the result to a register.
  • If the write is of 16 bits, the 48-bit quantity is reduced to a 16-bit quantity by selecting the bottom 16 bits [15:0]. If the instruction saturates then the value will be saturated into the range -2^15 to 2^15-1. The 16-bit value is then written back to the indicated register and, if the Write FIFO bit is set, to the output FIFO. If it is written to the output FIFO then it is held until the next 16-bit value is written, when the values are paired up and placed into the output FIFO as a single 32-bit value.
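The 16-bit write-back path described above can be sketched in a few lines of Python. This is only an illustrative model, not the Piccolo hardware: the names `reduce_to_16` and `OutputFifo` are hypothetical, and the choice of placing the first 16-bit value in the low half of the paired 32-bit word is an assumption that the text does not state.

```python
def to_signed48(value):
    """Interpret the low 48 bits of value as a signed two's-complement number."""
    value &= (1 << 48) - 1
    return value - (1 << 48) if value & (1 << 47) else value


def reduce_to_16(result48, saturate):
    """Reduce a 48-bit result to a 16-bit pattern.

    With saturate=True the signed value is clamped to [-2**15, 2**15 - 1];
    otherwise the bottom 16 bits [15:0] are simply selected.
    """
    if saturate:
        signed = to_signed48(result48)
        signed = max(-(1 << 15), min((1 << 15) - 1, signed))
        return signed & 0xFFFF
    return result48 & 0xFFFF


class OutputFifo:
    """Holds one 16-bit value until a second arrives, then emits a 32-bit word."""

    def __init__(self):
        self.words = []      # completed 32-bit entries
        self.pending = None  # 16-bit value waiting for its partner

    def write16(self, half):
        if self.pending is None:
            self.pending = half & 0xFFFF
        else:
            # Assumption: the earlier value occupies the low half of the word.
            self.words.append(((half & 0xFFFF) << 16) | self.pending)
            self.pending = None
```

In this model a lone 16-bit write leaves `words` unchanged; only the second write produces a FIFO entry, which mirrors the "held until the next 16-bit value is written" behaviour.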
  • the destination size is specified in the assembler by a .l or .h after the register number. If no register writeback is performed then the register number is unimportant, so omit the destination register to indicate no write to a register, or use ^ to indicate a write only to the output FIFO.
  • SUB , X0, Y0 is equivalent to CMP
  • REPEAT instruction register re-mapping is supported, allowing a REPEAT to access a moving 'window' of registers without unrolling the loop. This is described in more detail below.
  • the REPEAT instruction provides a mechanism to modify the way in which register operands are specified within a loop.
  • the registers to be accessed are determined by a function of the register operand in the instruction and an offset into the register bank.
  • the offset is changed in a programmable manner, preferably at the end of each instruction loop.
  • the mechanism may operate independently on registers residing in the X, Y and Z banks. In preferred embodiments, this facility is not available for registers in the A bank.
  • the notion of a logical and physical register can be used.
  • the instruction operands are logical register references, and these are then mapped to physical register references identifying specific Piccolo registers 10. All operations, including refilling, operate on the physical register.
  • the register remapping only occurs on the Piccolo instruction stream side - data loaded into Piccolo is always destined for a physical register, and no remapping is performed.
  • FIG. 5 is a block diagram illustrating a number of the internal components of the Piccolo coprocessor 4. Data items retrieved by the ARM core 2 from memory are placed in the reorder buffer 12, and the Piccolo registers 10 are refilled from the reorder buffer 12 in the manner described earlier with reference to Figure 2. Piccolo instructions stored in the cache 6 are passed to an instruction decoder 50 within Piccolo 4, where they are decoded prior to being passed to the Piccolo processor core 54.
  • the Piccolo processor core 54 includes the multiplier/adder circuit 20, the accumulate/decumulate circuit 22, and the scale/saturate circuit 24 discussed earlier with reference to Figure 3.
  • the register remapping logic 52 is employed to perform the necessary remapping
  • the register remapping logic 52 can be considered as being part of the instruction decoder 50, although it will be apparent to those skilled in the art that the register remapping logic 52 may be provided as a completely separate entity to the instruction decoder 50
  • An instruction will typically include one or more operands identifying registers containing the data items required by the instruction.
  • a typical instruction may include two source operands and one destination operand, identifying two registers containing data items required by the instruction, and a register into which the result of the instruction should be placed.
  • the register remapping logic 52 receives the operands of an instruction from the instruction decoder 50, these operands identifying logical register references. Based on the logical register references, the register remapping logic will determine whether remapping should or should not be applied, and will then apply a remapping to physical register references as required. If it is determined that remapping should not be applied, the logical register references are provided as the physical register references. The preferred manner in which the remapping is performed will be discussed in more detail later.
  • Each output physical register reference from the register remapping logic is passed to the Piccolo processor core 54, such that the processor core can then apply the instruction to the data item in the particular register 10 identified by the physical register reference.
  • the remapping mechanism of the preferred embodiment allows each bank of registers to be split into two sections, namely a section within which registers may be remapped, and a section in which registers retain their original register references without remapping.
  • the remapped section starts at the bottom of the register bank being remapped
  • Figure 6 is a block diagram illustrating how the various parameters are used by the register remapping logic 52. It should be noted that these parameters are given values that are relative to a point within the bank being remapped, this point being, for example, the bottom of the bank
  • the register remapping logic 52 can be considered as comprising two main logical blocks, namely the Remap block 56 and the Base Update block 58.
  • the register remapping logic 52 employs a base pointer that provides an offset value to be added to the logical register reference, this base pointer value being provided to the remap block 56 by base update block 58
  • a BASESTART signal can be used to define the initial value of the base pointer, this for example typically being zero, although some other value may be specified
  • This BASESTART signal is passed to multiplexor 60 within the Base Update block 58. During the first iteration of the instruction loop, the BASESTART signal is passed by the multiplexor 60 to the storage element 66, whereas for subsequent iterations of the loop, the next base pointer value is supplied by the multiplexor 60 to the storage element 66.
  • the output of the storage element 66 is passed as the current base pointer value to the ReMap logic 56, and is also passed to one of the inputs of an adder 62 within the Base Update logic 58
  • the adder 62 also receives a BASEINC signal that provides a base increment value
  • the adder 62 is arranged to increment the current base pointer value supplied by storage element 66 by the BASEINC value, and to pass the result to the modulo circuit 64
  • the modulo circuit also receives a BASEWRAP value, and compares this value to the output base pointer signal from the adder 62. If the incremented base pointer value equals or exceeds the BASEWRAP value, the new base pointer is wrapped round to a new offset value.
  • the output of the modulo circuit 64 is then the next base pointer value to be stored in storage element 66. This output is provided to the multiplexor 60. and from there to the storage element 66.
  • this next base pointer value cannot be stored in the storage element 66 until a BASEUPDATE signal is received by the storage element 66 from the loop hardware managing the REPEAT instruction.
  • the BASEUPDATE signal will be produced periodically by the loop hardware, for example each time the instruction loop is to be repeated.
  • the storage element will overwrite the previous base pointer value with the next base pointer value provided by the multiplexor 60. In this manner, the base pointer value supplied to the ReMap logic 56 will change to the new base pointer value.
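As an illustration of the Base Update block just described, the following Python sketch models the storage element, adder and modulo circuit. The class and method names are hypothetical, and the wrap is modelled as a plain modulo operation, which is an assumption: the text only says that the pointer is "wrapped round to a new offset value" once it equals or exceeds BASEWRAP.

```python
class BaseUpdate:
    """Behavioural model of the Base Update block 58 (names are illustrative)."""

    def __init__(self, basestart, baseinc, basewrap):
        self.baseinc = baseinc
        self.basewrap = basewrap
        # First iteration: multiplexor 60 passes BASESTART to storage element 66.
        self.base = basestart

    def next_base(self):
        # Adder 62 adds BASEINC; modulo circuit 64 wraps the sum at BASEWRAP.
        return (self.base + self.baseinc) % self.basewrap

    def baseupdate(self):
        # BASEUPDATE from the loop hardware latches the next value into storage.
        self.base = self.next_base()
        return self.base
```

With BASESTART = 0, BASEINC = 1 and BASEWRAP = 4, successive BASEUPDATE signals step the pointer through 1, 2, 3 and back to 0.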
  • the physical register to be accessed inside a remapped section of a register bank is determined by the addition of a logical register reference contained within an operand of an instruction, and the base pointer value provided by the base update logic 58. This addition is performed by adder 68 and the output is passed to modulo circuit 70.
  • the modulo circuit 70 also receives a register wrap value, and if the output signal from the adder 68 (the addition of the logical register reference and the base pointer value) exceeds the register wrap value, the result will wrap through to the bottom of the remapped region. The output of the modulo circuit 70 is then provided to multiplexor 72.
  • a REGCOUNT value is provided to logic 74 within Remap block 56, identifying the number of registers within a bank which are to be remapped.
  • the logic 74 compares this REGCOUNT value with the logical register reference, and passes a control signal to multiplexor 72 dependent on the result of that comparison.
  • the multiplexor 72 receives as its two inputs the logical register reference and the output from modulo circuit 70 (the remapped register reference). In preferred embodiments of the present invention, if the logical register reference is less than the REGCOUNT value, then the logic 74 instructs the multiplexor 72 to output the remapped register reference as the physical register reference. If, however, the logical register reference is greater than or equal to the REGCOUNT value, then the logic 74 instructs the multiplexor 72 to output the logical register reference directly as the physical register reference.
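The selection performed by the Remap block can be summarised by a short Python function. This is a sketch under stated assumptions: the function name `remap` is hypothetical, register references are plain integers counted from the bottom of the bank, and the wrap performed by modulo circuit 70 is modelled as a modulo by the register wrap value.

```python
def remap(logical, base, regcount, regwrap):
    """Return the physical register reference for a logical reference.

    logical  -- logical register reference from the instruction operand
    base     -- current base pointer from the Base Update block
    regcount -- REGCOUNT: number of registers in the remapped section
    regwrap  -- REGWRAP: wrap value applied to the remapped reference
    """
    if logical < regcount:
        # Logic 74 selects the remapped path: adder 68, then modulo circuit 70.
        return (logical + base) % regwrap
    # Outside the remapped section the logical reference passes straight through.
    return logical
```

For example, with REGCOUNT = REGWRAP = 4 and a base pointer of 1, logical register 3 maps to physical register 0, while logical register 5 is left unchanged.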
  • REPEAT instructions which invokes the remapping mechanism
  • REPEAT instructions provide four zero cycle loops in hardware. These hardware loops are illustrated in Figure 5 as part of the instruction decoder 50. Each time the instruction decoder 50 requests an instruction from cache 6, the cache returns that instruction to the instruction decoder, whereupon the instruction decoder determines whether the returned instruction is a REPEAT instruction. If so, one of the hardware loops is configured to handle that REPEAT instruction.
  • Each repeat instruction specifies the number of instructions in the loop and the number of times to go around the loop (which is either a constant or read from a Piccolo register).
  • Two opcodes REPEAT and NEXT are provided for defining a hardware loop, the NEXT opcode being used merely as a delimiter and not being assembled as an instruction
  • the REPEAT goes at the start of the loop, and NEXT delimits the end of the loop, allowing the assembler to calculate the number of instructions in the loop body
  • the REPEAT instruction can include remapping parameters such as the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters to be employed by the register remapping logic 52.
  • a number of registers can be provided to store remapping parameters used by the register remapping logic. Within these registers, a number of sets of predefined remapping parameters can be provided, whilst some registers are left for the storage of user defined remapping parameters. If the remapping parameters specified with the REPEAT instruction are equal to one of the sets of predefined remapping parameters, then the appropriate REPEAT encoding is used, this encoding causing a multiplexor or the like to provide the appropriate remapping parameters from the registers directly to the register remapping logic.
  • the assembler will generate a Remapping Parameter Move Instruction (RMOV) which allows the configuration of the user defined register remapping parameters, the RMOV instruction being followed by the REPEAT instruction.
  • RMOV Remapping Parameter Move Instruction
  • the user defined remapping parameters would be placed by the RMOV instruction in the registers left aside for storing such user defined remapping parameters, and the multiplexor would then be programmed to pass the contents of those registers to the register remapping logic.
  • the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters take one of the values identified in the following chart:
  • the following update to the base pointer is performed by the base update logic 58
  • accumulator register A0 is arranged to accumulate the results of a number of multiplication operations, the multiplication operations being the multiplication of coefficient c0 by data item d0, the multiplication of coefficient c1 by data item d1, the multiplication of coefficient c2 by data item d2, etc.
  • Register A1 accumulates the results of a similar set of multiplication operations, but this time the set of coefficients has been shifted such that c0 is now multiplied by d1, c1 is now multiplied by d2, c2 is now multiplied by d3, etc. Likewise, register A2 accumulates the results of multiplying the data values by the coefficients shifted another step to the right, such that c0 is multiplied by d2, c1 is multiplied by d3, c2 is multiplied by d4, etc. This shift, multiply, and accumulate process is then repeated with the result being placed in register A3.
  • the data values are placed in the X bank of registers and the coefficient values are placed in the Y bank of registers.
  • the four accumulator registers A0, A1, A2, and A3 are set to zero.
  • an instruction loop is then entered, which is delimited by the REPEAT and NEXT instructions.
  • the value Z1 identifies the number of times that the instruction loop should be repeated, and for the reasons that will be discussed later, this will actually be equal to the number of coefficients (c0, c1, c2, etc.) divided by
  • the instruction loop comprises 16 multiply accumulate instructions (MULA), which, after the first iteration through the loop, will result in the registers A0.
  • the first instruction multiplies the data value within the first, or lower, 16 bits of the X bank register zero with the lower 16 bits within Y bank register zero, and adds the result to the accumulator register A0. At the same time the lower 16 bits of the X bank register zero are marked by a refill bit, this indicating that that part of the register can now be refilled with a new data value.
  • the second MULA instruction then multiplies the second, or higher 16 bits of the X bank register zero with the lower 16 bits of the Y bank register zero (this representing the multiplication dl x cO shown in Figure 7).
  • the third and fourth MULA instructions represent the multiplications d2 x cO, and d3 x cO, respectively.
  • coefficient c0 is no longer required and so the register Y0.l is marked by a refill bit to enable it to be overwritten with another coefficient (c4).
  • the next four MULA instructions represent the calculations d1 x c1, d2 x c1, d3 x c1, and d4 x c1, respectively.
  • the register X0.h is marked by a refill bit since d1 is no longer required.
  • the register Y0.h is marked for refilling, since the coefficient c1 is no longer needed.
  • the next four MULA instructions correspond to the calculations d2 x c2, d3 x c2, d4 x c2, and d5 x c2, whilst the final four calculations correspond to the calculations d3 x c3, d4 x c3, d5 x c3, and d6 x c3.
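The arithmetic carried out by the sixteen MULA instructions can be written as a small reference model. The Python sketch below is not Piccolo code; the function name `block_filter` and the use of plain Python lists for the coefficients and data items are assumptions made for illustration. The loop order follows the description above: for each coefficient c_i, the four accumulators receive c_i x d_i, c_i x d_(i+1), c_i x d_(i+2) and c_i x d_(i+3).

```python
def block_filter(coeffs, data, outputs=4):
    """Reference model of the block filter: acc[j] = sum_i coeffs[i] * data[i + j]."""
    if len(data) < len(coeffs) + outputs - 1:
        raise ValueError("not enough data items for the requested outputs")
    acc = [0] * outputs                  # A0..A3 start at zero
    for i, c in enumerate(coeffs):       # one group of four MULAs per coefficient
        for j in range(outputs):         # accumulator Aj receives c_i * d_(i+j)
            acc[j] += c * data[i + j]
    return acc
```

With four coefficients this performs exactly the sixteen products listed above, from d0 x c0 through d6 x c3.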
  • the instruction loop can be dramatically reduced, such that it now only includes 4 multiply accumulate instructions, rather than the 16 multiply accumulate instructions that were otherwise required
  • the code can now be written as follows
  • registers in these banks are remapped;
  • the base pointer for both banks is incremented by one on each
  • the base pointer wraps when it reaches the fourth register in the
  • the first step is to set the four accumulator registers A0-A3 to zero. Then, the instruction loop is entered, delimited by the REPEAT and NEXT opcodes.
  • the REPEAT instruction has a number of parameters associated therewith, which are as follows:
  • X++ indicates that BASEINC is '1' for the X Bank of registers
  • n4 indicates that REGCOUNT is '4' and hence the first four X Bank registers
  • w4 indicates that BASEWRAP is '4' for the X Bank of registers
  • r4 indicates that REGWRAP is '4' for the X Bank of registers
  • Y++ indicates that BASEINC is '1' for the Y Bank of registers
  • n4 indicates that REGCOUNT is '4' and hence the first four Y Bank registers
  • the base pointer value is zero, and so there is no remapping. However, next time the loop is executed, the base pointer value will be '1' for both the X and Y banks, and so the operands will be mapped as follows:
  • the four MULA instructions actually perform the calculations indicated by the fifth to eighth MULA instructions in the example discussed earlier that does not include the remapping of the present invention.
  • the third and fourth iterations through the loop perform the calculations formerly performed by the ninth to twelfth, and thirteenth to sixteenth MULA instructions of the prior art code.
  • the above code performs exactly the same block filter algorithm as the prior art code, but improves code density within the loop body by a factor of four, since only four instructions need to be provided rather than the sixteen required by the prior art.
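The equivalence between the four-instruction remapped loop and the sixteen-instruction version can be checked with a short simulation. The Python sketch below makes several assumptions that should be read as illustrative only: the 16-bit registers X0.l, X0.h, X1.l, X1.h are numbered 0 to 3 (likewise for the Y bank), the loop body is taken to use logical X registers 0 to 3 against logical Y register 0, and wrapping is modelled as a modulo operation with BASEINC = 1, REGCOUNT = BASEWRAP = REGWRAP = 4.

```python
def remap(logical, base, regcount=4, regwrap=4):
    """Logical to physical mapping, modelled as (logical + base) mod REGWRAP."""
    return (logical + base) % regwrap if logical < regcount else logical


# Assumed loop body: (logical X register, logical Y register) per MULA.
LOOP_BODY = [(0, 0), (1, 0), (2, 0), (3, 0)]


def remapped_operands(iterations=4, baseinc=1, basewrap=4):
    """Physical (X, Y) register pairs touched by the 4-instruction loop."""
    base = 0
    pairs = []
    for _ in range(iterations):
        for lx, ly in LOOP_BODY:
            pairs.append((remap(lx, base), remap(ly, base)))
        base = (base + baseinc) % basewrap   # base update at end of each loop
    return pairs


def unrolled_operands():
    """Physical (X, Y) pairs of the 16-instruction version: group k uses
    coefficient register k and data registers k..k+3, wrapping at 4."""
    return [((k + j) % 4, k) for k in range(4) for j in range(4)]
```

Comparing `remapped_operands()` with `unrolled_operands()` shows the same sequence of sixteen register pairs; the second pass of the loop, for instance, touches X registers 1, 2, 3, 0 with Y register 1, matching the fifth to eighth MULA instructions.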
  • the register remapping mechanism can make these registers available. For example, consider the example discussed earlier where the X bank of registers has four 32-bit registers available to the programmer, and hence eight 16-bit registers can be specified by logical register references. It is possible for the X bank of registers to actually consist of, for example, six 32-bit registers, in which case there will be four additional 16-bit registers not directly accessible to the programmer. However, these extra four registers can be made available by the remapping mechanism, thereby providing additional registers for the storage of data items.
  • destination register is 48 bits the saturation is still at 32 bits.
  • Source operand 1 can be one of the following formats:
  • <src1> will be used as a shorthand for [Rn|Rn.l
  • all 7 bits of the source specifier are valid and the register is read as a 32-bit value (optionally swapped) or a 16-bit value sign extended. For an accumulator only the bottom 32 bits are read.
  • the ^ specifies register refill.
  • <src1_32> is short for [Rn|Rn.x][^]. Only a 32-bit value can be read, with the upper and lower halves optionally swapped.
  • Source operand 2 can be one of the following formats:
  • <src2> will be a shorthand for three options: a source register of the form [Rn|Rn.l
  • <src2_maxmin> is the same as <src2> but a scale is not permitted.
  • <src2_shift> shift instructions provide a limited subset of <src2>. See above for details. <src2_par> as for <src2_shift>.
  • <acc> is short for any of the four accumulator registers [A0|A1|A2
  • the destination register has the format:
  • <scale> represents a number of arithmetic scales. There are fourteen available scales:
  • ASR #0, 1, 2, 3, 4, 6, 8, 10; ASR #12 to 16; LSL #1.
  • <immed_8> stands for an unsigned 8-bit immediate value. This consists of a byte rotated left by a shift of 0, 8, 16 or 24. Hence values 0xYZ000000, 0x00YZ0000, 0x0000YZ00 and 0x000000YZ can be encoded for any YZ.
  • the rotate is encoded as a 2 bit quantity.
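The rotated 8-bit immediate can be illustrated with a pair of helper functions. This Python sketch is a model only: the function names are hypothetical, and the mapping of the 2-bit rotate field to shift amounts (field value r selecting a left shift of 8 x r) is an assumption, since the text states only that the shifts are 0, 8, 16 or 24.

```python
def decode_immed8(byte, rotate):
    """Expand an 8-bit immediate and a 2-bit rotate field to a 32-bit value."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    if not 0 <= rotate <= 3:
        raise ValueError("rotate is a 2-bit quantity")
    return byte << (8 * rotate)   # assumed mapping: shift = 8 * rotate


def encode_immed8(value):
    """Return (byte, rotate) if value is encodable, otherwise None."""
    if not 0 <= value <= 0xFFFFFFFF:
        return None
    for rotate in range(4):
        shift = 8 * rotate
        if value & ~(0xFF << shift) == 0:   # only one byte lane may be non-zero
            return (value >> shift, rotate)
    return None
```

A value such as 0x00AB0000 is encodable as byte 0xAB with rotate 2, whereas 0x1234 spans two byte lanes and cannot be expressed in this form.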
  • <imm_6> stands for an unsigned 6-bit immediate.
  • <PARAMS> is used to specify register re-mapping and has the following format: <BANK><BASEINC> n<RENUMBER> w<BASEWRAP>. <BANK> can be [X|Y|Z]. <BASEINC> can be [++|+1
  • V and N flags are set differently on Piccolo than on the ARM so the translation from condition testing to flag checking is not the same as the ARM either.
  • Last result was positive Signed overflow/saturation on last result No overflow/saturation on last result Overflow positive on last result. Overflow negative on last result
  • the primary and secondary condition codes each consist of:
  • Arithmetic instructions can be divided into two types: parallel and 'full width'.
  • the 'full width' instructions only set the primary flags, whereas the parallel operators set the primary and secondary flags based on the upper and lower 16-bit halves of the result.
  • the N, Z and V flags are calculated based on the full ALU result, after the scale has been applied but prior to being written to the destination. An ASR will always reduce the number of bits required to store the result, but an ASL would increase it. To avoid this Piccolo truncates the 48-bit result when an ASL scale is applied, to limit the number of bits over which zero detect and overflow must be carried out.
  • the N flag is calculated presuming signed arithmetic is being carried out. This is because when overflow occurs, the most significant bit of the result is either the C flag or the N flag, depending on whether the input operands are signed or unsigned.
  • the V flag indicates if any loss of precision occurs as a result of writing the result to the selected destination. If no write-back is selected a 'size' is still implied, and the overflow flag is set correctly. Overflow can occur when:
  • Parallel add/subtract instructions set the N, Z and V flags independently on the upper and lower halves of the result. When writing to an accumulator the V flag is set as if writing to a 32-bit register. This is to allow saturating instructions to use accumulators as 32-bit registers.
  • the saturating absolute instruction also sets the overflow flag if the absolute value of the input operand would not fit in designated destination.
  • the Carry flag is set by add and subtract instructions and is used as a 'binary' flag by the MAX/MIN, SABS and CLB instructions. All other instructions, including multiply operations preserve the Carry flag(s).
  • the Carry is that which is generated by either bit 31 or bit 15 of the result, based on whether the destination is 32 or 16 bits wide.
  • the standard arithmetic instructions can be divided up into a number of types, depending on how the flags are set:
  • N is set if the full 48 bit result had bit 47 set (was negative).
  • V is set if either:
  • the destination register is a 32/48 bit register and the signed result will not fit into 32 bits.
  • If <dest> is a 32 or 48 bit register then the C flag is set if there is a carry out of bit 31 when summing <src1> and <src2> or if no borrow occurred from bit 31 when subtracting <src2> from <src1> (the same carry value you would expect on the
  • If <dest> is a 16-bit register then the C flag is set if there is a carry out of bit 15 of the sum.
  • the secondary flags (SZ, SN, SV, SC) are preserved.
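The flag rules listed above for an addition into a 32-bit destination can be sketched as follows. This Python model is illustrative and rests on stated assumptions: the scale is omitted, the operands are signed 32-bit values, the Z rule (set when the full result is zero) is assumed because it is not reproduced in this passage, and the function name `add_flags_32` is hypothetical.

```python
def add_flags_32(src1, src2):
    """Primary flags for src1 + src2 written to a 32-bit destination.

    src1 and src2 are signed 32-bit values held as Python ints.
    """
    for s in (src1, src2):
        if not -(1 << 31) <= s <= (1 << 31) - 1:
            raise ValueError("operands must be signed 32-bit values")
    result = src1 + src2                       # exact, as in a wider accumulator
    carry_sum = (src1 & 0xFFFFFFFF) + (src2 & 0xFFFFFFFF)
    return {
        "N": result < 0,                       # sign of the full-width result
        "Z": result == 0,                      # assumed rule: full result is zero
        "V": not -(1 << 31) <= result <= (1 << 31) - 1,  # does not fit 32 bits
        "C": carry_sum > 0xFFFFFFFF,           # carry out of bit 31
    }
```

Adding 1 to 0x7FFFFFFF sets V but not C, whereas adding 1 to -1 sets both Z and C; the two cases show that signed overflow and carry out of bit 31 are independent conditions.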
  • The Add and Subtract instructions add or subtract two registers, scale the result, and then store back to a register.
  • the operands are treated as signed values.
  • Flag updating for the non-saturating variants is optional and may be suppressed by appending an N to the end of the instruction.
  • OPC specifies the type of instruction.
  • 01 111 dest SAT((src2 - src1) (->> scale))
  • the assembler supports the following opcodes
  • CMP is a subtract which sets the flags with the register write disabled.
  • CMN is an add which sets the flags with register write disabled.
  • ADC is useful for inserting carry into the bottom of a register following a shift/MAX/MIN operation. It is also used to do a 32/32 bit divide. It also provides for extended precision adds.
  • the N bit gives finer control of the flags, in particular the carry. This enables a 32/32 bit division at 2 cycles per bit.
  • Incrementing/decrementing counters. RSB is useful for calculating shifts (a common operation).
  • a saturated RSB is needed for saturated negation (used in G.729).
  • Add/subtract accumulate instructions perform addition and subtraction with accumulation and scaling/saturation. Unlike the multiply accumulate instructions, the accumulator number cannot be specified independently of the destination register. The bottom two bits of the destination register give the number, acc, of the 48-bit accumulator to accumulate into.
  • OPC specifies the type of instruction. In the following acc is (DEST[1:0]). The Sa bit indicates saturation.
  • the ADDA (add accumulate) instruction is useful for summing two words of an array of integers with an accumulator (for instance to find their average) per cycle.
  • SUBA subtract accumulate
  • Addition with rounding can be done by using ⁇ dest> different from ⁇ acc>.
  • Addition with a rounding constant can be done by ADDA X0,X1,#16384,A0.
  • For a bit exact implementation of: sum of ((a_i * b_i) >> k) (quite common; used in TrueSpeech) the standard Piccolo code would be: MUL t1, a_0, b_0, ASR #k; ADD ans, ans, t1; MUL t2, a_1, b_1, ASR #k; ADD ans, ans, t2
  • ASR k MUL tl , a_0, b_0, ASR k MUL , a_l , b_l .
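The reason a bit exact implementation must scale each product before summing can be seen from a small comparison. The Python sketch below uses hypothetical function names and relies on the fact that Python's `>>` on integers is an arithmetic (flooring) shift, which is used here as a stand-in for ASR.

```python
def sum_of_shifted_products(a, b, k):
    """Bit exact form: each product is scaled by ASR #k before accumulation."""
    return sum((x * y) >> k for x, y in zip(a, b))


def shifted_sum_of_products(a, b, k):
    """Alternative form: products are accumulated first, then scaled once."""
    return sum(x * y for x, y in zip(a, b)) >> k
```

For a = [3, 3], b = [1, 1] and k = 1 the first form gives 2 while the second gives 3, because the bits discarded from each individual product are lost in the first form but contribute to the total in the second. The two are therefore not interchangeable when bit exactness is required.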
  • Add/Subtract in Parallel instructions perform addition and subtraction on two signed 16-bit quantities held in pairs in 32-bit registers.
  • the primary condition code flags are set from the result of the most significant 16 bits; the secondary flags are updated from the least significant half. Only 32-bit registers can be specified as the source for these instructions, although the values can be halfword swapped. The individual halves of each register are treated as signed values. The calculations and scaling are done with no loss of precision.
  • Optional saturation is provided for each instruction, for which the Sa bit must be set.
  • OPC defines the operation.
  • Each sum/difference is independently saturated if the Sa bit is set.
  • the assembler also supports
  • C is set if there is a carry out of bit 15 when adding the two upper sixteen bit halves
  • Z is set if the sum of the upper sixteen bit halves is 0
  • N is set if the sum of the upper sixteen bit halves is negative
  • V is set if the signed 17 bit sum of the upper sixteen bit halves will not fit into 16 bits (post scale). SZ, SN, SV and SC are set similarly for the lower 16-bit halves.
  • the parallel Add and Subtract instructions are useful for performing operations on complex numbers held in a single 32-bit register. They are used in the FFT kernel. They are also useful for simple addition/subtraction of vectors of 16-bit data, allowing two elements to be processed per cycle.
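A parallel add on packed 16-bit halves, with the primary and secondary flags described above, might be modelled as follows. This Python sketch is illustrative: the function names are hypothetical, registers are represented as unsigned 32-bit patterns, and the scale and halfword swap are omitted.

```python
def s16(x):
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def half_add(a, b, saturate):
    """Add two 16-bit halves; return (16-bit result pattern, flags)."""
    total = s16(a) + s16(b)                       # signed 17-bit sum
    flags = {
        "N": total < 0,
        "Z": total == 0,
        "V": not -0x8000 <= total <= 0x7FFF,      # does not fit in 16 bits
        "C": (a & 0xFFFF) + (b & 0xFFFF) > 0xFFFF,  # carry out of bit 15
    }
    if saturate:                                  # Sa bit set
        total = max(-0x8000, min(0x7FFF, total))
    return total & 0xFFFF, flags


def parallel_add(r1, r2, saturate=False):
    """Return (32-bit result, primary flags, secondary flags)."""
    hi, primary = half_add(r1 >> 16, r2 >> 16, saturate)   # upper halves
    lo, secondary = half_add(r1, r2, saturate)             # lower halves
    return (hi << 16) | lo, primary, secondary
```

Each half is saturated independently, so an overflow in the upper half does not disturb the lower half. For a complex number packed as (real, imaginary), this performs one complex addition per call.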
  • the offset is a signed 16-bit number of words. At the moment the range
  • target address branch instruction address - ⁇ - OFFSE 1
  • Conditional Add or Subtract instructions conditionally add or subtract src2 to src1.
  • OPC specifies the type of instruction. Action (OPC):
  • the Conditional Add or Subtract instruction enables efficient divide code to be constructed.
  • X0.l holds the quotient of the divide. The remainder can be recovered from X0.h depending on the value of carry.
  • Example 2: Divide the 32-bit positive value in X0 by the 32-bit positive value in X1, with early termination.
  • X2 holds the quotient and the remainder can be recovered from X0.
  • the Count Leading Bits instruction allows data to be normalised.
  • dest is set to the number of places the value in src1 must be shifted left in order for bit 31 to differ from bit 30.
  • This is a value in the range 0-30 except in the special cases where src1 is either -1 or 0, where 31 is returned.
  • Z is set if the result is zero. N is cleared.
  • C is set if src1 is either -1 or 0.
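The Count Leading Bits behaviour can be modelled directly from its definition. In the Python sketch below the function name `clb` is hypothetical, and the value returned for the special cases 0 and -1 is taken to be 31 (one more than the largest ordinary result); that value is an assumption and should be checked against the instruction definition.

```python
def clb(src1):
    """Number of left shifts needed for bit 31 to differ from bit 30.

    Returns (result, carry_flag). The carry flag models 'C is set if src1
    is either -1 or 0'.
    """
    x = src1 & 0xFFFFFFFF
    if x in (0, 0xFFFFFFFF):
        return 31, True            # assumed special-case result
    count = 0
    # Shift left until the top two bits differ; terminates because x is
    # neither all zeros nor all ones.
    while ((x >> 31) & 1) == ((x >> 30) & 1):
        x = (x << 1) & 0xFFFFFFFF
        count += 1
    return count, False
```

Shifting a value left by the returned count normalises it, so that the sign bit is followed immediately by the first significant bit.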
  • Halt and Breakpoint instructions are provided for stopping Piccolo execution.
  • Logical Operation instructions perform a logical operation on a 32 or 16-bit register.
  • the operands are treated as unsigned values.
  • OPC encodes the logical operation to perform.
  • TST is an AND with the register write disabled. TEQ is an EOR with the register write disabled.
  • Speech compression algorithms use packed bitfields for encoding information. Bit masking instructions help with extracting/packing these fields.
  • Max and Min Operation instructions perform maximum and minimum operations.
  • N is set if the result is negative
  • MAX X0, X0, #0 will convert X0 to a positive number with clipping below.
  • Max and Min Operations in Parallel instructions perform maximum, and minimum operations on parallel 16-bit data.
  • OPC specifies the type of instruction.
  • N is set if the upper 16 bits of the result is negative
  • Move Long Immediate Operation instructions allow a register to be set to any signed 16-bit, sign extended value. Two of these instructions can set a 32-bit register to any value (by accessing the high and low half in sequence). For moves between registers see the select operations.
  • Multiply Accumulate Operation instructions perform signed multiplication with accumulation or de-accumulation, scaling and saturation.
  • a one cycle sustained MULA is required for FIR.
  • MULS is used in the FFT butterfly.
  • a MUL is also useful for multiply with rounding.
  • For example, A0 = (X0*X1 + 16384) >> 15 can be done in one cycle by holding 16384 in another accumulator (A1 for example).
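The multiply-with-rounding idiom mentioned above, (X0*X1 + 16384) >> 15, can be checked numerically. The Python sketch below is a plain arithmetic model with hypothetical names; the rounding constant 16384 is half of 2^15, so adding it before the shift rounds the scaled product to the nearest integer rather than truncating towards minus infinity.

```python
def mul_round(x0, x1, rounding=16384, shift=15):
    """Model of A0 = (X0 * X1 + rounding) >> shift, with the rounding
    constant supplied as the accumulator input."""
    return (x0 * x1 + rounding) >> shift


def mul_truncate(x0, x1, shift=15):
    """Same product scaled without the rounding constant, for comparison."""
    return (x0 * x1) >> shift
```

With x0 = 3 and x1 = 16384 (0.5 in 1.15 format) the exact scaled product is 1.5: `mul_truncate` returns 1 while `mul_round` returns 2.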
  • A different <dest> and <acc> is also required for the FFT kernel.
  • Multiply Double Operation instructions perform signed multiplication, doubling the result prior to accumulation or de-accumulation, scaling and saturation.
  • OPC specifies the type of instruction.
  • the MLD instruction is required for G.729 and other algorithms which use fractional arithmetic. Most DSPs provide a fractional mode which enables a left shift of one bit at the output of the multiplier, prior to accumulation or writeback. Supporting this as a specific instruction provides more programming flexibility.
  • the name equivalents for some of the G series basic operations are:
  • MULA can be used, with the sum maintained in 33.14 format.
  • a left shift and saturate can be used at the end to convert to 1.15 format, if required.
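The effect of doubling the product, as used for fractional arithmetic, can be illustrated with a small model. The Python sketch below uses hypothetical helper names and treats the operands as signed 16-bit values in 1.15 format; the 32-bit saturation helper is included only to show the single overflow case and is an assumption about how a saturating variant would behave.

```python
def mld(acc, a, b):
    """Multiply double with accumulation: acc + 2 * a * b.

    For 1.15 operands the doubled product is in 1.31 format.
    """
    return acc + 2 * a * b


def sat32(value):
    """Clamp a value to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, value))
```

Multiplying 0.5 by 0.5 (0x4000 by 0x4000) gives 0x20000000, which is 0.25 in 1.31 format. The only product that overflows is -1 times -1 (0x8000 by 0x8000 as signed values), which yields 2^31 and saturates to 0x7FFFFFFF.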
  • Multiply Operation instructions perform signed multiplication, and optional scaling/saturation.
  • the source registers (16-bit only) are treated as signed numbers.
  • OPC specifies the type of instruction.
  • Register List Operations are used to perform actions on a set of registers.
  • the Empty and Zero instructions are provided for resetting a selection of registers prior to, or in between, routines.
  • the Output instruction is provided to store the contents of a list of registers to the output FIFO.
  • OPC specifies the type of instruction.
  • register k is marked as being empty.
  • register k (register k -» scale) is written to the output FIFO and register k is marked as being empty. if bit k of the register list is set then
  • the assembler will also support the syntax
  • the EMPTY instruction will stall until all registers to be emptied contain valid data (i.e. are not empty).
  • The OUTPUT instruction can only specify up to eight registers to output.
  • the ZERO instruction helps with this. Both are designed to improve code density by replacing a series of single register moves.
  • the OUTPUT instruction is included to improve code density by replacing a series of MOV ^, Rn instructions.
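The register list behaviour described above can be modelled roughly as follows. This is a toy sketch under several assumptions that are not in the text: a file of 16 registers, registers written to the FIFO in ascending order, the scale step omitted, and a stall represented by a False return value. The class and method names are hypothetical, and ZERO is not modelled because its effect is not spelled out here.

```python
from collections import deque

MAX_OUTPUT_REGISTERS = 8  # limit stated for the OUTPUT instruction


class RegisterFile:
    """Toy model: each register has a value and a valid (not empty) flag."""

    def __init__(self, count=16):  # register count is an assumption
        self.values = [0] * count
        self.valid = [False] * count
        self.output_fifo = deque()

    def _selected(self, reg_list):
        """Indices k for which bit k of the register list is set."""
        return [k for k in range(len(self.values)) if (reg_list >> k) & 1]

    def write(self, k, value):
        self.values[k] = value
        self.valid[k] = True

    def empty(self, reg_list):
        """Mark listed registers empty; return False if the hardware would stall."""
        selected = self._selected(reg_list)
        if not all(self.valid[k] for k in selected):
            return False  # some listed register holds no valid data yet
        for k in selected:
            self.valid[k] = False
        return True

    def output(self, reg_list):
        """Write listed registers to the output FIFO and mark them empty."""
        selected = self._selected(reg_list)
        if len(selected) > MAX_OUTPUT_REGISTERS:
            raise ValueError("OUTPUT can name at most eight registers")
        for k in selected:  # ascending order is an assumption
            self.output_fifo.append(self.values[k])  # scaling omitted
            self.valid[k] = False
```

A single call such as `rf.output(0b101)` stands in for two separate single-register moves to the output FIFO, which is the code density benefit the text refers to.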
  • the instruction encoding is as follows.
  • Each PARAMS field is comprised of the following entries:
  • RENUMBER: number of 16-bit registers to perform re-mapping on; may take the values 0, 2, 4, 8.
  • BASEINC: the amount the base pointer is incremented at the end of each loop. May take the values 1, 2, or …
  • the base wrapping modulus (BASEWRAP) may take the values 2, 4, 8.
  • The <PARAMS> field has the following format:
  • <PARAMS> ::= <BANK><BASEINC> n<RENUMBER> w<BASEWRAP>
  • the REPEAT instruction defines a new hardware loop. Piccolo uses hardware loop 0 for the first REPEAT instruction, hardware loop 1 for a REPEAT instruction nested within the first repeat instruction and so on. The REPEAT instruction does not need to specify which loop is being used. REPEAT loops must be strictly nested. If an attempt is made to nest loops to a depth greater than 4 then the behaviour is unpredictable.
  • Each REPEAT instruction specifies the number of instructions in the loop (which immediately follows the REPEAT instruction) and the number of times to go around the loop (which is either a constant or read from a Piccolo register).
  • Piccolo may take extra cycles to set the loop up.
  • the REPEAT instruction provides a mechanism to modify the way in which register operands are specified within a loop. The details are described above
  • the RFIELD operand specifies which of 16 re-mapping parameter configurations to use inside the loop.
  • the assembler provides two opcodes REPEAT and NEXT for defining a hardware loop.
  • the REPEAT goes at the start of the loop and the NEXT delimits the end of the loop, allowing the assembler to calculate the number of instructions in the loop body.
  • in the REPEAT it is only necessary to specify the number of loops, either as a constant or register. For example:
  • MULA A0, Y0.l, Z0.l
  • the assembler supports the syntax: REPEAT #iterations [, <PARAMS>]
  • if the re-mapping parameters to use for the REPEAT are equal to one of the predefined set of parameters, then the appropriate REPEAT encoding is used. If it is not then the assembler will generate an RMOV to load the user defined parameters, followed by a REPEAT instruction. See the section above for details of the RMOV instruction and the re-mapping parameters format. If the number of iterations for a loop is 0 then the action of REPEAT is UNPREDICTABLE.
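How an assembler might derive the loop length from a REPEAT/NEXT pair can be sketched as below. This is an illustration only: the function name is hypothetical, NEXT is assumed to emit no instruction of its own, a nested REPEAT is assumed to count as one instruction of the enclosing loop body, and nesting beyond the four hardware loops is rejected here rather than left unpredictable.

```python
MAX_DEPTH = 4  # the text describes nesting deeper than 4 as unpredictable


def loop_lengths(lines):
    """Map the index of each REPEAT line to the instruction count of its body."""
    lengths = {}
    stack = []    # (index of REPEAT line, instructions emitted when it opened)
    emitted = 0   # instructions emitted so far; NEXT is assumed to emit none
    for index, line in enumerate(lines):
        if not line.strip():
            continue  # ignore blank lines
        opcode = line.split()[0].upper()
        if opcode == "NEXT":
            if not stack:
                raise ValueError("NEXT without a matching REPEAT")
            start, count_at_open = stack.pop()
            lengths[start] = emitted - count_at_open
            continue
        emitted += 1
        if opcode == "REPEAT":
            if len(stack) == MAX_DEPTH:
                raise ValueError("loops nested deeper than 4")
            stack.append((index, emitted))
    if stack:
        raise ValueError("REPEAT without a matching NEXT")
    return lengths
```

Because NEXT closes the most recently opened REPEAT, the stack also enforces the strict nesting that the hardware loops require.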
  • the Saturating Absolute instruction calculates the saturated absolute of source 1.
  • the value is always saturated.
  • the absolute value of 0x80000000 is 0x7fffffff and NOT 0x80000000!
  • Z is set if the result is zero.
  • V is set if saturation occurred.
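The saturating absolute value and its Z and V flags can be sketched as follows, assuming a 32-bit two's complement source. The function name and the tuple return form are illustrative choices, not part of the instruction set description.

```python
def saturating_abs(src1):
    """Return (result, z_flag, v_flag) for a 32-bit two's complement input."""
    value = src1 & 0xFFFFFFFF
    if value & 0x80000000:
        value -= 1 << 32          # reinterpret the bit pattern as negative
    result = abs(value)
    saturated = result > 0x7FFFFFFF
    if saturated:
        result = 0x7FFFFFFF       # only 0x80000000 reaches this branch
    return result, result == 0, saturated
```

The single input that saturates is 0x80000000, whose true magnitude does not fit in a signed 32-bit result.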
  • Select Operations serve to conditionally move either source 1 or source 2 into the destination register.
  • a select is always equivalent to a move.
  • OPC specifies the type of instruction.
  • MOV<cond> A, B is equivalent to SEL<cond> A.
  • SELFT and SELFF are obtained by swapping src1 and src2 and using SELTF, SELTT.
  • Shift Operation instructions provide left and right logical shifts, right arithmetic shifts, and rotates by a specified amount.
  • the shift amount is considered to be a signed integer between -128 and +127 taken from the bottom 8 bits of the register contents, or an immediate in the range +1 to +31.
  • a shift of a negative amount causes a shift in the opposite direction by ABS(shift amount).
  • OPC specifies the type of instruction.
  • Z is set if the result is zero.
  • N is set if the result is negative.
  • C is set to the value of the last bit shifted out (as on the ARM).
  • ASR by 32 or more has result filled with, and C equal to, bit 31 of src1.
  • ROR by 32 has result equal to src1 and C set to bit 31 of src1.
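The shift amount decoding and the two boundary cases listed above can be checked with a small Python model. It is a sketch under assumptions: the function names are hypothetical, only ASR and ROR with positive amounts are modelled (the mapping of negative amounts to an opposite operation is not spelled out in the text), and ROR is limited to amounts of 1 to 32.

```python
MASK32 = 0xFFFFFFFF


def shift_amount_from_register(reg):
    """Signed shift amount (-128..127) taken from the bottom 8 bits."""
    low = reg & 0xFF
    return low - 0x100 if low & 0x80 else low


def asr(src1, amount):
    """Arithmetic shift right; returns (result, carry)."""
    if amount < 1:
        raise ValueError("only positive amounts are modelled")
    src1 &= MASK32
    sign = src1 >> 31
    if amount >= 32:
        return (MASK32 if sign else 0), sign   # filled with bit 31, C = bit 31
    signed = src1 - (1 << 32) if sign else src1
    carry = (src1 >> (amount - 1)) & 1         # last bit shifted out
    return (signed >> amount) & MASK32, carry


def ror(src1, amount):
    """Rotate right by 1..32; returns (result, carry)."""
    if not 1 <= amount <= 32:
        raise ValueError("only amounts 1..32 are modelled")
    src1 &= MASK32
    amount %= 32                               # a rotate by 32 is a full turn
    result = ((src1 >> amount) | (src1 << (32 - amount))) & MASK32
    return result, result >> 31                # last bit rotated lands in bit 31
```

For a rotate the last bit moved out of bit 0 re-enters at bit 31, so reading the carry from bit 31 of the result covers both the ordinary case and the rotate by 32.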
  • Bit and field extraction Serial registers. Undefined Instructions are set out above in the instruction set listing. Their execution will cause Piccolo to halt execution, set the U bit in the status register, and disable itself (as if the E bit in the control register was cleared). This allows any future extensions of the instruction set to be trapped and optionally emulated on existing implementations.
  • Accessing Piccolo State from ARM is as follows. State access mode is used to observe/modify the state of Piccolo. This mechanism is provided for two purposes: Context Switch and Debug.
  • Piccolo is put in state access mode by executing the PSTATE instruction. This mode allows all Piccolo state to be saved and restored with a sequence of STC and LDC instructions. When put into state access mode, the use of the Piccolo coprocessor ID PICCOLO1 is modified to allow the state of Piccolo to be accessed.
  • Bank 0 Private registers. - 1 32-bit word containing the value of the Piccolo ID Register (Read Only).
  • Bank 3 Register/Piccolo ROB/Output FIFO Status.
  • Bank 6 Loop Hardware. - 4 32-bit words containing the loop start addresses.
  • the LDC instruction is used to load Piccolo state when Piccolo is in state access mode
  • the BANK field specifies which bank is being loaded.
  • Debug Mode - Piccolo needs to respond to the same debug mechanisms as supported by ARM i.e. software through Demon and Angel, and hardware with
  • Piccolo instruction breakpoints are handled by the Piccolo Embedded ICE module: Piccolo software breakpoints are handled by the Piccolo core.
  • the hardware breakpoint system will be configurable such that both the ARM and
  • Piccolo instruction Halt or Break
  • debug mode B bit in the status register set
  • the program counter remains valid, allowing the address of the breakpoint to be recovered. Piccolo will no longer execute instructions.
  • Piccolo Software Debug -
  • the basic functionality provided by Piccolo is the ability to load and save all state to memory via coprocessor instructions when in state access mode. This allows a debugger to save all state to memory, read and/or update it, and restore it to Piccolo.
  • the Piccolo store state mechanism will be nondestructive, that is the action of storing the state of Piccolo will not corrupt any of Piccolo's internal state. This means that Piccolo can be restarted after dumping its state without restoring it again first. The mechanism to find the status of the Piccolo cache is to be determined.
  • Hardware Debug - Hardware debug will be facilitated by a scan chain on Piccolo's coprocessor interface. Piccolo may then be put into state access mode and have its state examined/modified via the scan chain.
  • the Piccolo Status register contains a single bit to indicate that it has executed a breakpointed instruction. When a breakpointed instruction is executed, Piccolo sets the B bit in the Status register, and halts execution. To be able to interrogate Piccolo, the debugger must enable Piccolo and put it into state access mode by writing to its control register before subsequent accesses can occur.
  • Figure 4 illustrates a multiplexer arrangement responsive to the Hi/Lo bit and Size bit to switch appropriate halves of the selected register to the Piccolo datapath. If the Size bit indicates 16 bits, then a sign extending circuit pads the high order bits of the datapath with 0s or 1s as appropriate.
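The operand selection performed by the multiplexer arrangement of Figure 4 can be modelled in a few lines of Python. This is a behavioural sketch only: the function and parameter names are hypothetical, and the polarity of the Hi/Lo and Size bits in the real encoding is not assumed.

```python
def select_operand(reg32, size_is_32, hi):
    """Return the 32-bit datapath value for a register operand.

    size_is_32: True selects the whole register, False selects a 16-bit half.
    hi: for 16-bit operands, True selects the high half, False the low half.
    """
    reg32 &= 0xFFFFFFFF
    if size_is_32:
        return reg32
    half = (reg32 >> 16) if hi else (reg32 & 0xFFFF)
    if half & 0x8000:
        half |= 0xFFFF0000   # sign extend: pad the high order bits with 1s
    return half              # otherwise the high order bits stay 0
```

For a register holding 0x1234ABCD, the low half reaches the datapath as 0xFFFFABCD and the high half as 0x00001234.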
EP97937703A 1996-09-23 1997-08-22 Input operand control in data processing systems Ceased EP0927391A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
GB9619826A GB2317467B (en) 1996-09-23 1996-09-23 Input operand control in data processing systems
GB9619826 1996-09-23
PCT/GB1997/002260 WO1998012627A1 (en) 1996-09-23 1997-08-22 Input operand control in data processing systems

Publications (1)

Publication Number Publication Date
EP0927391A1 true EP0927391A1 (en) 1999-07-07

Family

ID=10800363

Family Applications (1)

Application Number Title Priority Date Filing Date
EP97937703A Ceased EP0927391A1 (en) 1996-09-23 1997-08-22 Input operand control in data processing systems

Country Status (7)

Country Link
EP (1) EP0927391A1 (ko)
JP (1) JP3645574B2 (ko)
KR (1) KR20000048531A (ko)
CN (1) CN1226325A (ko)
IL (1) IL127291A0 (ko)
MY (1) MY133769A (ko)
TW (1) TW364976B (ko)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3857614B2 (ja) 2002-06-03 2006-12-13 松下電器産業株式会社 Processor
US7668897B2 (en) 2003-06-16 2010-02-23 Arm Limited Result partitioning within SIMD data processing systems
GB2409059B (en) * 2003-12-09 2006-09-27 Advanced Risc Mach Ltd A data processing apparatus and method for moving data between registers and memory
GB2409066B (en) * 2003-12-09 2006-09-27 Advanced Risc Mach Ltd A data processing apparatus and method for moving data between registers and memory
GB2409062C (en) * 2003-12-09 2007-12-11 Advanced Risc Mach Ltd Aliasing data processing registers
US7437541B2 (en) * 2004-07-08 2008-10-14 International Business Machiens Corporation Atomically updating 64 bit fields in the 32 bit AIX kernel
US8914619B2 (en) * 2010-06-22 2014-12-16 International Business Machines Corporation High-word facility for extending the number of general purpose registers available to instructions
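CN107908427B (zh) * 2011-12-23 2021-11-09 英特尔公司 Instruction for element offset calculation in a multi-dimensional array
CN108304217B (zh) * 2018-03-09 2020-11-03 中国科学院计算技术研究所 Method for converting long bit-width operand instructions into short bit-width operand instructions
CN111459546B (zh) * 2020-03-30 2023-04-18 芯来智融半导体科技(上海)有限公司 Apparatus and method for implementing variable operand bit width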

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO9812627A1 *

Also Published As

Publication number Publication date
TW364976B (en) 1999-07-21
IL127291A0 (en) 1999-09-22
JP3645574B2 (ja) 2005-05-11
KR20000048531A (ko) 2000-07-25
MY133769A (en) 2007-11-30
JP2001501001A (ja) 2001-01-23
CN1226325A (zh) 1999-08-18

Similar Documents

Publication Publication Date Title
EP0927393B1 (en) Digital signal processing integrated circuit architecture
US5881257A (en) Data processing system register control
EP1010065B1 (en) Coprocessor data access control
US5784602A (en) Method and apparatus for digital signal processing for integrated circuit architecture
US5748515A (en) Data processing condition code flags
WO1998012627A1 (en) Input operand control in data processing systems
US5969975A (en) Data processing apparatus registers
US5881263A (en) Non-instruction base register addressing in a data processing apparatus
EP0927389B1 (en) Register addressing in a data processing apparatus
EP0927390B1 (en) Processing of conditional select and move instructions
EP0927391A1 (en) Input operand control in data processing systems
EP0927392A1 (en) Data processing system register control
JP2001501329A (ja) Data processing apparatus registers

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 19981027

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): DE FR GB IT NL

RAP1 Party data changed (applicant data changed or rights of an application transferred)

Owner name: ARM LIMITED

GRAG Despatch of communication of intention to grant

Free format text: ORIGINAL CODE: EPIDOS AGRA

17Q First examination report despatched

Effective date: 20011127

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20020603