CN1226325A - Input operation control in data processing systems

Info

Publication number: CN1226325A
Application number: CN97196725A
Authority: CN
Inventors: D·V·贾加尔; S·J·格拉斯
Original assignee: Advanced Risc Machines Ltd
Current assignee: ARM Ltd
Priority date: 1996-09-23
Filing date: 1997-08-22
Publication date: 1999-08-18
Also published as: JP3645574B2; MY133769A; EP0927391A1; KR20000048531A; TW364976B; IL127291A0; JP2001501001A

Abstract

A data processing system having a plurality of registers 10 and an arithmetic logic unit 20, 22, 24 includes program instruction words having a source register bit field Sn specifying one of the registers storing an input operand data word together with an input operand size flag indicating whether the input operand has an N-bit size or (N/2)-bit size together with a high/low location flag indicating which of the high order bit positions or low order bit positions stores the input operand if it is of the smaller size. It is preferred that the arithmetic logic unit is also able to perform parallel operation program instruction words operating independently upon (N/2)-bit input operand data words stored in respective halves of a register.

Description

Input operand control in a data processing system

The present invention relates to data processing systems. More particularly, it relates to data processing systems having a plurality of registers for storing data words to be manipulated by an arithmetic logic unit under the control of program instruction words.

It is known to provide program instructions that specify whether to operate upon an input operand stored in a single register, or upon an input operand that is stored in two registers but treated as a single input operand.

Viewed from one aspect, the present invention provides apparatus for processing data, said apparatus comprising:

a plurality of registers for storing data words to be manipulated, each of said registers having a capacity of at least N bits; and

an arithmetic logic unit responsive to program instruction words to perform arithmetic logic operations specified by said program instruction words; wherein

said arithmetic logic unit is responsive to at least one program instruction word that includes:

(ⅰ) a source register bit field specifying a source register, among said plurality of registers, that stores an input operand data word for said program instruction word;

(ⅱ) an input operand size flag specifying whether said input operand data word has an N-bit size or an (N/2)-bit size; and

(ⅲ) a high/low location flag indicating, when said input operand size flag specifies an (N/2)-bit size, which of the high order bit positions and the low order bit positions of said source register stores said input operand data word.

A trend within the field of data processing is towards systems with wider data paths. Early systems had 8-bit data paths. These developed into 16-bit data paths, and 32-bit and 64-bit data paths are now common. As data path widths have increased, the widths of the registers within data processing systems have increased to match. The present invention recognises that, when the data words to be manipulated are narrower than the data path, storing each of them in a full register wastes the register resources of the device. This is particularly so in load/store architecture machines, in which all data to be manipulated must be held in a register and in which it is desirable to reduce the number of times data must be fetched from a cache or main memory. The present invention addresses these considerations by using an input operand size flag and a high/low location flag to indicate the size of the input operand and the part of the register in which it is stored. In this way, a single register can hold more than one input operand, making more efficient use of the register resources of the device, while these input operands can still be manipulated individually.

This advantage of the invention is enhanced when an N-bit data bus links a data storage device to the registers. In that case two operands can be transferred in a single bus transfer, making more efficient use of the bus bandwidth and reducing the likelihood of a performance bottleneck. In preferred embodiments of the invention, said arithmetic logic unit is responsive to at least one parallel operation program instruction word to perform separate arithmetic logic operations upon a first (N/2)-bit input operand data word and a second (N/2)-bit input operand data word, these input operand data words being stored in the high order bit positions and the low order bit positions, respectively, of a source register.

The provision of parallel operation program instruction words allows the arithmetic logic unit to make full use of its N-bit data path by performing two independent calculations, even though the input operands are smaller than the maximum data path width. This considerably increases the data processing capability of the system without incurring significant additional overhead.

A modification that may be made is that said arithmetic logic unit has a signal path that functions as a carry chain between bit positions in arithmetic logic operations and, when a parallel operation program instruction word is executed, said signal path is broken between said first (N/2)-bit input operand data word and said second (N/2)-bit input operand data word.

Whilst the parallel operation program instruction words could take many forms, it is preferred that said parallel operation program instruction word performs one of the following arithmetic logic operations:

(ⅰ) a parallel add, in which two parallel (N/2)-bit additions are performed;

(ⅱ) a parallel subtract, in which two parallel (N/2)-bit subtractions are performed;

(ⅲ) a parallel shift, in which two parallel (N/2)-bit shift operations are performed;

(ⅳ) a parallel add/subtract, in which an (N/2)-bit addition and an (N/2)-bit subtraction are performed in parallel.
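By way of illustration only, the effect of breaking the carry chain between the two (N/2)-bit halves can be sketched in software. The sketch assumes N = 32; the function names `parallel_add` and `parallel_add_sub` are hypothetical, and the choice of which half is added and which is subtracted in the add/subtract case is an assumption, since the text does not fix it.

```python
MASK16 = 0xFFFF  # assumes N = 32, so each half is 16 bits wide


def parallel_add(a: int, b: int) -> int:
    """Add two 32-bit values as two independent 16-bit lanes."""
    # Masking each lane result discards the carry out of bit 15 (and bit 31),
    # which models the carry chain being broken between the two halves.
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


def parallel_add_sub(a: int, b: int) -> int:
    """Add the high halves and subtract the low halves (assumed assignment)."""
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    lo = ((a & MASK16) - (b & MASK16)) & MASK16
    return (hi << 16) | lo
```

With an unbroken carry chain, 0x0001FFFF + 0x00010001 would carry from the low half into the high half; in the sketch the low half simply wraps and the high half is unaffected.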

A further refinement of the invention is that, when said input operand size flag indicates an N-bit size, said high/low location flag indicates whether, before use as an N-bit input operand data word, those bits stored in said high order bit positions should be moved to said low order bit positions and those bits stored in said low order bit positions moved to said high order bit positions.

This feature is particularly useful in transform operations. A highly efficient hardware implementation of this functionality includes at least one multiplexer, responsive to said high/low location flag, that selects, for supply to the low (N/2) bits of said data path, the (N/2)-bit input operand data word stored in either the high order bit positions or the low order bit positions of said source register.

In order to handle signed arithmetic without undue complexity, it is preferred to provide a circuit that sign extends an (N/2)-bit input operand data word before it is input to said N-bit data path.
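The operand selection described above can be sketched as follows, purely as an illustration. The sketch assumes N = 32 and models the size flag, the high/low location flag, the half swap for N-bit operands and the sign extension of (N/2)-bit operands; the function name `select_operand` and its parameter names are hypothetical.

```python
def select_operand(reg: int, half_size: bool, hi_flag: bool,
                   signed: bool = True) -> int:
    """Return the 32-bit value presented to the data path.

    reg       -- contents of the source register (32 bits assumed)
    half_size -- input operand size flag: True selects a 16-bit operand
    hi_flag   -- high/low location flag
    signed    -- whether a 16-bit operand is sign extended (assumption)
    """
    reg &= 0xFFFFFFFF
    hi, lo = reg >> 16, reg & 0xFFFF
    if half_size:
        # Multiplexer: route the chosen half to the low 16 bits.
        half = hi if hi_flag else lo
        if signed and half & 0x8000:
            half |= 0xFFFF0000  # sign extend to the full data path width
        return half
    if hi_flag:
        # Full-size operand with the flag set: swap the two halves.
        return (lo << 16) | hi
    return reg
```

For a register holding 0x12348001, the low half is read as 0xFFFF8001 when sign extended, the high half as 0x00001234, and the swapped full-size operand as 0x80011234.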

Viewed from another aspect, the present invention provides a method of processing data, said method comprising the steps of:

storing data words to be manipulated in a plurality of registers, each of said registers having a capacity of at least N bits; and

in response to program instruction words, performing arithmetic logic operations specified by said program instruction words; wherein

at least one program instruction word includes:

(ⅲ) a high/low location flag indicating, when said input operand size flag specifies an (N/2)-bit size, which of the high order bit positions and the low order bit positions of said source register stores said input operand data word.

An embodiment of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Fig. 1 illustrates the high level configuration of a digital signal processing apparatus;

Fig. 2 illustrates the input buffer and register configuration of a coprocessor;

Fig. 3 illustrates the data path through the coprocessor;

Fig. 4 illustrates a multiplexing circuit for reading the high or low order bits from a register;

Fig. 5 is a block diagram illustrating the register remapping logic used by the coprocessor in preferred embodiments;

Fig. 6 illustrates in more detail the register remapping logic shown in Fig. 5; and

Fig. 7 is a table illustrating a block filter algorithm.

The system described below is concerned with digital signal processing (DSP). DSP can take many forms, but is typically considered to be processing that requires the high speed (real time) handling of large volumes of data. This data typically represents some analogue physical signal. A good example of DSP is that used in digital mobile telephones, in which radio signals are received and transmitted that require decoding to and encoding from an analogue sound signal (typically using convolution, transform and correlation operations). Another example is disk drive controllers, in which the signals recovered from the disk heads are processed to yield head tracking control.

In this context, there follows a description of a digital signal processing system based upon a microprocessor core cooperating with a coprocessor (in this case the core is an ARM core from the range of microprocessors designed by Advanced RISC Machines Limited of Cambridge, United Kingdom). The interface between the microprocessor and the coprocessor, and the coprocessor architecture itself, are specifically configured to provide DSP functionality. The microprocessor core will be referred to as ARM and the coprocessor as Piccolo. ARM and Piccolo will typically be fabricated as a single integrated circuit that often includes other elements (e.g. on-chip DRAM, ROM, D/A and A/D converters) as part of an ASIC.

Piccolo is an ARM coprocessor and therefore executes part of the ARM instruction set. The ARM coprocessor instructions allow ARM to transfer data between Piccolo and memory (using Load Coprocessor, LDC, and Store Coprocessor, STC, instructions) and to transfer ARM registers to and from Piccolo (using move to coprocessor, MCR, and move from coprocessor, MRC, instructions). One way of viewing the interaction of ARM and Piccolo is that ARM acts as a powerful address generator for Piccolo data, leaving Piccolo free to perform the DSP operations that require real time handling of large volumes of data to produce corresponding real time results.

Fig. 1 illustrates the ARM 2 and Piccolo 4, with the ARM 2 issuing control signals to Piccolo 4 to control the transfer of data words to and from Piccolo 4. An instruction cache 6 stores the Piccolo program instruction words required by Piccolo 4. A single DRAM memory 8 stores all the data and instruction words required by both the ARM 2 and Piccolo 4. The ARM 2 is responsible for addressing the memory 8 and controls all data transfers. The arrangement with a single memory 8 and one set of data and address buses is simpler and cheaper than the typical DSP approach, which requires multiple memories and buses with high bus bandwidth.

Piccolo executes a second instruction stream (the DSP program instruction words) from the instruction cache 6, which controls the Piccolo data path. These instructions include digital signal processing operations, such as multiply-accumulate, and control flow instructions, such as zero-overhead loop instructions. These instructions operate on data held in the Piccolo registers 10 (see Fig. 2). This data was previously transferred from memory 8 by the ARM 2. The instructions are streamed from the instruction cache 6; the instruction cache 6 drives the data bus as a full bus master. A small Piccolo instruction cache 6 will be a direct mapped cache of 4 lines with 16 words per line (64 instructions). In some implementations it may be worthwhile to make the instruction cache larger.

Thus two tasks run independently: ARM loads data and Piccolo processes it. This allows sustained single cycle processing of 16-bit data. Piccolo has a data input mechanism (illustrated in Fig. 2) that allows ARM to prefetch sequential data, loading the data before Piccolo requires it. Piccolo can access the loaded data in any order, automatically refilling its registers as old data is used for the last time (every instruction has, for each source operand, a bit indicating that the source register should be refilled). This input mechanism is termed the reorder buffer and comprises an input buffer 12. Every value loaded into Piccolo (via an LDC or MCR, see below) carries a tag Rn specifying the register for which the value is destined. The tag Rn is stored alongside the data word in the input buffer. When a register is accessed via a register selecting circuit 14 and the instruction specifies that the data register is to be refilled, the register is marked as empty by asserting a signal E. The register is then automatically refilled by a refill control circuit 16 using the oldest loaded value in the input buffer 12 that is destined for that register. The reorder buffer holds 8 tagged values. The input buffer 12 has a form similar to a FIFO, except that data words can be extracted from the centre of the queue, after which later stored words are passed along to fill the space. Accordingly, the data words furthest from the input are the oldest, and this can be used to decide which data word should be used to refill a register when the input buffer 12 holds two data words with the correct tag Rn.
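The refill behaviour of the input buffer can be sketched as a tagged queue from which the oldest matching entry is extracted. This is an illustrative model only: the class name `InputBuffer` and its methods are hypothetical, and the stall on a full buffer is represented simply by `load` returning False.

```python
class InputBuffer:
    """Tagged input queue; entries are kept oldest first."""

    def __init__(self, capacity: int = 8):
        # The text states that the reorder buffer holds 8 tagged values.
        self.capacity = capacity
        self.entries = []  # list of (tag, word) pairs

    def load(self, tag: str, word: int) -> bool:
        """Append a tagged word; False models the ARM being stalled."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append((tag, word))
        return True

    def refill(self, tag: str):
        """Remove and return the oldest word destined for `tag`, if any."""
        for i, (entry_tag, word) in enumerate(self.entries):
            if entry_tag == tag:
                # Extraction from the middle: later entries close the gap.
                del self.entries[i]
                return word
        return None
```

When two entries carry the same tag, the one nearer the head of the list (the older one) is always chosen, matching the ordering rule described above.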

Piccolo outputs data by storing it in an output buffer 18 (a FIFO), as shown in Fig. 3. Data is written to the FIFO sequentially and read out to memory 8 by ARM in the same order. The output buffer 18 holds 8 32-bit values.

Piccolo connects to ARM via the coprocessor interface (the CP control signals of Fig. 1). When an ARM coprocessor instruction is executed, Piccolo can execute the instruction, cause ARM to wait until Piccolo is ready before executing the instruction, or refuse to execute the instruction. In the last case ARM will take an undefined instruction exception.

The most common coprocessor instructions that Piccolo executes are LDC and STC, which respectively load and store data words to and from the memory 8 via the data bus, with ARM generating all addresses. It is these instructions that load data into the reorder buffer and store data from the output buffer 18. Piccolo will stall ARM on an LDC if there is not enough room in the input reorder buffer to load the data, and on an STC if there is insufficient data in the output buffer to store, i.e. if the data ARM is expecting is not in the output buffer 18. Piccolo also executes ARM/coprocessor register transfers to allow ARM to access Piccolo's special registers.

Piccolo fetches its own instructions from memory to control the Piccolo data path illustrated in Fig. 3 and to transfer data from the reorder buffer to the registers and from the registers to the output buffer 18. The arithmetic logic unit of Piccolo that executes these instructions has a multiplier/adder circuit 20 that performs multiplies, adds, subtracts, multiply-accumulates, logical operations, shifts and rotates. The data path also includes an accumulate/decumulate circuit 22 and a scale/saturate circuit 24.

The Piccolo instructions are initially loaded from memory into the instruction cache 6, where Piccolo can access them without needing to go back to main memory.

Piccolo cannot recover from memory aborts. Therefore, if Piccolo is used in a virtual memory system, all Piccolo data must be in physical memory throughout the Piccolo task. Given the real time nature of Piccolo tasks, such as real time DSP, this is not a significant limitation. If a memory abort occurs, Piccolo will stop and set a flag in a status register S2.

Fig. 3 shows the overall data path functionality of Piccolo. The register bank 10 uses 3 read ports and 2 write ports. One write port (the L port) is used to refill registers from the reorder buffer. The output buffer 18 is updated directly from the ALU result bus 26; output from the output buffer 18 is under ARM program control. The ARM coprocessor interface performs LDC (Load Coprocessor) instructions into the reorder buffer and STC (Store Coprocessor) instructions from the output buffer 18, as well as MCR and MRC (move ARM register to/from CP register) on the register bank 10.

The remaining register ports are used for the ALU. Two read ports (A and B) drive the inputs to the multiplier/adder circuit 20, and the C read port drives the input to the accumulate/decumulate circuit 22. The remaining write port W is used to return results to the register bank 10.

The multiplier 20 performs a 16 x 16 signed or unsigned multiply, with an optional 48-bit accumulate. The scaler unit 24 can provide an immediate arithmetic or logical shift right of 0 to 31 places, followed by an optional saturate. The shifter and logical unit 20 can perform either a shift or a logical operation every cycle.

Piccolo has 16 general purpose registers named D0-D15 or A0-A3, X0-X3, Y0-Y3, Z0-Z3. The first four registers (A0-A3) are intended as accumulators and are 48 bits wide, the extra 16 bits providing a guard against overflow during many successive calculations. The remaining registers are 32 bits wide.

Each of Piccolo's registers can be treated as containing two independent 16-bit values. Bits 0 to 15 contain the low half, and bits 16 to 31 contain the high half. Instructions can specify a particular 16-bit half of each register as a source operand, or can specify the entire 32-bit register.

Piccolo also provides for saturated arithmetic. Variants of the multiply, add and subtract instructions provide a saturated result if the result is too large for the destination register. Where the destination register is a 48-bit accumulator, the value is saturated to 32 bits (i.e. there is no way to saturate a 48-bit value). There is no overflow detection on 48-bit registers. This is a reasonable restriction, since it would take at least 65536 multiply-accumulate instructions to cause an overflow.
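As an illustration of the saturation described above, the following sketch clamps a result to the signed 32-bit range. The function name `saturate32` is hypothetical and the use of signed two's complement limits is an assumption. The final lines check the 65536 figure for the worst case of unsigned 16 x 16 products accumulated into 48 bits; signed products are smaller in magnitude, so the same bound holds for them.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def saturate32(value: int) -> int:
    """Clamp a wide result to the signed 32-bit range (assumed limits)."""
    return max(INT32_MIN, min(INT32_MAX, value))


# Largest unsigned 16 x 16 product is (2**16 - 1)**2, just under 2**32.
MAX_PRODUCT = 0xFFFF * 0xFFFF

# 65536 such products still fit in 48 bits, so the 16 guard bits cover
# at least 65536 multiply-accumulate operations.
FITS_IN_48_BITS = 65536 * MAX_PRODUCT < 2**48
```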

Each Piccolo register is either marked as "empty" (E flag, see Fig. 2) or contains a value (it is not possible to have half of a register empty). Initially, all registers are marked as empty. On each cycle Piccolo attempts, using the refill control circuit 16, to fill one of the empty registers with a value from the input reorder buffer. Alternatively, if the register is written with a value from the ALU, it is no longer marked as "empty". If a register is written from the ALU while a value is waiting to be placed into that register from the reorder buffer, the result is undefined. Piccolo's execution unit will stall if an empty register is read.

The input reorder buffer (ROB) sits between the coprocessor interface and Piccolo's register bank. Data is loaded into the ROB by ARM coprocessor transfers. The ROB contains a number of 32-bit values, each with a tag indicating the Piccolo register for which the value is destined. The tag also indicates whether the data should be transferred to a whole 32-bit register or just to the bottom 16 bits of a 32-bit register. If the data is destined for a whole register, the bottom 16 bits of the entry are transferred to the bottom half of the target register and the top 16 bits are transferred to the top half of the register (sign extended if the target register is a 48-bit accumulator). If the data is destined for just the bottom half of a register (a so-called "half register"), the bottom 16 bits are transferred first.

The register tag always refers to a physical destination register; no register remapping is performed (see below regarding register remapping).

On every cycle Piccolo attempts to transfer a data item from the ROB to the register bank as follows:

- Each entry in the ROB is examined and its tag is compared with the registers that are empty, to determine whether a transfer can be made from part or all of the entry to a register.

- From the set of entries that can make a transfer, the oldest entry is selected and its data is transferred to the register bank.

- The tag of this entry is updated to mark the entry as empty. If only part of the entry was transferred, only the part transferred is marked empty.

For example, if the target register is completely empty and the selected ROB entry contains data destined for a full register, the whole 32 bits are transferred and the entry is marked empty. If the bottom half of the target register is empty and the ROB entry contains data destined for the bottom half of a register, the bottom 16 bits of the ROB entry are transferred to the bottom half of the target register and the bottom half of the ROB entry is marked as empty.

The high and low 16 bits of data in any entry can be transferred independently. If no entry contains data that can be transferred to the register bank, no transfer is made in that cycle. The table below describes all possible combinations of target ROB entry status and target register status.

Target ROB entry status	Target register Rn: empty	Target register Rn: low half empty	Target register Rn: high half empty
Full register, both halves valid	Rn.h <- entry.h; Rn.l <- entry.l; entry marked empty	Rn.l <- entry.l; entry.l marked empty	Rn.h <- entry.h; entry.h marked empty
Full register, high half valid	Rn.h <- entry.h; entry marked empty	-	Rn.h <- entry.h; entry marked empty
Full register, low half valid	Rn.l <- entry.l; entry marked empty	Rn.l <- entry.l; entry marked empty	-
Half register, both halves valid	Rn.l <- entry.l; entry.l marked empty	Rn.l <- entry.l; entry.l marked empty	-
Half register, high half valid	Rn.l <- entry.h; entry marked empty	Rn.l <- entry.h; entry marked empty	-

To summarise, the two halves of a register can be refilled independently from the ROB. The data in the ROB is marked either as destined for a whole register or as two 16-bit values destined for the bottom half of a register.
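The per-cycle transfer rules summarised above can be sketched as follows. This is an illustrative model under stated assumptions, not a description of the circuit: the names `RobEntry`, `Reg`, `transfer` and `rob_cycle` are hypothetical, a value of None stands for an empty half, and a full-register entry is assumed to move each valid half into the corresponding empty half of the target register.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RobEntry:
    reg: str            # tag: destination register, e.g. "X0"
    full: bool          # True: whole register; False: "half register" data
    hi: Optional[int]   # top 16 bits, None once transferred
    lo: Optional[int]   # bottom 16 bits, None once transferred


@dataclass
class Reg:
    hi: Optional[int] = None  # None means this half is empty
    lo: Optional[int] = None


def transfer(entry: RobEntry, reg: Reg) -> bool:
    """Move whatever part of the entry the register can accept."""
    moved = False
    if entry.full:
        if entry.lo is not None and reg.lo is None:
            reg.lo, entry.lo, moved = entry.lo, None, True
        if entry.hi is not None and reg.hi is None:
            reg.hi, entry.hi, moved = entry.hi, None, True
    elif reg.lo is None:
        # Half register: both values go to Rn.l, bottom 16 bits first.
        if entry.lo is not None:
            reg.lo, entry.lo, moved = entry.lo, None, True
        elif entry.hi is not None:
            reg.lo, entry.hi, moved = entry.hi, None, True
    return moved


def rob_cycle(rob: List[RobEntry], regs: Dict[str, Reg]) -> None:
    """One cycle: the oldest entry able to transfer does so (rob[0] is oldest)."""
    for i, entry in enumerate(rob):
        if transfer(entry, regs[entry.reg]):
            if entry.hi is None and entry.lo is None:
                del rob[i]  # entry is now completely empty
            return
```

A half-register entry therefore takes two cycles to drain: its bottom 16 bits fill Rn.l first, and its top 16 bits follow once Rn.l has been consumed and marked empty again.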

Data is loaded into the ROB using ARM coprocessor instructions. How the data is tagged in the ROB depends on which coprocessor instruction was used to perform the transfer. The following ARM instructions are available for filling the ROB with data:

LDP{<cond>}<16/32>   <dest>,[Rn]{!},#<size>

LDP{<cond>}<16/32>W  <dest>,<wrap>,[Rn]{!},#<size>

LDP{<cond>}16U       <bank>,[Rn]{!}

MPR{<cond>}          <dest>,Rn

MRP{<cond>}          <dest>,Rn

The following ARM instruction is provided for configuring the ROB:

LDPA <bank list>

The first three are assembled as LDC instructions, MPR and MRP are assembled as MCR instructions, and LDPA is assembled as a CDP instruction.

In the above, <dest> denotes a Piccolo register (A0-Z3), Rn denotes an ARM register, <size> denotes a constant number of bytes which must be a non-zero multiple of 4, and <wrap> denotes a constant (1, 2, 4, 8). Fields enclosed in { } are optional. For a transfer to fit into the reorder buffer, <size> must be at most 32. In many circumstances <size> will be smaller than this limit in order to avoid deadlock. The <16/32> field indicates whether the data being loaded should be treated as 16-bit data, in which case endianness-specific action is taken (see below), or as 32-bit data.

Note 1: In the following text, a reference to LDP or LDPW covers both the 16-bit and the 32-bit variants of the instruction.

Note 2: A 'word' is a 32-bit chunk from memory, which may consist of two 16-bit data items or one 32-bit data item.

The LDP instruction transfers a number of data items, tagging them as destined for a full register. The instruction loads <size>/4 words from address Rn in memory and inserts them into the ROB. The number of words that can be transferred is limited as follows:

- the quantity <size> must be a non-zero multiple of 4;

- <size> must be less than or equal to the size of the ROB for the particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions).

The first data item transferred is tagged as destined for <dest>, the second as destined for <dest>+1, and so on (wrapping from Z3 to A0). If the ! is specified, register Rn is afterwards incremented by <size>.
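The tagging performed by LDP can be sketched as follows, purely for illustration. The function name `ldp_tags` is hypothetical, and the register order A0-A3, X0-X3, Y0-Y3, Z0-Z3 is taken from the register naming given earlier.

```python
# Register order assumed from the naming D0-D15 = A0-A3, X0-X3, Y0-Y3, Z0-Z3.
REGS = [f"{bank}{i}" for bank in "AXYZ" for i in range(4)]
ROB_WORDS = 8  # ROB size in words for the first version, per the text


def ldp_tags(dest: str, size: int) -> list:
    """Return the destination tag for each word loaded by LDP."""
    if size <= 0 or size % 4 != 0:
        raise ValueError("<size> must be a non-zero multiple of 4")
    if size // 4 > ROB_WORDS:
        raise ValueError("<size> exceeds the ROB size")
    start = REGS.index(dest)
    # Successive words go to <dest>, <dest>+1, ..., wrapping from Z3 to A0.
    return [REGS[(start + i) % len(REGS)] for i in range(size // 4)]
```

For example, `ldp_tags("Z2", 16)` tags four words as destined for Z2, Z3, A0 and A1.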

If the LDP16 variant is used, an endianness-specific operation is performed on the two 16-bit halfwords that form each 32-bit data item as they are returned from the memory system. See the discussion of big endian and little endian support below for details.

The LDPW instruction transfers a number of data items to a set of registers. The first data item transferred is tagged as destined for <dest>, the next as destined for <dest>+1, and so on. When <wrap> transfers have occurred, the next item transferred is tagged as destined for <dest> again, and so on. The <wrap> quantity is specified in halfwords.

For LDPW, the following restrictions apply:

- the quantity <size> must be a non-zero multiple of 4;

- <size> must be less than or equal to the size of the ROB for the particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions);

- <dest> may be one of {A0, X0, Y0, Z0};

- <wrap> may be one of {2, 4, 8} halfwords for LDP32W and one of {1, 2, 4, 8} halfwords for LDP16W;

- the quantity <size> must be greater than 2*<wrap>, otherwise no wrapping occurs and the LDP instruction should be used instead.

For example, the instruction

LDP32W X0,2,[R0]!,#8

loads two words into the ROB, marking them as destined for the full register X0. R0 is incremented by 8. The instruction

LDP32W X0,4,[R0],#16

loads four words into the ROB, marking them as destined for X0, X1, X0, X1 (in that order). R0 is not affected.

For LDP16W, <wrap> may be specified as 1, 2, 4 or 8. A wrap of 1 causes all data to be tagged as destined for the bottom half of the destination register, <dest>.l. This is the 'half register' case.

For example, the instruction

LDP16W X0,1,[R0]!,#8

loads two words into the ROB, marking them as 16-bit data destined for X0.l. R0 is incremented by 8. The instruction

LDP16W X0,4,[R0],#16

behaves in a similar fashion to the LDP32W example, except that endianness-specific operations may be performed on the data as it is returned from memory.
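The four LDPW examples above can be reproduced with a short sketch of the wrap rule. The function name `ldpw_tags` and its `bits` parameter are hypothetical; the sketch assumes that a wrap of w halfwords (w >= 2) cycles through w/2 full registers starting at <dest>, and that a wrap of 1 tags every word for the bottom half of <dest>.

```python
def ldpw_tags(dest: str, wrap: int, size: int, bits: int = 32) -> list:
    """Return the destination tag for each word loaded by LDP32W/LDP16W."""
    if dest not in ("A0", "X0", "Y0", "Z0"):
        raise ValueError("<dest> must be one of A0, X0, Y0, Z0")
    allowed = (2, 4, 8) if bits == 32 else (1, 2, 4, 8)
    if wrap not in allowed:
        raise ValueError("illegal <wrap> for this variant")
    if size <= 0 or size % 4 != 0 or size > 32:
        raise ValueError("<size> must be a non-zero multiple of 4, at most 32")
    if size <= 2 * wrap:
        raise ValueError("<size> must exceed 2*<wrap>; use LDP instead")
    words = size // 4
    if wrap == 1:
        # Half register case: each word holds two 16-bit items for <dest>.l.
        return [dest + ".l"] * words
    regs_per_wrap = wrap // 2  # <wrap> is given in halfwords
    bank = dest[0]
    return [f"{bank}{i % regs_per_wrap}" for i in range(words)]
```

`ldpw_tags("X0", 4, 16)` yields X0, X1, X0, X1, and `ldpw_tags("X0", 1, 8, bits=16)` yields two words tagged X0.l, in line with the examples in the text.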

All unused encodings of the LDP instruction are reserved for future expansion.

The LDP16U instruction is provided to support the efficient transfer of non-word-aligned 16-bit data. LDP16U support is provided for registers D4 to D15 (the X, Y and Z banks). The LDP16U instruction transfers one 32-bit word of data (containing two 16-bit data items) from memory into Piccolo. Piccolo discards the bottom 16 bits of this data and stores the top 16 bits in a holding register. There is one holding register for each of the X, Y and Z banks. Once the holding register of a bank has been primed, the behaviour of LDP{W} instructions is modified if the data is destined for a register in that bank. The data loaded into the ROB is formed by concatenating the holding register with the bottom 16 bits of the data being transferred by the LDP instruction. The top 16 bits of the data being transferred are put into the holding register:

entry <- data.l | holding_register

holding_register <- data.h

This mode of operation persists until it is turned off by an LDPA instruction. The holding register does not record the destination register tag or size. These characteristics are obtained from the instruction that provides the next value of data.l.
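The holding register behaviour can be sketched as follows, by way of illustration only. The class name `UnalignedBank` and its method names are hypothetical. The sketch assumes little endian operation and reads `data.l | holding_register` as a concatenation with the holding register in the bottom 16 bits, consistent with the later statement that the holding register contents always end up in the bottom 16 bits of the word transferred to the reorder buffer.

```python
class UnalignedBank:
    """Models the holding register of one bank (X, Y or Z)."""

    def __init__(self):
        self.holding = None  # None: bank is in aligned mode

    def ldp16u(self, word: int) -> None:
        """Prime the holding register: discard data.l, keep data.h."""
        self.holding = (word >> 16) & 0xFFFF

    def ldp(self, word: int) -> int:
        """Return the 32-bit ROB entry produced by an LDP{W} transfer."""
        if self.holding is None:
            return word & 0xFFFFFFFF
        # entry <- data.l | holding_register (holding register in low half)
        entry = ((word & 0xFFFF) << 16) | self.holding
        # holding_register <- data.h
        self.holding = (word >> 16) & 0xFFFF
        return entry

    def ldpa(self) -> None:
        """Turn off unaligned mode; the held data is discarded."""
        self.holding = None
```

If memory holds the halfwords 0x0000, 0x1111, 0x2222, 0x3333, ... and the stream of interest starts at 0x1111, priming with the first word and then loading the second word yields the entry 0x22221111.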

Endianness-specific behaviour may always occur on the data returned by the memory system. There is no non-16-bit equivalent of LDP16U, since all 32-bit data items are assumed to be word aligned in memory.

The LDPA instruction is used to turn off the unaligned mode of operation initiated by an LDP16U instruction. The unaligned mode can be turned off independently on banks X, Y and Z. For example, the instruction

LDPA {X,Y}

turns off the unaligned mode on banks X and Y. The data in the holding registers of these banks is discarded.

Executing an LDPA on a bank that is not in unaligned mode is permitted and leaves that bank in aligned mode.

The MPR instruction places the contents of ARM register Rn into the ROB, destined for Piccolo register <dest>. The destination register <dest> may be any full register in the range A0-Z3. For example, the instruction

MPR X0,R3

transfers the contents of R3 into the ROB, marking the data as destined for the full register X0.

Since ARM is internally little endian, no endianness-specific behaviour occurs when data is transferred from ARM to Piccolo.

The MPRW instruction places the contents of ARM register Rn into the ROB, marking it as two 16-bit data items destined for the 16-bit Piccolo register <dest>.l. The restrictions on <dest> are the same as those for the LDPW instruction (i.e. A0, X0, Y0, Z0). For example, the instruction

MPRW X0,R3

transfers the contents of R3 into the ROB, marking the data as two 16-bit quantities destined for X0.l. It should be noted that, as for LDP16W with a wrap of 1, only the bottom half of a 32-bit register can be targeted.

As with MPR, no endianness-specific operations are applied to the data.

LDP is encoded as:

where PICCOLO1 is Piccolo's first coprocessor number (currently 8). The N bit selects between LDP32 (1) and LDP16 (0).

LDPW is encoded as:

where DEST is 0-3 for destination registers A0, X0, Y0, Z0 and WRAP is 0-3 for wrap values 1, 2, 4, 8. PICCOLO2 is Piccolo's second coprocessor number (currently 9). The N bit selects between LDP32 (1) and LDP16 (0).

LDP16U is encoded as:

where DEST is 1-3 for destination banks X, Y, Z.

LDPA is encoded as:

where BANK[3:0] is used to turn off the unaligned mode on a per bank basis. If BANK[1] is set, the unaligned mode on bank X is turned off. BANK[2] and BANK[3], if set, turn off the unaligned mode on banks Y and Z respectively. Note that this is a CDP operation.

MPR is encoded as:

MPRW is encoded as:

where DEST is 1-3 for destination registers X0, Y0, Z0.

The output FIFO can hold up to eight 32-bit values. These are transferred from Piccolo using one of the following (ARM) opcodes:

STP{<cond>}<16/32>    [Rn]{!},#<size>

MRP             Rn

The first saves <size>/4 words from the output FIFO to the address given by ARM register Rn, indexing Rn if the ! is present. To prevent deadlock, <size> must not be greater than the size of the output FIFO (8 entries in this implementation). If the STP16 variant is used, endianness-specific behaviour may occur on the data returned to the memory system.

The MRP instruction removes one word from the output FIFO and places it in ARM register Rn. As with MPR, no endianness-specific operations are applied to the data.

The ARM encoding of STP is:

where N selects between STP32 (1) and STP16 (0). For the definitions of the P, U and W bits, refer to the ARM data sheet.

The ARM encoding of MRP is:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

COND	1110	0	1	0	1	0000	Rn	PICCOLO1	000	1	0000

The Piccolo instruction set assumes little-endian operation internally. For example, when a 32-bit register is accessed as two 16-bit halves, the lower half is assumed to occupy bits 15 to 0. Piccolo may be operating in a system with big-endian memory or peripherals, and must therefore take care to load 16-bit packed data in the correct manner.

Piccolo, like the ARM (for example the ARM7 microprocessor produced by Advanced RISC Machines Limited of Cambridge, United Kingdom), has a 'BIGEND' configuration pin which the programmer can control, perhaps with a programmable peripheral. Piccolo uses this pin to configure the input reorder buffer and the output FIFO.

When the ARM loads packed 16-bit data into the reorder buffer, it must indicate this by using the 16-bit form of the LDP instruction. This information is combined with the state of the 'BIGEND' configuration input to place the data into the holding latch and the reorder buffer in the appropriate order. In particular, in big-endian mode the holding register stores the bottom 16 bits of the loaded word, and is paired with the top 16 bits of the next load. The contents of the holding register always end up in the bottom 16 bits of the word transferred into the reorder buffer.

The output FIFO may contain either packed 16-bit or 32-bit data. The programmer must use the correct form of the STP instruction so that Piccolo can ensure that the 16-bit data is provided on the correct half of the data bus. When Piccolo is configured as big-endian, the top and bottom 16-bit halves are swapped when the 16-bit form of STP is used.
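By way of illustration only, the exchange of half-words performed for the 16-bit form of STP may be sketched in Python as follows. The function name stp16_bus_word and its arguments are hypothetical and are introduced purely for this sketch; the sketch assumes that the only endianness-dependent effect is the exchange of the two 16-bit halves described above.

```python
def stp16_bus_word(fifo_word: int, big_endian: bool) -> int:
    """Model the 32-bit word presented on the data bus by a 16-bit-format STP.

    fifo_word  -- 32-bit output FIFO entry holding two packed 16-bit items
    big_endian -- state of the 'BIGEND' configuration input
    """
    fifo_word &= 0xFFFFFFFF
    if not big_endian:
        # Little-endian system: the word is passed through unchanged.
        return fifo_word
    low = fifo_word & 0xFFFF    # first 16-bit item written
    high = fifo_word >> 16      # second 16-bit item written
    # Big-endian system: the top and bottom halves are exchanged.
    return (low << 16) | high
```

With such a model, a FIFO entry of 0x11112222 would be presented unchanged in a little-endian system and as 0x22221111 in a big-endian system.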

Piccolo has 4 private registers which can only be accessed from the ARM. They are called S0-S2. They can only be accessed with MRC and MCR instructions. The opcodes are:

　　MPSR      Sn,Rm

　　MRPS      Rm,Sn

These opcodes transfer a 32-bit value between ARM register Rm and private register Sn. They are encoded in the ARM as coprocessor register transfers:

where L is 0 for MPSR and 1 for MRPS. Register S0 contains the Piccolo unique ID and revision code.

Bits [3:0] contain the revision number of the processor.

Bits [15:4] contain a 3-digit part number in binary-coded decimal format: 0x500 for Piccolo.

Bits [23:16] contain the architecture version: 0x00 = version 1.

Bits [31:24] contain the ASCII code of the implementer's trademark: 0x41 = A = ARM Ltd.
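The field layout of register S0 given above may, purely as an illustrative sketch, be unpacked as follows. The function name decode_piccolo_id and the dictionary keys are assumptions made for this example; the revision number 1 used in the example word is likewise hypothetical.

```python
def decode_piccolo_id(s0: int) -> dict:
    """Split a 32-bit S0 value into the four fields described above."""
    return {
        "revision": s0 & 0xF,                      # bits [3:0]
        "part_number": (s0 >> 4) & 0xFFF,          # bits [15:4], BCD digits
        "architecture": (s0 >> 16) & 0xFF,         # bits [23:16]
        "implementer": chr((s0 >> 24) & 0xFF),     # bits [31:24], ASCII
    }


# Example word: implementer 'A', architecture 0x00, part 0x500,
# and a hypothetical revision number of 1.
example_s0 = (0x41 << 24) | (0x00 << 16) | (0x500 << 4) | 0x1
fields = decode_piccolo_id(example_s0)
```

Because the part number is held in binary-coded decimal, printing it in hexadecimal (for example with format(fields["part_number"], "03x")) yields the three decimal digits directly.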

Register S1 is the Piccolo status register.

Primary condition code flags (N, Z, C, V)

Secondary condition code flags (SN, SZ, SC, SV)

E bit: Piccolo has been disabled by the ARM and has halted.

U bit: Piccolo encountered an undefined instruction and has halted.

B bit: Piccolo encountered a breakpoint and has halted.

H bit: Piccolo encountered a halt instruction and has halted.

A bit: Piccolo suffered a memory abort (load, store or Piccolo instruction) and has halted.

D bit: Piccolo has detected a deadlock condition and has halted (see below).

Register S2 is the Piccolo program counter:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

Program counter	0

Writing to the program counter starts Piccolo executing a program at that address (leaving the halted state if it is halted). On reset the program counter is undefined, since Piccolo is always started by writing to the program counter.

During execution, Piccolo monitors the execution of instructions and the status of the coprocessor interface. If it detects that:

- Piccolo has stalled waiting for either a register to be refilled or the output FIFO to have an available entry.

- The coprocessor interface is busy-waiting because of insufficient space in the ROB or insufficient items in the output FIFO.

If both of these conditions are detected, Piccolo sets the D bit in its status register, halts and rejects the ARM coprocessor instruction, causing the ARM to take the undefined instruction trap.

This detection of deadlock conditions allows a system to be constructed which can at least warn the programmer that the condition has occurred and report the exact point of failure, by reading the ARM and Piccolo program counters and registers. It should be stressed that deadlock can only be caused by an incorrect program or by another part of the system corrupting Piccolo's state. Deadlock cannot be caused by data starvation or 'overload'.

Several operations are available for controlling Piccolo from the ARM; these are provided by CDP instructions. These CDP instructions are only accepted when the ARM is in a privileged mode. If this is not the case, Piccolo will reject the CDP instruction, causing the ARM to take the undefined instruction trap. The following operations are available:

- Reset

- Enter state access mode

- Enable

- Disable

Piccolo may be reset in software by using the PRESET instruction.

PRESET ; Clear Piccolo's state

This instruction is encoded as:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

COND	1110	0000	PICCOLO1	000	0	0000

When this instruction is executed, the following occurs:

- All registers are marked as empty (ready for refill).

- The input ROB is cleared.

- The output FIFO is cleared.

- The loop counters are reset.

- Piccolo is put into the halted state (and the H bit of S2 will be set).

Executing the PRESET instruction may take several cycles to complete (2-3 for this embodiment). While it is executing, subsequent ARM coprocessor instructions to be executed on Piccolo will be busy-waited.

In state access mode, Piccolo's state may be saved and restored using STC and LDC instructions (see below regarding accessing Piccolo state from the ARM). To enter state access mode, the PSTATE instruction must first be executed:

PSTATE ; Enter state access mode

This instruction is encoded as:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

COND	1110	0001	0000	PICCOLO1	000	0	000

When executed, the PSTATE instruction will:

- Halt Piccolo (if it is not already halted), setting the E bit in Piccolo's status register.

- Configure Piccolo into its state access mode.

Executing the PSTATE instruction may take several cycles to complete, since Piccolo's instruction pipeline must drain before it can halt. While it is executing, subsequent ARM coprocessor instructions to be executed on Piccolo will be busy-waited.

The PENABLE and PDISABLE instructions are used for fast context switching. When Piccolo is disabled, only private registers 0 and 1 (the ID and status registers) are accessible, and only from a privileged mode. Access to any other state, or any access from user mode, will cause an ARM undefined instruction exception. Disabling Piccolo causes it to halt execution. When Piccolo has halted execution, it acknowledges the fact by setting the E bit in the status register.

Piccolo is enabled by executing the PENABLE instruction:

PENABLE ; Enable Piccolo

This instruction is encoded as:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

COND	1110	0010	0000	PICCOLO1	000	0	0000

Piccolo is disabled by executing the PDISABLE instruction:

PDISABLE ; Disable Piccolo

This instruction is encoded as:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

COND	1110	0011	0000	PICCOLO1	000	0	0000

When this instruction is executed, the following occurs:

- Piccolo's instruction pipeline will drain.

- Piccolo will halt and the H bit in the status register will be set.

The Piccolo instruction cache holds the Piccolo instructions which control the Piccolo datapath. If present, it is guaranteed to hold at least 64 instructions, starting on a 16-word boundary. The following ARM opcode assembles into an MCR. Its action is to force the cache to fetch a line of (16) instructions starting at the specified address (which must be on a 16-word boundary). This fetch occurs even if the cache already holds data related to this address.

PMIR Rm

Piccolo must be halted before a PMIR can be performed.

The MCR encoding of this opcode is:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

COND	1110	011	L	0000	Rm	PICCOLO1	000	1	0000

This section discusses the Piccolo instruction set which controls the Piccolo datapath. Each instruction is 32 bits long. The instructions are read from the Piccolo instruction cache.

Decoding the instruction set is quite straightforward. The top 6 bits (26 to 31) give a major opcode, with bits 22 to 25 providing a minor opcode for a few specific instructions. Bits shaded in grey are currently unused and reserved for expansion (they must contain the indicated value at present).

There are eleven major instruction classes. This does not fully correspond to the major opcode field in the instruction, in order to ease decoding of some sub-classes.

The instructions in the above table have the following names:

Standard data operation

Logical operation

Conditional add/subtract

Undefined

Shift

Select

Undefined

Parallel select

Multiply accumulate

Undefined

Multiply double

Undefined

Move signed immediate

Undefined

Repeat

Register list operations

Branch

Renaming parameter move

Halt/break

The formats of each class of instruction are described in detail in the following sections. The source and destination operand fields are common to most instructions and are described in detail in separate sections, as is register re-mapping.

Most instructions require two source operands, source 1 and source 2. There are a few exceptions, such as saturating absolute value.

The source 1 (SRC1) operand has the following 7-bit format:

18	17	16 15 14 13	12
Size	Refill	Register number	Hi/Lo

The elements of the field have the following meanings:

- Size - indicates the size of the operand to be read (1 = 32-bit, 0 = 16-bit).

- Refill - specifies that the register should be marked as empty after being read and may be refilled from the ROB.

- Register number - encodes which of the 16 32-bit registers is to be read.

- Hi/Lo - for 16-bit reads, indicates which half of the 32-bit register is to be read. For 32-bit operands, when set, indicates that the two 16-bit halves should be swapped.

Size	Hi/Lo	Portion of register accessed
0	0	Low 16 bits
0	1	High 16 bits
1	0	Full 32 bits
1	1	Full 32 bits, halves swapped

In the assembler, the register size is specified by appending a suffix to the register number: .l for the low 16 bits, .h for the high 16 bits, or .x for 32 bits with the upper and lower 16 bits interchanged.

The general source 2 (SRC2) operand has one of the following three 12-bit formats:

11 10 9 8 7 6 5 4 3 2 1 0

0	S2	R2	Register number	Hi/Lo	Scale
1	0	ROT	IMMED_8
1	1	IMMED_6	Scale

Fig. 4 illustrates a multiplexer arrangement which, in accordance with the Hi/Lo bit and the Size bit, switches the appropriate half of the selected register onto the Piccolo datapath. If the Size bit indicates 16 bits, a sign-extending circuit pads the high-order bits of the datapath with 0s or 1s as appropriate.
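The selection performed by the Size and Hi/Lo bits, together with the sign extension of 16-bit operands, may be modelled as in the following sketch. This is an illustration under stated assumptions rather than a description of the circuit of Fig. 4: the names sign_extend16 and read_source are hypothetical, and the result is returned as an unsigned 32-bit pattern.

```python
def sign_extend16(value: int) -> int:
    """Sign-extend a 16-bit pattern to a signed Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def read_source(reg32: int, size: int, hi_lo: int) -> int:
    """Return the 32-bit pattern placed on the datapath for a source operand.

    size  -- 1 for a 32-bit read, 0 for a 16-bit read
    hi_lo -- half select for 16-bit reads, half swap for 32-bit reads
    """
    reg32 &= 0xFFFFFFFF
    low, high = reg32 & 0xFFFF, reg32 >> 16
    if size == 0:
        # 16-bit read: pick one half and sign-extend it to 32 bits.
        half = high if hi_lo else low
        return sign_extend16(half) & 0xFFFFFFFF
    if hi_lo:
        # 32-bit read with the two halves exchanged (the .x suffix).
        return (low << 16) | high
    return reg32
```

For a register holding 0x80017FFF, the four combinations of Size and Hi/Lo would yield 0x00007FFF (.l), 0xFFFF8001 (.h), 0x80017FFF (full) and 0x7FFF8001 (.x).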

The first encoding specifies the source as being a register, the fields having the same encoding as the SRC1 specifier. The SCALE field specifies a scale to be applied to the result of the ALU.

Scale (bits 3 2 1 0)	Action
0 0 0 0	ASR #0
0 0 0 1	ASR #1
0 0 1 0	ASR #2
0 0 1 1	ASR #3
0 1 0 0	ASR #4
0 1 0 1	Reserved
0 1 1 0	ASR #6
0 1 1 1	ASL #1
1 0 0 0	ASR #8
1 0 0 1	ASR #16
1 0 1 0	ASR #10
1 0 1 1	Reserved
1 1 0 0	ASR #12
1 1 0 1	ASR #13
1 1 1 0	ASR #14
1 1 1 1	ASR #15
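The scale encodings tabulated above may be applied to an ALU result as in the following sketch. The names SCALE_TABLE and apply_scale are hypothetical. The sketch treats the ALU result as an unbounded signed Python integer, so the 48-bit width of the datapath and any subsequent saturation are deliberately not modelled.

```python
# Mapping from the 4-bit SCALE field to (operation, shift amount).
# None marks the two reserved encodings.
SCALE_TABLE = {
    0b0000: ("ASR", 0), 0b0001: ("ASR", 1), 0b0010: ("ASR", 2),
    0b0011: ("ASR", 3), 0b0100: ("ASR", 4), 0b0101: None,
    0b0110: ("ASR", 6), 0b0111: ("ASL", 1), 0b1000: ("ASR", 8),
    0b1001: ("ASR", 16), 0b1010: ("ASR", 10), 0b1011: None,
    0b1100: ("ASR", 12), 0b1101: ("ASR", 13), 0b1110: ("ASR", 14),
    0b1111: ("ASR", 15),
}


def apply_scale(value: int, scale: int) -> int:
    """Apply the scale selected by a 4-bit SCALE field to a signed value."""
    operation = SCALE_TABLE[scale & 0xF]
    if operation is None:
        raise ValueError("reserved scale encoding")
    kind, amount = operation
    # Python's >> on a negative integer is an arithmetic shift.
    return value >> amount if kind == "ASR" else value << amount
```

Fourteen of the sixteen encodings are usable, which agrees with the fourteen scale types mentioned elsewhere in this description.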

The 8-bit immediate with rotate encoding allows the generation of a 32-bit immediate which is expressible as an 8-bit value and a 2-bit rotate. The following table shows the immediate values that can be generated from the 8-bit value XY:

ROT	Immediate
00	0x000000XY
01	0x0000XY00
10	0x00XY0000
11	0xXY000000
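The rotate table above amounts to placing the 8-bit value in one of the four byte positions of a 32-bit word. A minimal sketch follows; the function name rotated_immediate is hypothetical.

```python
def rotated_immediate(imm8: int, rot: int) -> int:
    """Expand an 8-bit immediate and a 2-bit ROT field to a 32-bit value."""
    if not 0 <= imm8 <= 0xFF:
        raise ValueError("immediate must fit in 8 bits")
    if not 0 <= rot <= 3:
        raise ValueError("ROT must fit in 2 bits")
    # ROT selects the byte position: 0 -> bits 7:0, ... 3 -> bits 31:24.
    return imm8 << (8 * rot)
```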

The 6-bit immediate encoding allows the use of a 6-bit unsigned immediate (range 0 to 63), together with a scale applied to the output of the ALU.

The general source 2 encoding is common to most instruction variants. There are some exceptions to this rule, which support a limited subset of the source 2 encoding or modify it slightly:

- Select instructions.

- Shift instructions.

- Parallel operations.

- Multiply accumulate instructions.

- Multiply double instructions.

Select instructions only support an operand which is a register or a 6-bit unsigned immediate. The scale is not available, because these bits are used by the mode field of the instruction.

Shift instructions only support an operand which is a 16-bit register or a 5-bit unsigned immediate between 1 and 31. No scale of the result is available.

In the case of parallel operations, if a register is specified as the source of the operand, a 32-bit read must be performed. The immediate encoding has a slightly different meaning for parallel operations: it allows an immediate to be duplicated into both 16-bit halves of a 32-bit operand. A slightly restricted range of scales is available for parallel operations.

If the 6-bit immediate is used, it is always duplicated into both halves of the 32-bit quantity. If the 8-bit immediate is used, it is duplicated only if the rotate indicates that the 8-bit immediate should be rotated into the top half of the 32-bit quantity.

ROT	Immediate
00	0x000000XY
01	0x0000XY00
10	0x00XY00XY
11	0xXY00XY00
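For parallel operations, the duplication rule and the table above may be sketched as follows. The names parallel_immediate8 and parallel_immediate6 are hypothetical; the sketch assumes that duplication simply copies the lower 16-bit half into the upper half.

```python
def parallel_immediate8(imm8: int, rot: int) -> int:
    """Expand an 8-bit immediate for a parallel operation."""
    if not 0 <= imm8 <= 0xFF or not 0 <= rot <= 3:
        raise ValueError("imm8 must fit in 8 bits and ROT in 2 bits")
    # The low ROT bit selects the byte within a 16-bit half.
    half = imm8 << (8 * (rot & 1))
    # The high ROT bit indicates rotation into the top half, in which
    # case the value is duplicated into both halves.
    return (half << 16) | half if rot & 2 else half


def parallel_immediate6(imm6: int) -> int:
    """A 6-bit immediate is always duplicated into both halves."""
    if not 0 <= imm6 <= 0x3F:
        raise ValueError("immediate must fit in 6 bits")
    return (imm6 << 16) | imm6
```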

No scale is available for parallel select operations; the scale field must be set to 0 for these instructions.

The multiply accumulate instructions do not allow an 8-bit rotated immediate to be specified. Bit 10 of the field is used to partly specify which accumulator to use. Source 2 is implied to be a 16-bit operand.

Multiply double instructions do not allow the use of a constant. Only a 16-bit register can be specified. Bit 10 of the field is used to partly specify which accumulator to use.

Some instructions always imply a 32-bit operation (for example ADDADD), and in these cases the Size bit should be set to 1, the Hi/Lo bit being used to optionally swap the two 16-bit halves of the 32-bit operand. Some instructions always imply a 16-bit operation (for example MUL), and the Size bit should be set to 0. The Hi/Lo bit then selects which half of the register is used (it is assumed that the missing Size bit is clear). Multiply accumulate instructions allow independent specification of the source accumulator and the destination register. For these instructions, the Size bits are used to indicate the source accumulator, and the size is implied by the instruction type to be 0.

When a 16-bit value is read (via the A or B bus) it is automatically sign-extended to a 32-bit quantity. If a 48-bit register is read (via the A or B bus), only the bottom 32 bits appear on the bus. Hence in all cases source 1 and source 2 are converted to 32-bit values. Only accumulate instructions using bus C can access the full 48 bits of an accumulator register.

If the refill bit is set, the register is marked as empty after use and will be refilled from the ROB by the usual refill mechanism (see the section on the ROB). Piccolo will not stall unless the register is used again as a source operand before the refill has taken place. The minimum number of cycles before the refilled data is valid (best case: the data is waiting at the head of the ROB) is either 1 or 2. It is therefore advisable not to use the refilled data in the instruction following the refill request. If use of the operand in the next two instructions can be avoided, it should be, since this prevents a performance loss on deeper pipeline implementations.

The refill bit is specified in the assembler by suffixing the register number with a '^'. The section of the register marked as empty depends on the register operand. The two halves of each register may be marked for refill independently (for example, X0.l^ marks only the bottom half of X0 for refill, whereas X0^ marks the whole of X0 for refill). When the top 'half' (bits 47:16) of a 48-bit register is refilled, the 16 bits of data are written to bits 31:16 and sign-extended up to bit 47.

If an attempt is made to refill the same register twice (for example ADD X0, X0^, X0^), only one refill takes place. The assembler should only allow the syntax ADD X1, X0, X0^.

If an attempt is made to read a register before that register has been refilled, Piccolo stalls waiting for the register to be refilled. If a register is marked for refill, and the register is then updated before the refilled value has been read, the result is unpredictable (for example, ADD X0, X0^, X1 is unpredictable, since it marks X0 for refill and then fills it by placing the sum of X0 and X1 into it).

The 4-bit scale field encodes fourteen scale types:

- ASR #0, 1, 2, 3, 4, 6, 8, 10

- ASR #12 to 16

- LSL #1

Parallel Max/Min instructions do not provide a scale, and therefore the 6-bit constant variant of source 2 is unused (it is set to 0 by the assembler).

Register re-mapping is supported within a REPEAT instruction, allowing a REPEAT to access a moving 'window' of registers without unrolling the loop. This is described in more detail below.

The destination operand has the following 7-bit format:

25	24	23	22 21 20 19
F	SD	HL	DEST

There are ten variations of this basic encoding:

The register number (Dx) indicates which of the 16 registers is being addressed. The Hi/Lo bit and the Size bit work together to address each 32-bit register as a pair of 16-bit registers. The Size bit defines how the appropriate flags, as defined in the instruction type, are set, irrespective of whether the result is written to the register bank and/or the output FIFO; this allows the construction of compares and similar instructions. The add-with-accumulate class of instruction must write its result back to a register.

The following table shows the behaviour of each encoding:

Encoding	Register write	FIFO write	V flag
1	Write whole register	No write	32-bit overflow
2	Write whole register	Write 32 bits	32-bit overflow
3	Write low 16 bits to Dx.l	No write	16-bit overflow
4	Write low 16 bits to Dx.l	Write low 16 bits	16-bit overflow
5	Write low 16 bits to Dx.h	No write	16-bit overflow
6	Write low 16 bits to Dx.h	Write low 16 bits	16-bit overflow
7	No write	No write	16-bit overflow
8	No write	No write	32-bit overflow
9	No write	Write low 16 bits	16-bit overflow
10	No write	Write 32 bits	32-bit overflow

In all cases the result of any operation, prior to being written back to a register or inserted into the output FIFO, is a 48-bit quantity. There are two cases:

If the write is of 16 bits, the 48-bit quantity is reduced to a 16-bit quantity by selecting the bottom 16 bits [15:0]. If the instruction saturates, the value is saturated into the range -2^15 to 2^15-1. The 16-bit value is then written back to the indicated register and, if the Write FIFO bit is set, to the output FIFO. If it is written to the output FIFO, it is held until the next 16-bit value is written, when the two values are paired up and placed into the output FIFO as a single 32-bit value.

For 32-bit writes, the 48-bit quantity is reduced to a 32-bit quantity by selecting the bottom 32 bits [31:0].

For both 32-bit and 48-bit writes, if the instruction saturates, the 48-bit value is converted to a 32-bit value in the range -2^31 to 2^31-1. Following the saturation:

- If write-back to an accumulator is performed, the entire 48 bits are written.

- If write-back to a 32-bit register is performed, bits [31:0] are written.

- If write-back to the output FIFO is indicated, again bits [31:0] are written.
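The reduction of the 48-bit result to 16 or 32 bits, with or without saturation, may be sketched as follows. The function name reduce_result is hypothetical, and the 48-bit result is represented here simply as a signed Python integer; the sketch returns the written value as a signed integer of the requested width.

```python
def reduce_result(value: int, bits: int, saturating: bool) -> int:
    """Reduce a signed result to `bits` bits (16 or 32 in this description).

    If `saturating` is true the value is clamped to the signed range
    -2^(bits-1) .. 2^(bits-1)-1; otherwise the bottom `bits` bits are
    selected and reinterpreted as a signed quantity.
    """
    if saturating:
        lowest = -(1 << (bits - 1))
        highest = (1 << (bits - 1)) - 1
        return max(lowest, min(highest, value))
    raw = value & ((1 << bits) - 1)          # select bits [bits-1:0]
    return raw - (1 << bits) if raw >> (bits - 1) else raw
```

For example, a result of 40000 written as a saturating 16-bit value would become 32767, whereas the non-saturating form would keep only bits [15:0] and so yield -25536.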

The destination size is specified in the assembler by a .l or .h after the register number. If no register write-back is performed, the register number is unimportant, so the destination register is omitted to indicate no write to a register, or ^ is used to indicate a write only to the output FIFO. For example, SUB , X0, Y0 is equivalent to CMP X0, Y0, and ADD ^, X0, Y0 places the value of X0+Y0 into the output FIFO.

If there is no room in the output FIFO for a value, Piccolo stalls waiting for space to become available.

If a 16-bit value is written out, for example ADD X0.h^, X1, X2, the value is latched until a second 16-bit value is written. The two values are then combined and placed into the output FIFO as a 32-bit number. The first 16-bit value written always appears in the least significant half of the 32-bit word. Data entered into the output FIFO is marked as either 16-bit or 32-bit data, to allow endianness to be corrected on big-endian systems.

If a 32-bit value is written between two 16-bit writes, the behaviour is undefined.
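The pairing of 16-bit writes described above may be illustrated by the following sketch. The class OutputFifo and its methods are hypothetical names for this example. Two behaviours are assumptions of the sketch rather than statements of the description: a full FIFO raises an exception instead of stalling, and a 32-bit write between two 16-bit writes, which the description leaves undefined, is rejected.

```python
from collections import deque


class OutputFifo:
    """Toy model of the output FIFO with a one-entry 16-bit holding latch."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.entries = deque()   # (32-bit word, tag) with tag 16 or 32
        self._held = None        # first 16-bit value awaiting its partner

    def write16(self, value: int) -> None:
        value &= 0xFFFF
        if self._held is None:
            self._held = value   # latch until the second 16-bit write
            return
        # The first value written occupies the least significant half.
        self._push((value << 16) | self._held, 16)
        self._held = None

    def write32(self, value: int) -> None:
        if self._held is not None:
            raise RuntimeError("32-bit write between two 16-bit writes")
        self._push(value & 0xFFFFFFFF, 32)

    def _push(self, word: int, tag: int) -> None:
        if len(self.entries) >= self.depth:
            raise BufferError("output FIFO full (hardware would stall)")
        self.entries.append((word, tag))
```

Writing 0x1111 and then 0x2222 as 16-bit values would produce the single entry 0x22221111 tagged as 16-bit data.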

Register re-mapping is supported within a REPEAT instruction, allowing a REPEAT to access a moving 'window' of registers without unrolling the loop. This is described in more detail below.

In the preferred embodiment of the present invention, the REPEAT instruction provides a mechanism for modifying the way in which register operands are specified within a loop. Under this mechanism, the register to be accessed is determined by a function of the register operand in the instruction and an offset into the register bank. The offset is changed in a programmable manner, preferably at the end of each instruction loop. The mechanism can operate independently on registers residing in the X, Y and Z banks. In the preferred embodiment, this facility is not available for registers in the A bank.

The notion of logical and physical registers can be used. The instruction operands are logical register references, and these are then mapped to physical register references identifying specific Piccolo registers 10. All operations, including refill, operate on physical registers. Register re-mapping occurs only on the Piccolo instruction stream side: data loaded into Piccolo is always destined for a physical register, and no re-mapping is performed.

The re-mapping mechanism will be discussed further with reference to Fig. 5, which is a block diagram illustrating a number of the internal components of the Piccolo coprocessor 4. Data items retrieved by the ARM core 2 from memory are placed in the reorder buffer 12, and the Piccolo registers 10 are refilled from the reorder buffer 12 in the manner described earlier with reference to Fig. 2. Piccolo instructions stored in the cache 6 are passed to an instruction decoder 50 within Piccolo 4, where they are decoded prior to being passed to the Piccolo processor core 54. The Piccolo processor core 54 includes the multiplier/adder circuit 20, the accumulate/decumulate circuit 22 and the scale/saturate circuit 24 discussed earlier with reference to Fig. 3.

If the instruction decoder 50 is handling instructions forming part of an instruction loop identified by a REPEAT instruction, and the REPEAT instruction has indicated that re-mapping of a number of registers should take place, the register re-mapping logic 52 is employed to perform the necessary re-mapping. The register re-mapping logic 52 can be considered as being part of the instruction decoder 50, although it will be apparent to those skilled in the art that the register re-mapping logic 52 may be provided as an entity completely separate from the instruction decoder 50.

An instruction will typically include one or more operands identifying registers containing the data items required by the instruction. For example, a typical instruction may include two source operands and one destination operand, identifying two registers containing the data items required by the instruction and a register into which the result of the instruction should be placed. The register re-mapping logic 52 receives the operands of an instruction from the instruction decoder 50, these operands identifying logical register references. Based on the logical register references, the register re-mapping logic determines whether re-mapping should or should not be applied, and then applies re-mapping to produce the physical register references as required. If it is determined that re-mapping should not be applied, the logical register references are provided as the physical register references. The preferred manner in which the re-mapping is performed will be discussed in more detail later.

Each output physical register reference from the register re-mapping logic is passed to the Piccolo processor core 54, so that the processor core can then apply the instruction to the data item in the particular register 10 identified by the physical register reference.

The re-mapping mechanism of the preferred embodiment allows each bank of registers to be split into two sections, namely a section within which registers may be re-mapped, and a section in which registers retain their original register references without re-mapping. In the preferred embodiment, the re-mapped section starts at the bottom of the register bank being re-mapped.

The re-mapping mechanism employs a number of parameters, and these parameters will be discussed in detail with reference to Fig. 6, which is a block diagram illustrating how the various parameters are used by the register re-mapping logic 52. It should be noted that these parameters are given values that are relative to a point within the bank being re-mapped, this point being, for example, the bottom of the bank.

The register re-mapping logic 52 can be considered as comprising two main logical blocks, namely the Remap block 56 and the Base Update block 58. The register re-mapping logic 52 employs a base pointer that provides an offset value to be added to the logical register reference, this base pointer value being provided to the Remap block 56 by the Base Update block 58.

A BASESTART signal can be used to define the initial value of the base pointer; this will typically be zero, although some other value may be specified. This BASESTART signal is passed to the multiplexer 60 within the Base Update block 58. During the first iteration of the instruction loop, the BASESTART signal is passed by the multiplexer 60 to the storage element 66, whereas for subsequent iterations of the loop, the next base pointer value is supplied by the multiplexer 60 to the storage element 66.

The output of the storage element 66 is passed as the current base pointer value to the Remap logic 56, and is also passed to one of the inputs of an adder 62 within the Base Update logic 58. The adder 62 also receives a BASEINC signal that provides a base increment value. The adder 62 is arranged to increment the current base pointer value supplied by the storage element 66 by the BASEINC value, and to pass the result to the modulo circuit 64.

The modulo circuit also receives a BASEWRAP value, and compares this value with the output base pointer signal from the adder 62. If the incremented base pointer value equals or exceeds the BASEWRAP value, the new base pointer is wrapped round to a new offset value. The output of the modulo circuit 64 is then the next base pointer value to be stored in the storage element 66. This output is provided to the multiplexer 60, and from there to the storage element 66.

However, this next base pointer value cannot be stored in the storage element 66 until a BASEUPDATE signal is received by the storage element 66 from the loop hardware managing the REPEAT instruction. The BASEUPDATE signal is produced periodically by the loop hardware, for example each time the instruction loop is to be repeated. When a BASEUPDATE signal is received by the storage element 66, the storage element overwrites the previous base pointer value with the next base pointer value provided by the multiplexer 60. In this manner, the base pointer value supplied to the Remap logic 56 changes to the new base pointer value.

The physical register to be accessed inside a re-mapped section of a register bank is determined by the addition of a logical register reference contained within an operand of an instruction and the base pointer value provided by the Base Update logic 58. This addition is performed by the adder 68 and the output is passed to the modulo circuit 70. In the preferred embodiment, the modulo circuit 70 also receives a register wrap value, and if the output signal from the adder 68 (the sum of the logical register reference and the base pointer value) exceeds the register wrap value, the result wraps round to the bottom of the re-mapped region. The output of the modulo circuit 70 is then provided to the multiplexer 72.

A REGCOUNT value is provided to logic 74 within the Remap block 56, identifying the number of registers within a bank which are to be re-mapped. The logic 74 compares this REGCOUNT value with the logical register reference, and passes a control signal to the multiplexer 72 dependent on the result of that comparison. The multiplexer 72 receives as its two inputs the logical register reference and the output from the modulo circuit 70 (the re-mapped register reference). In the preferred embodiment of the present invention, if the logical register reference is less than the REGCOUNT value, the logic 74 instructs the multiplexer 72 to output the re-mapped register reference as the physical register reference. If, however, the logical register reference is greater than or equal to the REGCOUNT value, the logic 74 instructs the multiplexer 72 to output the logical register reference directly as the physical register reference.

As previously mentioned, in the preferred embodiment it is the REPEAT instruction which invokes the re-mapping mechanism. As will be discussed in more detail later, REPEAT instructions provide four zero-cycle loops in hardware. These hardware loops are illustrated in Fig. 5 as part of the instruction decoder 50. Each time the instruction decoder 50 requests an instruction from the cache 6, the cache returns that instruction to the instruction decoder, whereupon the instruction decoder determines whether the returned instruction is a REPEAT instruction. If so, one of the hardware loops is configured to handle that REPEAT instruction.

Each REPEAT instruction specifies the number of instructions in the loop and the number of times to go around the loop (which is either a constant or read from a Piccolo register). Two opcodes, REPEAT and NEXT, are provided for defining a hardware loop, the NEXT opcode being used merely as a delimiter and not being assembled as an instruction. The REPEAT goes at the start of the loop, and NEXT delimits the end of the loop, allowing the assembler to calculate the number of instructions in the loop body. In the preferred embodiment, the REPEAT instruction can include re-mapping parameters such as the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters to be employed by the register re-mapping logic 52.

A number of registers can be provided to store the re-mapping parameters used by the register re-mapping logic. Within these registers, a number of sets of predefined re-mapping parameters can be provided, whilst some registers are left for the storage of user-defined re-mapping parameters. If the re-mapping parameters specified with the REPEAT instruction are equal to one of the sets of predefined re-mapping parameters, the appropriate REPEAT encoding is employed, this encoding causing a multiplexer or the like to provide the appropriate re-mapping parameters from the registers directly to the register re-mapping logic. If, on the other hand, the re-mapping parameters differ from all of the sets of predefined re-mapping parameters, the assembler generates a Remapping Parameter Move instruction (RMOV), which allows the user-defined register re-mapping parameters to be configured, the RMOV instruction being followed by the REPEAT instruction. Preferably, the user-defined re-mapping parameters are placed by the RMOV instruction in the registers set aside for storing such user-defined re-mapping parameters, and the multiplexer is then programmed to pass the contents of those registers to the register re-mapping logic.

In the preferred embodiment, the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters take one of the values identified in the following table:

Parameter	Description
REGCOUNT (register count)	This identifies the number of 16-bit registers on which remapping is performed, and may take the values 0, 2, 4, 8. Registers below REGCOUNT are remapped; those at or above REGCOUNT are accessed directly.
BASEINC (base increment)	This defines by how many 16-bit registers the base pointer is incremented at the end of each loop iteration. In the preferred embodiment it may take the values 1, 2 or 4, although it could take other values if desired, including negative values where appropriate.
BASEWRAP (base wrap)	This determines the ceiling of the base calculation. The base wrapping modulus may take the values 2, 4, 8.
REGWRAP (register wrap)	This determines the ceiling of the remap calculation. The register wrapping modulus may take the values 2, 4, 8. REGWRAP may be chosen to be equal to REGCOUNT.

Referring to Fig. 6, an example of how the remap block 56 uses the various parameters is as follows (in this example, the logical and physical register values are relative to the particular bank):

if (Logical Register < REGCOUNT)

Physical Register = (Logical Register + Base) MOD REGCOUNT

else

Physical Register = Logical Register

end if

At the end of the loop, before the next iteration of the loop begins, the base update logic 58 performs the following update to the base pointer:

Base = (Base + BASEINC) MOD BASEWRAP
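The two formulas above can be sketched in Python as follows. This is only an illustrative model of the arithmetic, not the hardware: the function names `remap` and `update_base` are hypothetical, and the sketch follows the remap formula exactly as printed (modulo REGCOUNT).

```python
def remap(logical: int, base: int, regcount: int) -> int:
    """Map a logical 16-bit register number to a physical one within a bank."""
    if logical < regcount:
        return (logical + base) % regcount
    return logical  # registers at or above REGCOUNT are accessed directly


def update_base(base: int, baseinc: int, basewrap: int) -> int:
    """Advance the base pointer at the end of one loop iteration."""
    return (base + baseinc) % basewrap


# Walk four iterations with REGCOUNT=4, BASEINC=1, BASEWRAP=4 and show
# where logical registers 0..5 land on each pass.
base = 0
for iteration in range(4):
    print(iteration, [remap(r, base, regcount=4) for r in range(6)])
    base = update_base(base, baseinc=1, basewrap=4)
```

Logical registers 4 and 5 are left untouched on every pass because they are not below REGCOUNT.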

At the end of a remapping loop, register remapping is switched off and all registers are then accessed as physical registers. In the preferred embodiment, only one remapping REPEAT is active at any one time. Loops may still be nested, but only one loop may update the remapping variables at any particular time. However, if desired, remapping repeats could be nested.

To illustrate the code density benefits achieved by employing the remapping mechanism of the preferred embodiment of the present invention, a typical block filter algorithm is discussed below. The principles of the block filter algorithm are first discussed with reference to Fig. 7. As illustrated in Fig. 7, accumulator register A0 is arranged to accumulate the results of a number of multiplication operations, these being the multiplication of coefficient c0 by data item d0, of coefficient c1 by data item d1, of coefficient c2 by data item d2, and so on. Register A1 accumulates the results of a similar set of multiplications, but this time the set of coefficients has been shifted so that c0 is now multiplied by d1, c1 by d2, c2 by d3, etc. Likewise, register A2 accumulates the results of multiplying the data values by the coefficient values shifted another step to the right, so that c0 is multiplied by d2, c1 by d3, c2 by d4, etc. This shift, multiply and accumulate process is then repeated, with the result being placed in register A3.

If the register remapping of the preferred embodiment of the present invention is not employed, the following instruction loop is required to perform the block filter:

ZERO {A0-A3}
REPEAT Z1                    ; Z1 = (number of coefficients / 4)
; a0 += d0*c0 + d1*c1 + d2*c2 + d3*c3
; a1 += d1*c0 + d2*c1 + d3*c2 + d4*c3
; a2 += d2*c0 + d3*c1 + d4*c2 + d5*c3
; a3 += d3*c0 + d4*c1 + d5*c2 + d6*c3
MULA    A0,X0.l^,Y0.l,A0     ; a0 += d0*c0, then refill with d4
MULA    A1,X0.h,Y0.l,A1      ; a1 += d1*c0
MULA    A2,X1.l,Y0.l,A2      ; a2 += d2*c0
MULA    A3,X1.h,Y0.l^,A3     ; a3 += d3*c0, then refill with c4
MULA    A0,X0.h^,Y0.h,A0     ; a0 += d1*c1, then refill with d5
MULA    A1,X1.l,Y0.h,A1      ; a1 += d2*c1
MULA    A2,X1.h,Y0.h,A2      ; a2 += d3*c1
MULA    A3,X0.l,Y0.h^,A3     ; a3 += d4*c1, then refill with c5
MULA    A0,X1.l^,Y1.l,A0     ; a0 += d2*c2, then refill with d6
MULA    A1,X1.h,Y1.l,A1      ; a1 += d3*c2
MULA    A2,X0.l,Y1.l,A2      ; a2 += d4*c2
MULA    A3,X0.h,Y1.l^,A3     ; a3 += d5*c2, then refill with c6
MULA    A0,X1.h^,Y1.h,A0     ; a0 += d3*c3, then refill with d7
MULA    A1,X0.l,Y1.h,A1      ; a1 += d4*c3
MULA    A2,X0.h,Y1.h,A2      ; a2 += d5*c3
MULA    A3,X1.l,Y1.h^,A3     ; a3 += d6*c3, then refill with c7
NEXT

In this example, the data values are placed in the X bank of registers and the coefficient values are placed in the Y bank of registers. As a first step, the four accumulator registers A0, A1, A2 and A3 are set to zero. Once the accumulator registers have been reset, an instruction loop is entered, delimited by the REPEAT and NEXT instructions. The value Z1 identifies the number of times that the instruction loop should be repeated and, for reasons discussed below, it is equal to the number of coefficients (c0, c1, c2, etc.) divided by 4.

The instruction loop comprises 16 multiply accumulate instructions (MULA) which, after the first pass through the loop, leave in registers A0, A1, A2 and A3 the results of the calculations shown in the comments between the REPEAT and the first MULA instruction. To illustrate how the multiply accumulate instructions operate, consider the first four MULA instructions. The first instruction multiplies the first, or lower, 16 bits of X bank register 0 by the lower 16 bits of Y bank register 0, and adds the result to accumulator register A0. At the same time, the lower 16 bits of X bank register 0 are marked with a refill bit, indicating that this part of the register can now be refilled with a new data value. It is marked in this way because, as can be seen from Fig. 7, once data item d0 has been multiplied by coefficient c0 (as represented by the first MULA instruction), d0 is no longer required by the remaining block filter instructions and can therefore be replaced with a new data value.

The second MULA instruction then multiplies the second, or higher, 16 bits of X bank register 0 by the lower 16 bits of Y bank register 0 (this represents the multiplication d1*c0 shown in Fig. 7). Similarly, the third and fourth MULA instructions represent the multiplications d2*c0 and d3*c0 respectively. As can be seen from Fig. 7, once these four calculations have been performed, coefficient c0 is no longer required, and so register Y0.l is marked with a refill bit so that it can be overwritten with another coefficient (c4).

The next four MULA instructions represent the calculations d1*c1, d2*c1, d3*c1 and d4*c1 respectively. Once the calculation d1*c1 has been performed, register X0.h is marked with a refill bit, since d1 is no longer required. Similarly, once all four instructions have been executed, register Y0.h is marked for refilling, since coefficient c1 is no longer needed. Likewise, the next four MULA instructions correspond to the calculations d2*c2, d3*c2, d4*c2 and d5*c2, while the final four instructions correspond to the calculations d3*c3, d4*c3, d5*c3 and d6*c3.

In the above example, since the registers cannot be remapped, each multiplication has to be written out explicitly with the specific register required named in the operands. Once the 16 MULA instructions have been executed, the instruction loop can be repeated for coefficients c4 to c7 and data items d4 to d10. Since each iteration of the loop operates on four coefficient values, the number of coefficient values must be a multiple of four, and Z1 = number of coefficients / 4 must be calculated.
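The arithmetic performed by the unrolled loop can be sketched in Python as below. This is a minimal model of the four accumulations only; the function name `block_filter` and the list-based inputs are assumptions made for illustration, and the refill mechanism is represented simply by indexing further into the lists.

```python
def block_filter(data, coeffs):
    """Return [a0, a1, a2, a3] where a_k = sum over j of coeffs[j] * data[j + k]."""
    assert len(coeffs) % 4 == 0, "the loop body consumes four coefficients per pass"
    assert len(data) >= len(coeffs) + 3, "a3 reads three items past the last coefficient"
    acc = [0, 0, 0, 0]                               # ZERO {A0-A3}
    for block in range(len(coeffs) // 4):            # REPEAT Z1, Z1 = len(coeffs) / 4
        for j in range(4 * block, 4 * block + 4):    # one coefficient per group of 4 MULAs
            for k in range(4):                       # accumulators A0..A3
                acc[k] += data[j + k] * coeffs[j]
    return acc


print(block_filter([1, 2, 3, 4, 5, 6, 7], [1, 1, 1, 1]))
```

With four coefficients the inner two loops correspond one-to-one to the 16 MULA instructions of the listing.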

By employing the remapping mechanism of the preferred embodiment of the present invention, the instruction loop can be reduced dramatically, so that it contains only 4 multiply accumulate instructions rather than the 16 otherwise required. Using the remapping mechanism, the code can be written as follows:

ZERO {A0-A3}
REPEAT Z1, X++ n4 w4 r4, Y++ n4 w4 r4    ; Z1 = (number of coefficients)
MULA    A0,X0.l^,Y0.l,A0     ; a0 += d0*c0, then refill with d4
MULA    A1,X0.h,Y0.l,A1      ; a1 += d1*c0
MULA    A2,X1.l,Y0.l,A2      ; a2 += d2*c0
MULA    A3,X1.h,Y0.l^,A3     ; a3 += d3*c0, then refill with c4
NEXT

As before, the first step is to set the four accumulator registers A0-A3 to zero. The instruction loop delimited by the REPEAT and NEXT opcodes is then entered. The repeat instruction has a number of parameters associated with it, as follows:

X++: indicates that BASEINC is 1 for the X bank of registers.

n4: indicates that REGCOUNT is 4, and hence that the first four X bank registers X0.l to X1.h are to be remapped.

w4: indicates that BASEWRAP is 4 for the X bank of registers.

r4: indicates that REGWRAP is 4 for the X bank of registers.

Y++: indicates that BASEINC is 1 for the Y bank of registers.

n4: indicates that REGCOUNT is 4, and hence that the first four Y bank registers Y0.l to Y1.h are to be remapped.

w4: indicates that BASEWRAP is 4 for the Y bank of registers.

r4: indicates that REGWRAP is 4 for the Y bank of registers.

It should also be noted that the value Z1 is now equal to the number of coefficients, rather than to the number of coefficients / 4 as in the prior art example.

For the first iteration of the instruction loop, the base pointer value is 0, and so there is no remapping. However, the next time the loop is executed, the base pointer value will be 1 for both the X and Y banks, and so the operands are remapped as follows:

X0.l becomes X0.h

X0.h becomes X1.l

X1.l becomes X1.h

X1.h becomes X0.l (since BASEWRAP is 4)

Y0.l becomes Y0.h

Y0.h becomes Y1.l

Y1.l becomes Y1.h

Y1.h becomes Y0.l (since BASEWRAP is 4)

Hence it can be seen that, on the second iteration, the four MULA instructions actually perform the calculations indicated by the fifth to eighth MULA instructions in the earlier example that does not use the remapping of the present invention. Similarly, the third and fourth iterations through the loop perform the calculations formerly performed by the ninth to twelfth and the thirteenth to sixteenth MULA instructions of the prior art code.

It can therefore be seen that the above code performs exactly the same block filter algorithm as the prior art code, but improves code density within the loop body by a factor of four, since only 4 instructions need to be provided rather than the 16 required by the prior art.
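The equivalence between the four-instruction loop and the unrolled sequence can be checked with a short Python sketch. It assumes the remap formula given earlier with REGCOUNT = 4 and a base pointer that advances by 1 modulo 4 on each pass; the helper name `remapped_operands` and the string encoding of register halves are hypothetical.

```python
HALVES = ["0.l", "0.h", "1.l", "1.h"]   # 16-bit halves of registers 0 and 1 of a bank


def remapped_operands(base, regcount=4):
    """Operands actually read by the 4-instruction loop body for a given base pointer."""
    ops = []
    for acc in range(4):                       # body line: MULA A<acc>, X<acc-th half>, Y0.l, A<acc>
        x_phys = (acc + base) % regcount       # logical X half number equals acc in the loop body
        y_phys = (0 + base) % regcount         # the body always names logical Y0.l
        ops.append((f"A{acc}", "X" + HALVES[x_phys], "Y" + HALVES[y_phys]))
    return ops


base = 0
for iteration in range(4):
    print(iteration, remapped_operands(base))
    base = (base + 1) % 4                      # X++ / Y++ with a base wrap of 4
```

The second pass (base 1) prints the register operands of the fifth to eighth MULA instructions of the unrolled listing, and the later passes follow the same pattern.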

By employing the register remapping technique of the preferred embodiment of the present invention, the following benefits can be realised:

1. improved code density;

2. in certain situations, hiding the latency between a register being marked as empty and that register being refilled by Piccolo's reorder buffer. This could otherwise be achieved by unrolling loops, at the cost of increased code size;

3. a variable number of registers can be accessed: by varying the number of loop iterations performed, the number of registers accessed can be varied; and

4. easier algorithm development. For suitable algorithms, the programmer can write a piece of code for the nth stage of the algorithm and then use register remapping to apply the formula to a sliding set of data.

It will be apparent that certain changes can be made to the register remapping mechanism described above without departing from the scope of the present invention. For example, the bank of registers 10 could provide more physical registers than the programmer can specify in an instruction operand. Although these extra registers cannot be accessed directly, the register remapping mechanism can make use of them. For example, consider the X bank of registers discussed earlier, which has four 32-bit registers available to the programmer, so that eight 16-bit registers can be specified by logical register references. The X bank could actually consist of, for example, six 32-bit registers, in which case there would be four additional 16-bit registers not directly accessible to the programmer. These four extra registers could nevertheless be used by the remapping mechanism, thereby providing additional registers for the storage of data items.

The following assembler syntax is used:

>> denotes a logical shift right or, if the shift operand is negative, a shift left (see <lscale> below).

->> denotes an arithmetic shift right or, if the shift operand is negative, a shift left (see <scale> below).

ROR denotes rotate right.

SAT(a) denotes the saturated value of a (saturated to 16 or 32 bits, depending on the size of the destination register). Specifically, to saturate to 16 bits, any value greater than +0x7fff is replaced by +0x7fff, and any value less than -0x8000 is replaced by -0x8000. Saturation to 32 bits is similar, with limits +0x7fffffff and -0x80000000. If the destination register is 48 bits, saturation is still at 32 bits.
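The SAT() rule just described can be sketched in Python as follows. The function name `sat` and the `dest_bits` parameter are illustrative choices, not part of the assembler syntax.

```python
def sat(value: int, dest_bits: int) -> int:
    """Saturate a signed value for a 16-, 32- or 48-bit destination register."""
    width = 16 if dest_bits == 16 else 32   # 48-bit destinations still saturate at 32 bits
    hi = (1 << (width - 1)) - 1             # +0x7fff or +0x7fffffff
    lo = -(1 << (width - 1))                # -0x8000 or -0x80000000
    return max(lo, min(hi, value))


print(hex(sat(0x12345, 16)), hex(sat(1 << 40, 48)), sat(-5, 32))
```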

Source operand 1 can take one of the following forms:

<src1> is shorthand for [Rn|Rn.l|Rn.h|Rn.x][^]. In other words, all 7 bits of the source specifier are valid, and the register is read as a 32-bit value (optionally swapped) or as a sign-extended 16-bit value. For an accumulator, only the bottom 32 bits are read. The ^ specifies register refill.

<src1_16> is shorthand for [Rn.l|Rn.h][^]. Only 16-bit values can be read.

<src1_32> is shorthand for [Rn|Rn.x][^]. Only 32-bit values can be read, with the upper and lower halves optionally swapped.

Source operand 2 (<src2>) can take one of the following forms:

<src2> is shorthand for three options:

- a source register of the form [Rn|Rn.l|Rn.h|Rn.x][^], plus a scale (<scale>) of the final result.

- an optionally shifted 8-bit constant (<immed_8>), but no scale of the final result.

- a 6-bit constant (<immed_6>), plus a scale (<scale>) of the final result.

<src2_maxmin> is the same as <src2> but a scale is not permitted.

<src2_shift> provides, for shift instructions, a limited subset of <src2>. See above for details.

<src2_par> is as for <src2_shift>.

For instructions which specify a third operand:

<acc> is shorthand for any of the four accumulator registers [A0|A1|A2|A3]. All 48 bits are read. A refill cannot be specified.

The destination register has the form:

<dest>, which is shorthand for [Rn|Rn.l|Rn.h|.l|][^]. With no "." extension the whole register is written (48 bits in the case of an accumulator). Where no write back to the register is required, the register used is unimportant. The assembler supports the omission of a destination register to indicate that write back is not required, or ".l" to indicate that no write back is required but that the flags should be set as though the result were a 16-bit quantity. ^ denotes that the value is written to the output FIFO.

<scale> represents a number of arithmetic scales. There are fourteen available scales:

ASR #0, 1, 2, 3, 4, 6, 8, 10

ASR #12 to 16

LSL #1

<immed_8> represents an unsigned 8-bit immediate value. This consists of a byte rotated left by 0, 8, 16 or 24 bits. Hence the values 0xYZ000000, 0x00YZ0000, 0x0000YZ00 and 0x000000YZ can be encoded for any YZ. The rotate is encoded as a 2-bit quantity.

<imm_6> represents an unsigned 6-bit immediate value.
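The set of constants reachable through <immed_8> can be illustrated with a small Python sketch. The mapping of the 2-bit rotate field to a shift of 8 times its value is an assumption made here for illustration (the text only states that the rotate is a 2-bit quantity), and the function names are hypothetical.

```python
def decode_immed8(byte: int, rot: int) -> int:
    """Expand an 8-bit constant and a 2-bit rotate field into a 32-bit value."""
    assert 0 <= byte <= 0xFF and 0 <= rot <= 3
    return (byte << (8 * rot)) & 0xFFFFFFFF   # assumed: field value r means rotate left by 8*r


def encode_immed8(value: int):
    """Return (byte, rot) if value is one of the encodable constants, else None."""
    for rot in range(4):
        byte = (value >> (8 * rot)) & 0xFF
        if byte << (8 * rot) == value:
            return byte, rot
    return None


print(hex(decode_immed8(0xAB, 3)), encode_immed8(0x00AB0000), encode_immed8(0x1234))
```

A value such as 0x1234 spans two bytes and so cannot be expressed as a single rotated byte.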

<PARAMS> is used to specify register remapping and has the following format:

<BANK> can be [X|Y|Z]

<BASEINC> can be [++|+1|+2|+4]

<RENUMBER> can be [0|2|4|8]

<BASEWRAP> can be [2|4|8]

The expression <cond> is any one of the following condition codes. Note that the encoding differs slightly from ARM, because the unsigned LS and HI codes have been replaced by more useful signed overflow/underflow tests. The V and N flags are set differently on Piccolo than on ARM, so the translation from condition testing to flag checking also differs from ARM.

Code	Mnemonic	Flags	Meaning
0000	EQ	Z=0	Last result was zero.
0001	NE	Z=1	Last result was non-zero.
0010	CS	C=1	Used after a shift/MAX operation.
0011	CC	C=0
0100	MI/LT	N=1	Last result was negative.
0101	PL/GE	N=0	Last result was positive.
0110	VS	V=1	Signed overflow/saturation on last result.
0111	VC	V=0	No overflow/saturation on last result.
1000	VP	V=1&N=0	Last result overflowed positive.
1001	VN	V=1&N=1	Last result overflowed negative.
1010	reserved
1011	reserved
1100	GT	N=0&Z=0
1101	LE	N=1|Z=1
1110	AL
1111	reserved

Since Piccolo deals with signed quantities, the unsigned LS and HI conditions have been dropped and replaced by VP and VN, which describe the direction of any overflow. Since the ALU result is 48 bits wide, MI and LT now perform the same function, as do PL and GE. This leaves 3 slots free for future expansion.

Unless otherwise indicated, all operations are signed.

The primary and secondary condition codes each consist of:

N - negative.

Z - zero.

C - carry/unsigned overflow.

V - signed overflow.

Arithmetic instructions can be divided into two types: parallel and "full width". The "full width" instructions set only the primary flags, whereas the parallel operators set the primary and secondary flags according to the upper and lower 16-bit halves of the result.

The N, Z and V flags are calculated from the full ALU result, after the scale has been applied but before the result is written to the destination. An ASR always reduces the number of bits required to store the result, whereas an ASL increases it. To deal with this, Piccolo truncates the 48-bit result when an ASL scale is applied, so as to limit the number of bits over which zero detection and overflow detection must be carried out.

The N flag is calculated on the assumption that signed arithmetic is being carried out. This is because, when overflow occurs, the most significant bit of the result is either the C flag or the N flag, depending on whether the input operands are signed or unsigned.

The V flag indicates whether any loss of precision occurs as a result of writing the result to the selected destination. If no write back is selected, a "size" is still implied and the overflow flag is set correctly. Overflow can occur when:

- writing to a 16-bit register when the result is not in the range -2^15 to 2^15-1.

- writing to a 32-bit register when the result is not in the range -2^31 to 2^31-1.

Parallel add/subtract instructions set the N, Z and V flags independently on the upper and lower halves of the result.

When writing to an accumulator, the V flag is set as if writing to a 32-bit register. This allows saturating instructions to use accumulators as 32-bit registers.

The saturating absolute instruction (SABS) also sets the overflow flag when the absolute value of the input operand does not fit in the designated destination.

The carry flag is set by add and subtract instructions and is used as a "binary" flag by the MAX/MIN, SABS and CLB instructions. All other instructions, including multiply operations, preserve the carry flag.

For add and subtract operations, the carry is that generated by bit 31 or bit 15 of the result, depending on whether the destination is 32 or 16 bits wide.

The standard arithmetic instructions can be divided into a number of types according to how the flags are set:

In the case of add and subtract instructions, if the N bit is set then all flags are preserved. If the N bit is not set, the flags are updated as follows:

Z is set if the full 48-bit result is 0.

N is set if bit 47 of the full 48-bit result is set (the result is negative).

V is set if either of the following conditions holds:

The destination register is 16 bits and the signed result will not fit into a 16-bit register (it is not in the range -2^15 <= x < 2^15).

The destination register is a 32/48-bit register and the signed result will not fit into 32 bits.

If <dest> is a 32 or 48-bit register, the C flag is set if there is a carry out of bit 31 when summing <src1> and <src2>, or if no borrow occurs from bit 31 when subtracting <src2> from <src1> (the same carry value as would be expected on the ARM). If <dest> is a 16-bit register, the C flag is set if there is a carry out of bit 15 of the sum.

The secondary flags (SZ, SN, SV, SC) are preserved.
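The flag rules for an unscaled add can be modelled with the Python sketch below. It is a simplified illustration rather than a description of the hardware: the function name `add_flags` is hypothetical, the operands are assumed to be signed values that fit in 32 bits, and the carry for a 16-bit destination is taken from bit 15, following the statement that carry is generated by bit 31 or bit 15 according to the destination width.

```python
def add_flags(src1: int, src2: int, dest_bits: int):
    """Return (N, Z, V, C) for dest = src1 + src2 with no scale applied."""
    result = src1 + src2                      # exact sum; stands in for the 48-bit ALU result
    width = 16 if dest_bits == 16 else 32     # 48-bit destinations are checked as 32-bit
    n = result < 0                            # corresponds to bit 47 of the 48-bit result
    z = result == 0
    v = not (-(1 << (width - 1)) <= result < (1 << (width - 1)))
    mask = (1 << width) - 1
    c = ((src1 & mask) + (src2 & mask)) >> width == 1   # carry out of bit 31 or bit 15
    return n, z, v, c


print(add_flags(0x7FFF, 1, 16))   # result does not fit in 16 bits, so V is set
print(add_flags(-1, 1, 32))       # zero result with a carry out of bit 31
```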

In the case of instructions which perform a multiplication or accumulate from a 48-bit register:

Z is set if the full 48-bit result is 0.

N is set if bit 47 of the full 48-bit result is set (the result is negative).

V is set if either (1) the destination register is 16 bits and the signed result will not fit into a 16-bit register (it is not in the range -2^15 <= x < 2^15), or (2) the destination register is a 32/48-bit register and the signed result will not fit into 32 bits.

C is preserved.

The secondary flags (SZ, SN, SV, SC) are preserved.

Other instructions, including logical operations, parallel adds and subtracts, max and min, shifts and so on, are discussed below.

The add and subtract instructions add or subtract two registers, scale the result and then store it back to a register. The operands are treated as signed values. For the non-saturating variants, flag updating is optional and can be suppressed by appending N to the end of the instruction.

[Instruction encoding, bits 31 to 0; fields in order: 0, OPC, F, SD, DEST, S1, R1, SRC1, SRC2]

OPC specifies the type of instruction. Operation (OPC):

100N0    dest=(src1+src2){->>scale}{,N}
110N0    dest=(src1-src2){->>scale}{,N}
10001    dest=SAT((src1+src2){->>scale})
11001    dest=SAT((src1-src2){->>scale})
01110    dest=(src2-src1){->>scale}
01111    dest=SAT((src2-src1){->>scale})
101N0    dest=(src1+src2+Carry){->>scale}{,N}
111N0    dest=(src1-src2+Carry-1){->>scale}{,N}

Mnemonics:

100N0    ADD{N}  <dest>,<src1>,<src2>{,<scale>}
110N0    SUB{N}  <dest>,<src1>,<src2>{,<scale>}
10001    SADD    <dest>,<src1>,<src2>{,<scale>}
11001    SSUB    <dest>,<src1>,<src2>{,<scale>}
01110    RSB     <dest>,<src1>,<src2>{,<scale>}
01111    SRSB    <dest>,<src1>,<src2>{,<scale>}
101N0    ADC{N}  <dest>,<src1>,<src2>{,<scale>}
111N0    SBC{N}  <dest>,<src1>,<src2>{,<scale>}

The assembler supports the following opcodes:

    CMP  <src1>,<src2>

    CMN  <src1>,<src2>

CMP is a subtract which sets the flags with the register write disabled. CMN is an add which sets the flags with the register write disabled.

Flags: these have been discussed above.

Reasons for inclusion:

ADC is useful for inserting carry into the bottom of a register following a shift/MAX/MIN operation. It is also used to perform a 32/32-bit divide. It also provides for extended precision adds. The addition of the N bit gives finer control of the flags, in particular the carry. This enables a 32/32-bit division at 2 cycles per bit.

Saturated adds and subtracts are needed for G.729 and the like.

Incrementing/decrementing counters. RSB is useful for calculating shifts (x=32-x is a common operation). A saturated RSB is needed for saturated negation (used in G.729).

The add/subtract accumulate instructions perform addition and subtraction with accumulation and scaling/saturation. Unlike the multiply accumulate instructions, the accumulator number cannot be specified independently of the destination register. The bottom two bits of the destination register give the number, acc, of the 48-bit accumulator to accumulate into. Hence ADDA X0,X1,X2,A0 and ADDA A3,X1,X2,A3 are valid, whereas ADDA X1,X1,X2,A0 is not. With this class of instruction the result must be written back to a register; the no-write-back encodings of the destination field are not allowed.

OPC specifies the type of instruction. In the following, acc is (DEST[1:0]). The Sa bit indicates saturation.

Operation (OPC):

    0 dest={SAT}(acc+(src1+src2)){->>scale}

    1 dest={SAT}(acc+(src1-src2)){->>scale}

Mnemonics:

    0    {S}ADDA  <dest>,<src1>,<src2>,<acc>{,<scale>}

    1    {S}SUBA  <dest>,<src1>,<src2>,<acc>{,<scale>}

An S before the command indicates saturation.

Flags: see above.

Reasons for inclusion:

The ADDA (add accumulate) instruction is useful for summing two words of an array of integers with an accumulator per cycle (for instance, to find their average). The SUBA (subtract accumulate) instruction is useful in calculating the sum of differences (for correlation); it subtracts two separate values and adds the difference to a third register.

Addition with rounding can be done by using a <dest> different from <acc>. For example, X0=(X1+X2+16384)>>15 can be done in one cycle by keeping 16384 in A0. Addition of a constant with rounding can be done with ADDA X0,X1,#16384,A0.
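The rounding trick can be illustrated with a small Python sketch. It models only the non-saturating form of ADDA on plain integers; the function name `adda` is hypothetical, and Python's `>>` is used as the arithmetic shift right.

```python
def adda(src1: int, src2: int, acc: int, scale: int = 0) -> int:
    """dest = (acc + (src1 + src2)) ->> scale, without saturation."""
    return (acc + src1 + src2) >> scale   # >> on Python ints is an arithmetic shift


A0 = 16384                                # rounding constant (half of 1 << 15) kept in the accumulator
x0 = adda(20000, 12768, A0, scale=15)     # (20000 + 12768 + 16384) >> 15
print(x0)
```

Because the rounding constant lives in the accumulator, the add, the rounding and the scale all happen in the one instruction.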

For a bit-exact implementation of the sum of ((a_i*b_j)>>k) (quite common, for example in TrueSpeech):

The standard Piccolo code would be:

    MUL t1,a_0,b_0,ASR#k

    ADD ans,ans,t1

    MUL t2,a_1,b_1,ASR#k

    ADD ans,ans,t2

There are two problems with this code: it is too long, and the adds are not performed to 48-bit precision, so guard bits cannot be used. A better solution is to use ADDA:

    MUL t1,a_0,b_0,ASR#k

    MUL t2,a_1,b_1,ASR#k

    ADDA ans,t1,t2,ans

This gives a 25% speed increase and retains 48-bit accuracy.

The parallel add/subtract instructions perform addition and subtraction on two signed 16-bit quantities held in pairs in 32-bit registers. The primary condition code flags are set from the result of the most significant 16 bits, and the secondary flags are updated from the least significant half. Only 32-bit registers can be specified as the source for these instructions, although the values can be halfword swapped. The individual halves of each register are treated as signed values. The calculations and scaling are done with no loss of precision. Hence ADDADD X0,X1,X2,ASR#1 will produce the correct averages in the upper and lower halves of X0. Optional saturation is provided for each instruction, for which the Sa bit must be set.

[Instruction encoding, bits 31 to 0; fields in order: 0, OPC, Sa, F, SD, DEST, S1, R1, SRC1, SRC2]

OPC defines the operation.

Operation (OPC):

    000     dest.h=(src1.h+src2.h){->>scale},

            dest.l=(src1.l+src2.l){->>scale}

    001     dest.h=(src1.h+src2.h){->>scale},

            dest.l=(src1.l-src2.l){->>scale}

    100     dest.h=(src1.h-src2.h){->>scale},

            dest.l=(src1.l+src2.l){->>scale}

    101     dest.h=(src1.h-src2.h){->>scale},

            dest.l=(src1.l-src2.l){->>scale}

If the Sa bit is set, each sum/difference is independently saturated. Mnemonics:

    000    {S}ADDADD    <dest>,<src1_32>,<src2_32>{,<scale>}

    001    {S}ADDSUB    <dest>,<src1_32>,<src2_32>{,<scale>}

    100    {S}SUBADD    <dest>,<src1_32>,<src2_32>{,<scale>}

    101    {S}SUBSUB    <dest>,<src1_32>,<src2_32>{,<scale>}

An S before the command indicates saturation. The assembler also supports:

    CMNCMN   <dest>,<src1_32>,<src2_32>{,<scale>}

    CMNCMP   <dest>,<src1_32>,<src2_32>{,<scale>}

    CMPCMN   <dest>,<src1_32>,<src2_32>{,<scale>}

    CMPCMP   <dest>,<src1_32>,<src2_32>{,<scale>}

These are generated from the standard instructions with no write back.

Flags:

C is set if there is a carry out of bit 15 when adding the two upper 16-bit halves.

Z is set if the sum of the upper 16-bit halves is 0.

N is set if the sum of the upper 16-bit halves is negative.

V is set if the signed 17-bit sum of the upper 16-bit halves will not fit into 16 bits (after scaling).

SZ, SN, SV and SC are set similarly for the lower 16-bit halves.

Reasons for inclusion:

The parallel add and subtract instructions are useful for performing operations on complex numbers held in a single 32-bit register. They are used in the FFT (Fast Fourier Transform) kernel. They are also useful for simple addition/subtraction of vectors of 16-bit data, allowing two elements to be processed per cycle.
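The behaviour of ADDADD on packed halves, including the averaging use of ASR#1, can be sketched in Python as follows. The helper names `s16`, `pack` and `addadd` are hypothetical, flags are not modelled, and saturation is applied after scaling as an assumption of this sketch.

```python
def s16(x: int) -> int:
    """Interpret the low 16 bits of x as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pack(hi: int, lo: int) -> int:
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def addadd(src1: int, src2: int, scale: int = 0, saturate: bool = False) -> int:
    """dest.h = (src1.h + src2.h) ->> scale, dest.l = (src1.l + src2.l) ->> scale."""
    halves = []
    for shift in (16, 0):                               # upper half first, then lower half
        total = (s16(src1 >> shift) + s16(src2 >> shift)) >> scale   # 17-bit sum kept exactly
        if saturate:
            total = max(-0x8000, min(0x7FFF, total))    # assumed: saturate after scaling
        halves.append(total)
    return pack(halves[0], halves[1])


avg = addadd(pack(30000, -4), pack(30000, -6), scale=1)   # like ADDADD X0,X1,X2,ASR#1
print(s16(avg >> 16), s16(avg))
```

The upper-half sum of 60000 would overflow 16 bits on its own, but scaling the exact 17-bit sum still yields the correct average.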

The branch (conditional) instruction allows conditional changes in control flow. Piccolo may take three cycles to execute a taken branch.

[Instruction encoding, bits 31 to 0; fields in order: 0, 11111, 100, 000, IMMEDIATE_16, COND]

Operation:

Branch by the offset if <cond> holds according to the primary flags.

The offset is a signed 16-bit number of words. At present the range of the offset is restricted to -32768 to +32767 words.

The address calculation performed is:

Target address = branch instruction address + 4 + offset

Mnemonic:

B<cond> <destination_label>

Flags: unaffected.

Reasons for inclusion:

Highly useful in most routines.

The conditional add or subtract instruction conditionally adds src2 to src1 or subtracts src2 from src1.

[Instruction encoding, bits 31 to 0; fields in order: 1, 0010, OPC, F, SD, DEST, S1, R1, SRC1, SRC2]

OPC specifies the type of instruction.

Operation (OPC):

    0    if (carry set) temp=src1-src2 else temp=src1+src2

         dest=temp{->>scale}

    1    if (carry set) temp=src1-src2 else temp=src1+src2

         dest=temp{->>scale}, but if the scale is a shift left then the new value of carry (from src1-src2 or src1+src2) is shifted into the bottom.

Mnemonics:

    0    CAS  <dest>,<src1>,<src2>{,<scale>}

    1    CASC <dest>,<src1>,<src2>{,<scale>}

Flags: see above.

Reasons for inclusion:

The conditional add or subtract instruction enables efficient divide code to be constructed.

Example 1: divide the 32-bit unsigned value in X0 by the 16-bit unsigned value in X1 (assuming X0 < (X1 << 16) and X1.h = 0).

LSL X1, X1, #15       ; shift up the divisor

SUB X1, X1, #0        ; set the carry flag

REPEAT #16

CASC X0,X0,X1,LSL#1
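The idea behind Example 1 is non-restoring division: each step subtracts or adds the shifted divisor depending on the previous carry, then shifts the new carry in as a quotient bit. The Python sketch below models this on unbounded integers, so it deliberately ignores register width and wraparound; the function name `divide_32_by_16`, the treatment of carry as "result is non-negative", and the final remainder correction are assumptions of this sketch rather than details given in the text.

```python
def divide_32_by_16(x0: int, x1: int):
    """Return (quotient, remainder) of x0 / x1 using 16 CASC-style steps.

    Requires 0 < x1 < 2**16 and 0 <= x0 < (x1 << 16).
    """
    assert 0 < x1 < (1 << 16) and 0 <= x0 < (x1 << 16)
    d = x1 << 15                             # LSL X1, X1, #15
    carry = 1                                # SUB X1, X1, #0 leaves the carry flag set
    r = x0
    for _ in range(16):                      # REPEAT #16
        temp = r - d if carry else r + d     # conditional subtract or add
        carry = 1 if temp >= 0 else 0        # modelled as: carry set when the result is non-negative
        r = (temp << 1) | carry              # LSL #1 with the new carry shifted into bit 0
    quotient = r & 0xFFFF                    # the 16 quotient bits collect in the low half
    remainder = r >> 16
    if remainder < 0:                        # non-restoring division may finish one step short
        remainder += x1
    return quotient, remainder


print(divide_32_by_16(7, 2), divide_32_by_16(1000000, 300))
```

The quotient bits never disturb the comparison because the shifted divisor has zeros in every bit position they occupy.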

Example 2: divide the 32-bit positive value in X0 by the 32-bit positive value in X1, with early termination.

MOV X2, #0            ; clear the quotient

LOG Z0, X0            ; number of bits X0 can be shifted

LOG Z1, X1            ; number of bits X1 can be shifted

SUBS Z0, Z1, Z1       ; shift X1 up so that the leading 1s match

BLT div_end           ; X1 > X0, so the answer is 0

LSL X1, X1, Z0        ; match the leading 1s

ADD Z0, Z0, #1        ; number of tests to perform

SUBS Z0, Z0, #0       ; set the carry

REPEAT Z0

CAS X0,X0,X1,LSL#1

ADCN X2,X2,X2

At the end, X2 holds the quotient and the remainder can be recovered from X0.

The count leading bits instruction allows data to be normalised.

[Instruction encoding, bits 31 to 0; fields in order: 011011, F, SD, DEST, S1, R1, SRC1, 101110000000]

Operation:

dest is set to the number of places by which the value in src1 must be shifted left in order for bit 31 to differ from bit 30. This is a value in the range 0-30, except in the special cases where src1 is either -1 or 0, in which case 31 is returned.

Mnemonic:

CLB <dest>,<src1>

Flags:

Z is set if the result is 0.

N is cleared.

C is set if src1 is either -1 or 0.

V is preserved.

Reasons for inclusion:

A step needed for normalisation.
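The CLB operation described above can be sketched in Python as follows. The function name `clb` is illustrative, src1 is treated as a 32-bit two's-complement value, and flags are not modelled.

```python
def clb(src1: int) -> int:
    """Count how far a 32-bit value must be shifted left before bit 31 differs from bit 30."""
    value = src1 & 0xFFFFFFFF
    if value in (0, 0xFFFFFFFF):             # src1 == 0 or src1 == -1: special case
        return 31
    shift = 0
    while ((value >> 31) & 1) == ((value >> 30) & 1):
        value = (value << 1) & 0xFFFFFFFF    # shift left within 32 bits
        shift += 1
    return shift


print(clb(1), clb(-2), clb(0x40000000), clb(0))
```

Shifting a value left by the returned count places its most significant information bit directly below the sign bit, which is the normalised form.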

Halt and breakpoint instructions are provided for stopping Piccolo execution.

[Instruction encoding, bits 31 to 0; fields in order: 1, 11111, 11, OP, 00000000000000000000000]

OPC specifies the type of instruction.

Operation (OPC):

0 Piccolo execution is stopped and the Halt bit is set in the Piccolo status register.

1 Piccolo execution is stopped, the Break bit is set in the Piccolo status register, and the ARM is interrupted to report that a breakpoint has been reached.

Mnemonics:

0 HALT

1 BREAK

Flags: unaffected.

Logic Operation instructions perform a logical operation on a 32-bit or 16-bit register. The operands are treated as unsigned values.

Encoding (fields from bit 31 down to bit 0):

1 | 000 | OPC | F | SD | DEST | S1 | R1 | SRC1 | SRC2

OPC encodes the logical operation to be performed.

Operation (OPC):

    00    dest = (src1 & src2) {->> scale}

    01    dest = (src1 | src2) {->> scale}

    10    dest = (src1 & ~src2) {->> scale}

    11    dest = (src1 ^ src2) {->> scale}

Mnemonics:

    00    AND  <dest>, <src1>, <src2> {,<scale>}

    01    ORR  <dest>, <src1>, <src2> {,<scale>}

    10    BIC  <dest>, <src1>, <src2> {,<scale>}

    11    EOR  <dest>, <src1>, <src2> {,<scale>}

The assembler supports the following opcodes:

    TST  <src1>, <src2>

    TEQ  <src1>, <src2>

TST is an AND with the register write disabled. TEQ is an EOR with the register write disabled.

Flags:

Z is set if the result is all zeros.

N, C, V are preserved.

SZ, SN, SC, SV are preserved.

Reason for inclusion:

Speech compression algorithms use packed bit fields to encode information. Bit-masking instructions help to extract and pack these fields.
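As an illustration of the bit-field use mentioned above, the four operations can be sketched in Python. The function names are hypothetical, BIC is taken to mean AND with the complement of src2 (as on the ARM), and the optional scale is modelled as a plain right shift of a non-negative 32-bit value; all of these are assumptions of the sketch.

```python
MASK32 = 0xFFFFFFFF


def logic_and(src1, src2, scale=0):
    """AND: keep only the bits selected by src2, then scale."""
    return ((src1 & src2) & MASK32) >> scale


def logic_orr(src1, src2, scale=0):
    """ORR: merge the bits of src2 into src1, then scale."""
    return ((src1 | src2) & MASK32) >> scale


def logic_bic(src1, src2, scale=0):
    """BIC: clear the bits of src1 that are set in src2, then scale."""
    return ((src1 & ~src2) & MASK32) >> scale


def logic_eor(src1, src2, scale=0):
    """EOR: exclusive OR of src1 and src2, then scale."""
    return ((src1 ^ src2) & MASK32) >> scale


# Extract the 4-bit field held in bits 7..4 of a packed word.
field = logic_and(0xABCD, 0x00F0, scale=4)

# Replace that field with the value 5: clear it, then merge the new bits.
repacked = logic_orr(logic_bic(0xABCD, 0x00F0), 5 << 4)
```

A single AND with a scale therefore extracts a field, and a BIC followed by an ORR repacks it.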

Max and Min Operation instructions perform maximum and minimum operations.

Encoding (fields from bit 31 down to bit 0):

0 | 101 | OPC | 1 | F | SD | DEST | S1 | R1 | SRC1 | SRC2

OPC specifies the type of instruction.

Operation (OPC):

    0    dest = (src1 <= src2) ? src1 : src2
    1    dest = (src1 > src2) ? src1 : src2

Mnemonics:

    0    MIN <dest>, <src1>, <src2>
    1    MAX <dest>, <src1>, <src2>

Flags:

Z is set if the result is zero.

N is set if the result is negative.

C For Max: set if src2 >= src1 (dest = src1 case).

For Min: set if src2 >= src1 (dest = src2 case).

V is preserved.

Reason for inclusion:

In order to find the strength of a signal, many algorithms scan a sample to find the maximum or minimum of the absolute value of the samples. MAX and MIN are invaluable for this. Depending on whether the first or the last maximum in the signal is to be found, the operands src1 and src2 can be swapped.

MAX X0, X0, #0 converts X0 to a positive number, clipping from below.

MIN X0, X0, #255 clips X0 from above. This is useful for graphics processing.
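The two clipping examples can be combined into a clamp to the range 0 to 255. The Python sketch below mirrors the operation definitions given for MIN and MAX; the function names are hypothetical and chosen only for this illustration.

```python
def piccolo_min(src1, src2):
    """MIN: dest = (src1 <= src2) ? src1 : src2."""
    return src1 if src1 <= src2 else src2


def piccolo_max(src1, src2):
    """MAX: dest = (src1 > src2) ? src1 : src2."""
    return src1 if src1 > src2 else src2


def clamp_to_byte(x0):
    """Model MAX X0, X0, #0 followed by MIN X0, X0, #255."""
    x0 = piccolo_max(x0, 0)      # clip from below
    x0 = piccolo_min(x0, 255)    # clip from above
    return x0
```

Two instructions therefore suffice to bring a signed value into the range of an 8-bit pixel component.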

Max and Min Operations in Parallel instructions perform maximum and minimum operations on parallel 16-bit data.

OPC specifies the type of instruction.

Operation (OPC):

    0    dest.l = (src1.l <= src2.l) ? src1.l : src2.l

         dest.h = (src1.h <= src2.h) ? src1.h : src2.h

    1    dest.l = (src1.l > src2.l) ? src1.l : src2.l

         dest.h = (src1.h > src2.h) ? src1.h : src2.h

Mnemonics:

    0    MINMIN    <dest>, <src1>, <src2>
    1    MAXMAX    <dest>, <src1>, <src2>

Flags:

Z is set if the upper 16 bits of the result are zero.

N is set if the upper 16 bits of the result are negative.

C For Max: set if src2.h >= src1.h (dest = src1 case).

For Min: set if src2.h >= src1.h (dest = src2 case).

V is preserved.

SZ, SN, SC, SV are set similarly for the lower 16-bit halves.

Reason for inclusion:

As for the 32-bit Max and Min.
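A minimal Python sketch of the parallel forms follows. It assumes that each 16-bit half is compared as a signed two's-complement value and that the two halves are packed into one 32-bit word; the helper names are hypothetical.

```python
def _to_signed16(value):
    """Interpret the low 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def _halves(word):
    """Split a 32-bit word into signed (high, low) halves."""
    return _to_signed16(word >> 16), _to_signed16(word)


def _pack(high, low):
    """Pack two signed 16-bit values back into a 32-bit word."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def minmin(src1, src2):
    """MINMIN: independent minimum on the high and low halves."""
    h1, l1 = _halves(src1)
    h2, l2 = _halves(src2)
    return _pack(h1 if h1 <= h2 else h2, l1 if l1 <= l2 else l2)


def maxmax(src1, src2):
    """MAXMAX: independent maximum on the high and low halves."""
    h1, l1 = _halves(src1)
    h2, l2 = _halves(src2)
    return _pack(h1 if h1 > h2 else h2, l1 if l1 > l2 else l2)
```

With src1 = 0x0001FFFF (halves 1 and -1) and src2 = 0xFFFF0002 (halves -1 and 2), `minmin` yields 0xFFFFFFFF and `maxmax` yields 0x00010002, since each half is decided without reference to the other.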

Move Long Immediate Operation instructions allow a register to be set to any signed, sign-extended 16-bit value. Two of these instructions can set a 32-bit register to any value (by accessing the high and low halves in sequence). For moves between registers, see the select operations.

Encoding (fields from bit 31 down to bit 0):

1 | 11100 | F | SD | DEST | IMMEDIATE_15 | +/- | 000

Mnemonic:

MOV <dest>, #<imm_16>

The assembler uses the MOV instruction to provide a non-interlocking NOP (no operation), i.e. NOP is equivalent to MOV, #0.

Flags: the flags are unaffected.

Reason for inclusion:

Initialising registers/counters.

Multiply Accumulate Operation instructions perform signed multiplication with accumulation or de-accumulation, scaling and saturation.

Encoding (fields from bit 31 down to bit 0):

1 | 10 | OPC | Sa | F | SD | DEST | A1 | R1 | SRC1 | SRC2_MULA

The OPC field specifies the type of instruction.

Operation (OPC):

    00    dest = (acc + (src1 * src2)) {->> scale}

    01    dest = (acc - (src1 * src2)) {->> scale}

In each case the result is saturated before being written to the destination if the Sa bit is set.

Mnemonics:

    00    {S}MULA  <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}

    01    {S}MULS  <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}

An S preceding the command indicates saturation.

Flags: see the section above.

Reason for inclusion:

A one-cycle sustained MULA is required for FIR code. MULS is used in the FFT butterfly. A MULA is also useful for multiplication with rounding. For example, A0 = (X0 * X1 + 16384) >> 15 can be performed in one cycle by holding 16384 in another accumulator (A1, for example). Different <dest> and <acc> are also required for the FFT kernel.
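The rounding multiplication mentioned above can be checked numerically. The Python sketch below models the unsaturated MULA form only; the function name and the use of Python's arbitrary-precision integers (whose `>>` is an arithmetic shift) are assumptions of the example.

```python
def mula(src1_16, src2_16, acc, scale=0):
    """Model dest = (acc + (src1 * src2)) ->> scale without saturation."""
    return (acc + src1_16 * src2_16) >> scale


ROUND = 16384  # half of one unit in the last place for a shift by 15

# 3 * 0.5 in 1.15 fixed point: 16384 represents 0.5.
truncated = mula(3, 16384, 0, scale=15)      # plain multiply, then shift
rounded = mula(3, 16384, ROUND, scale=15)    # accumulator preloaded with 16384
```

Here the exact product is 1.5 units; the plain shift truncates it to 1, while preloading the accumulator with 16384 rounds it to 2.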

Multiply Double Operation instructions perform signed multiplication, doubling the result prior to accumulation or de-accumulation, scaling and saturation.

Encoding (fields from bit 31 down to bit 0):

1 | 10 | 1 | OPC | 1 | F | SD | DEST | A1 | R1 | SRC1 | 0 | A0 | R2 | SRC2 | SCALE

OPC specifies the type of instruction.

Operation (OPC):

    0    dest = SAT((acc + SAT(2 * src1 * src2)) {->> scale})
    1    dest = SAT((acc - SAT(2 * src1 * src2)) {->> scale})

Mnemonics:

    0    SMLDA   <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}

    1    SMLDS   <dest>, <src1_16>, <src2_16>, <acc> {,<scale>}

Flags: see the section above.

Reason for inclusion:

The MLD instruction is required for G.729 and other algorithms that use fractional arithmetic. Most DSPs provide a fractional mode that enables a left shift of one bit at the output of the multiplier, prior to accumulation or write-back. Supporting this as a specific instruction provides more programming flexibility. The names equivalent to some of the G-series basic operations are:

    L_msu => SMLDS

    L_mac => SMLDA

These make use of the saturation of the multiplier when left-shifting by one bit. If a sequence of fractional multiply-accumulates is required with no loss of precision, MULA can be used, with the sum maintained in 33.14 format. If required, a left shift and saturate can be used at the end to convert to 1.15 format.
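The effect of the inner saturation can be seen in a short Python sketch. It assumes that SAT clamps to the signed 32-bit range, which the text does not state explicitly, and the function names are hypothetical.

```python
INT32_MAX = 0x7FFFFFFF
INT32_MIN = -0x80000000


def sat32(value):
    """Clamp value to the signed 32-bit range (assumed meaning of SAT)."""
    return max(INT32_MIN, min(INT32_MAX, value))


def smlda(src1_16, src2_16, acc, scale=0):
    """Model dest = SAT((acc + SAT(2 * src1 * src2)) ->> scale)."""
    return sat32((acc + sat32(2 * src1_16 * src2_16)) >> scale)


def smlds(src1_16, src2_16, acc, scale=0):
    """Model dest = SAT((acc - SAT(2 * src1 * src2)) ->> scale)."""
    return sat32((acc - sat32(2 * src1_16 * src2_16)) >> scale)
```

The only product that overflows when doubled is -32768 * -32768, which would give 2^31; under the assumed 32-bit saturation `smlda(-32768, -32768, 0)` returns 0x7FFFFFFF instead of wrapping.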

Multiply Operation instructions perform signed multiplication, with optional scaling/saturation. The source registers (16 bits only) are treated as signed numbers.

Encoding (fields from bit 31 down to bit 0):

00011 | OPC | F | SD | DEST | S1 | R1 | SRC1 | SRC2

OPC specifies the type of instruction.

Operation (OPC):

    0    dest = (src1 * src2) {->> scale}
    1    dest = SAT((src1 * src2) {->> scale})

Mnemonics:

    0    MUL <dest>, <src1_16>, <src2> {,<scale>}

    1    SMUL <dest>, <src1_16>, <src2> {,<scale>}

Flags: see the section above.

Reason for inclusion:

Signed and saturated multiplications are required by many processes.

Register List Operations are used to perform actions on a set of registers. The EMPTY and ZERO instructions are provided for resetting a selection of registers before, or in between, routines. The OUTPUT instruction is provided to store the contents of a list of registers to the output FIFO.

Encoding (fields from bit 31 down to bit 0):

1 | 11111 | 0 | OPC | 00 | REGISTER_LIST_16 | SCALE

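OPC specifies the type of instruction.

Operation (OPC):

    000    for (k = 0; k < 16; k++): if bit k of the register list is set, register k is marked as empty.

    001    for (k = 0; k < 16; k++): if bit k of the register list is set, register k is set to contain 0.

    010    Unused

    011    Unused

    100    for (k = 0; k < 16; k++): if bit k of the register list is set, (register k ->> scale) is written to the output FIFO.

    101    for (k = 0; k < 16; k++): if bit k of the register list is set, (register k ->> scale) is written to the output FIFO and register k is marked for refill.

    110    for (k = 0; k < 16; k++): if bit k of the register list is set, SAT(register k ->> scale) is written to the output FIFO.

    111    for (k = 0; k < 16; k++): if bit k of the register list is set, SAT(register k ->> scale) is written to the output FIFO and register k is marked for refill.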

Mnemonics:

    000  EMPTY    <register_list>

    001  ZERO     <register_list>

    010  Unused

    011  Unused

    100  OUTPUT   <register_list> {,<scale>}

    101  OUTPUT   <register_list>^ {,<scale>}

    110  SOUTPUT  <register_list> {,<scale>}

    111  SOUTPUT  <register_list>^ {,<scale>}

Flags: unaffected.

Examples:

    EMPTY    {A0,A1,X0-X3}

    ZERO     {Y0-Y3}

    OUTPUT   {X0-Y1}^

The assembler also supports the syntax OUTPUT Rn, in which case a single register is output using a MOV ^, Rn instruction.

The EMPTY instruction will stall until all registers to be emptied contain valid data (i.e. are not empty).

Register list operations must not be used within remapping REPEAT loops.

The OUTPUT instruction can specify at most eight registers to output.

Reason for inclusion:

After a routine has finished, the next routine expects all registers to be empty so that it can receive data from the ARM. An EMPTY instruction is needed to accomplish this. Before performing a FIR or other filter, all accumulators and partial results must be zeroed. The ZERO instruction helps with this. Both are designed to improve code density by replacing a series of single-register moves. The OUTPUT instruction is included to improve code density by replacing a series of MOV ^, Rn instructions.

A Remapping Parameter Move instruction, RMOV, is provided to allow the configuration of the user-defined register remapping parameters.

The instruction encoding is as follows (fields from bit 31 down to bit 0):

1 | 11111 | 101 | 00 | ZPARAMS | YPARAMS | XPARAMS

Each PARAMS field comprises the following entries (bits 6 down to 0):

BASEWRAP | BASEINC | 0 | RENUMBER

The meanings of these entries are as follows:

Parameter	Explanation
RENUMBER	The number of 16-bit registers on which remapping is performed; may take the values 0, 2, 4, 8. Registers below RENUMBER are remapped; those above are accessed directly.
BASEINC	The amount by which the base pointer is incremented at the end of each loop. May take the values 1, 2 or 4.
BASEWRAP	The base wrapping modulus; may take the values 2, 4, 8.

Mnemonic:

    RMOV <PARAMS>, [<PARAMS>]

The <PARAMS> field has the following format:

    <PARAMS>   ::= <BANK><BASEINC>n<RENUMBER>w<BASEWRAP>

    <BANK>     ::= [X|Y|Z]

    <BASEINC>  ::= [++|+1|+2|+4]

    <RENUMBER> ::= [0|2|4|8]

    <BASEWRAP> ::= [2|4|8]

If the RMOV instruction is used while remapping is active, the behaviour is UNPREDICTABLE.

Flags: unaffected.

Repeat instructions provide four zero-cycle loops in hardware. A REPEAT instruction defines a new hardware loop. Piccolo uses hardware loop 0 for the first REPEAT instruction, hardware loop 1 for a REPEAT instruction nested within the first, and so on. The REPEAT instruction does not need to specify which loop is being used. REPEAT loops must be strictly nested. If an attempt is made to nest loops to a depth greater than 4, the behaviour is unpredictable.

Each REPEAT instruction specifies the number of instructions in the loop (which immediately follows the REPEAT instruction) and the number of times to go around the loop (which is either a constant or read from a Piccolo register).

If the number of instructions in the loop is small (1 or 2), Piccolo may take extra cycles to set the loop up.

If the loop count is register-specified, a 32-bit access is implied (S1=1), though only the bottom 16 bits are significant and the number is treated as unsigned. If the loop count is zero, the action of the loop is undefined. A copy of the loop count is taken, so the register can be reused immediately (or even refilled) without affecting the loop.

The REPEAT instruction provides a mechanism for modifying the way in which register operands are specified within a loop. The details are described above.

Encoding of a REPEAT with a register-specified number of loops (fields from bit 31 down to bit 0):

1 | 1110 | 0 | RFIELD4 | 00 | 0 | R1 | SRC1 | 0000 | #INSTRUCTIONS_8

Encoding of a REPEAT with a fixed number of loops (fields from bit 31 down to bit 0):

1 | 11110 | 1 | RFIELD4 | #LOOPS_13 | #INSTRUCTIONS_8

The RFIELD operand specifies which of 16 remapping parameter configurations to use inside the loop.

RFIELD	Remapping operation
0	No remapping performed
1	User-defined remapping
2..15	Preset remapping configurations TBD

The assembler provides two opcodes, REPEAT and NEXT, for defining a hardware loop. REPEAT goes at the start of the loop and NEXT delimits the end of the loop, allowing the assembler to calculate the number of instructions in the loop body. The REPEAT need only specify the number of loops, either as a constant or as a register. For example:

    REPEAT     X0

    MULA       A0,Y0.l,Z0.l,A0

    MULA       A0,Y0.h^,Z0.h^,A0

    NEXT

This will execute the two MULA instructions X0 times. Similarly,

    REPEAT    #10

    MULA      A0,X0^,Y0^,A0

    NEXT

will perform 10 multiply-accumulates.

The assembler supports the syntax:

REPEAT #iterations [, <PARAMS>]

to specify the remapping parameters to use for the REPEAT. If the required remapping parameters equal one of the predefined sets of parameters, the appropriate REPEAT encoding is used. If not, the assembler generates an RMOV to load the user-defined parameters, followed by a REPEAT instruction. See the section above for details of the RMOV instruction and the remapping parameter format.

If the number of iterations of a loop is 0, the action of REPEAT is unpredictable.

If the number-of-instructions field is set to 0, the action of REPEAT is unpredictable.

A loop that consists of only one instruction, where that instruction is a branch, has unpredictable behaviour.

Branches from within a REPEAT loop to outside the bounds of that loop are unpredictable.

The Saturating Absolute instruction calculates the saturated absolute value of source 1.

Operation:

dest = SAT((src1 >= 0) ? src1 : -src1). The value is always saturated.

Mnemonic:

SABS <dest>, <src1>

Flags:

Z is set if the result is zero.

N is preserved.

C is set if src1 < 0 (dest = -src1 case).

V is set if saturation occurred.

Reason for inclusion:

Useful in many DSP applications.
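The one input for which saturation matters is the most negative value, whose negation does not fit in the register. The Python sketch below assumes a signed 32-bit operand; the function name and the dictionary used to report the flags are hypothetical.

```python
INT32_MAX = 0x7FFFFFFF


def sabs(src1):
    """Model dest = SAT((src1 >= 0) ? src1 : -src1) for a 32-bit operand.

    Returns (dest, flags), where flags holds Z, C and V as described in
    the text; N is preserved by the instruction and so is not modelled.
    """
    result = src1 if src1 >= 0 else -src1
    saturated = result > INT32_MAX
    if saturated:
        result = INT32_MAX
    flags = {"Z": result == 0, "C": src1 < 0, "V": saturated}
    return result, flags
```

For src1 = -2^31 the plain negation would be 2^31, so the sketch returns 0x7FFFFFFF with V set.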

Select Operation (conditional move) instructions conditionally move either source 1 or source 2 into the destination register. A select is always equivalent to a move. There are also parallel operations for use after parallel adds/subtracts.

Note that, for implementation reasons, both source operands may be read by the instruction; if either one is empty, the instruction will stall, irrespective of whether that operand is strictly required.

Encoding (fields from bit 31 down to bit 0):

1 | 011 | OPC | F | SD | DEST | S1 | R1 | SRC1 | SRC2_SEL

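OPC specifies the type of instruction.

Operation (OPC):

    00    If <cond> holds for the primary flags then dest = src1, else dest = src2.

    01    If <cond> holds for the primary flags then dest.h = src1.h, else dest.h = src2.h.

          If <cond> holds for the secondary flags then dest.l = src1.l, else dest.l = src2.l.

    10    If <cond> holds for the primary flags then dest.h = src1.h, else dest.h = src2.h.

          If <cond> fails for the secondary flags then dest.l = src1.l, else dest.l = src2.l.

    11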

Mnemonics:

    00    SEL<cond>    <dest>, <src1>, <src2>

    01    SELTT<cond>  <dest>, <src1>, <src2>

    10    SELTF<cond>  <dest>, <src1>, <src2>

    11

If a register is marked for refill, it is unconditionally refilled. The assembler also provides the following mnemonics:

    MOV<cond>    <dest>, <src1>

    SELFT<cond>  <dest>, <src1>, <src2>

    SELFF<cond>  <dest>, <src1>, <src2>

MOV<cond> A, B is equivalent to SEL<cond> A, B, A. SELFT and SELFF are obtained by swapping src1 and src2 and using SELTF and SELTT.

Flags: all flags are preserved so that a sequence of selects may be performed.

Reason for inclusion:

Used for making simple decisions inline without having to resort to a branch. Used by Viterbi algorithms and when scanning a sample or vector for the largest element.

Shift Operation instructions provide left and right logical shifts, right arithmetic shifts, and rotates by a specified amount. The shift amount is either a signed integer between -128 and +127 taken from the bottom 8 bits of the register contents, or an immediate in the range +1 to +31. A shift by a negative amount causes a shift in the opposite direction by ABS(shift amount).

The input operands are sign-extended to 32 bits; the resulting 32-bit output is sign-extended to 48 bits before write-back so that a write to a 48-bit register behaves sensibly.

Encoding (fields from bit 31 down to bit 0):

1 | 010 | OPC | F | SD | DEST | S1 | R1 | SRC1 | SRC2_SEL

OPC specifies the type of instruction.

Operation (OPC):

    00    dest = (src2 >= 0) ? src1 << src2 : src1 >> -src2
    01    dest = (src2 >= 0) ? src1 >> src2 : src1 << -src2
    10    dest = (src2 >= 0) ? src1 ->> src2 : src1 << -src2
    11    dest = (src2 >= 0) ? src1 ROR src2 : src1 ROL -src2

Mnemonics:

    00    ASL <dest>, <src1>, <src2_16>

    01    LSR <dest>, <src1>, <src2_16>

    10    ASR <dest>, <src1>, <src2_16>

    11    ROR <dest>, <src1>, <src2_16>

Flags:

Z is set if the result is zero.

N is set if the result is negative.

V is preserved.

C is set to the value of the last bit shifted out (as on the ARM).

The behaviour of register-specified shifts is:

-LSL by 32 has a result of zero; C is set to bit 0 of src1.

-LSL by more than 32 has a result of zero; C is set to zero.

-LSR by 32 has a result of zero; C is set to bit 31 of src1.

-LSR by more than 32 has a result of zero; C is set to zero.

-ASR by 32 or more has a result filled with bit 31 of src1; C is set to bit 31 of src1.

-ROR by 32 has a result equal to src1; C is set to bit 31 of src1.

-ROR by n, where n is greater than 32, gives the same result and carry out as ROR by n-32; therefore repeatedly subtract 32 from n until the amount is in the range 1 to 32, then see above.

Reason for inclusion:

Multiplication/division by a power of 2. Bit and field extraction. Serial registers.
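The boundary cases listed for register-specified logical shifts can be captured in a small Python model. This is a sketch under stated assumptions: the function names are hypothetical, each function returns a (result, carry) pair, and a shift amount of 0 is assumed to leave the carry unchanged (reported here as `None`), which the text does not specify.

```python
MASK32 = 0xFFFFFFFF


def lsl(src1, amount):
    """Logical shift left with carry-out; negative amounts shift right."""
    src1 &= MASK32
    if amount < 0:
        return lsr(src1, -amount)
    if amount == 0:
        return src1, None                      # carry unchanged (assumption)
    if amount < 32:
        return (src1 << amount) & MASK32, (src1 >> (32 - amount)) & 1
    if amount == 32:
        return 0, src1 & 1                     # C = bit 0 of src1
    return 0, 0                                # more than 32


def lsr(src1, amount):
    """Logical shift right with carry-out; negative amounts shift left."""
    src1 &= MASK32
    if amount < 0:
        return lsl(src1, -amount)
    if amount == 0:
        return src1, None                      # carry unchanged (assumption)
    if amount < 32:
        return src1 >> amount, (src1 >> (amount - 1)) & 1
    if amount == 32:
        return 0, (src1 >> 31) & 1             # C = bit 31 of src1
    return 0, 0                                # more than 32
```

For amounts from 1 to 31 the carry is the last bit shifted out, so `lsl(0x80000001, 1)` gives a result of 2 with carry 1.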

Undefined instructions are set out above in the instruction set listing. Their execution causes Piccolo to halt execution, set the U bit in the status register, and disable itself (as if the E bit in the control register had been cleared). This allows any future extensions of the instruction set to be trapped and optionally emulated on existing implementations.

Piccolo state is accessed from the ARM as follows. State access mode is used to observe/modify the state of Piccolo. This mechanism is provided for two purposes:

-Context switching.

-Debugging.

Piccolo is put into state access mode by executing the PSTATE instruction. This mode allows all Piccolo state to be saved and restored with a sequence of STC and LDC instructions. When Piccolo enters state access mode, the use of the Piccolo coprocessor ID PICCOLO1 is modified to allow the state of Piccolo to be accessed. There are seven banks of Piccolo state. All the data within a particular bank can be loaded or stored with a single LDC or STC.

Bank 0: Special registers.

-One 32-bit word containing the value of the Piccolo ID register (read only).

-One 32-bit word containing the state of the control register.

-One 32-bit word containing the state of the status register.

-One 32-bit word containing the state of the program counter.

Bank 1: General-purpose registers (GPR).

-16 32-bit words containing the general-purpose register state.

Bank 2: Accumulators.

-4 32-bit words containing the upper 32 bits of the accumulator registers (note that the duplication with the GPR state is necessary for restore purposes; otherwise another write enable would be implied on the register bank).

Bank 3: Register/Piccolo ROB/output FIFO status.

-One 32-bit word indicating which registers are marked for refill (2 bits for each 32-bit register).

-8 32-bit words containing the state of the ROB tags (eight 7-bit items stored in bits 7 to 0).

-3 32-bit words containing the state of the unaligned ROB latches (bits 17 to 0).

-One 32-bit word indicating which slots in the output shift register contain valid data (bit 4 indicates empty; bits 3 to 0 encode the number of entries used).

-One 32-bit word containing the state of the output FIFO holding latch (bits 17 to 0).

Bank 4: ROB input data.

-8 32-bit data values.

Bank 5: Output FIFO data.

-8 32-bit data values.

Bank 6: Loop hardware.

-4 32-bit words containing the loop start addresses.

-4 32-bit words containing the loop end addresses.

-4 32-bit words containing the loop counts (bits 15 to 0).

-One 32-bit word containing the user-defined remapping parameters and other remapping state.

The LDC instruction is used to load Piccolo state when Piccolo is in state access mode. The BANK field specifies which bank is being loaded.

Encoding (fields from bit 31 down to bit 0):

COND | 110 | P | U | 0 | W | 1 | BASE | BANK | PICCOLO1 | OFFSET

The following sequence loads all Piccolo state from the address in register R0.

    LDP B0, [R0], #16 !    ; load special registers

    LDP B1, [R0], #64 !    ; load general-purpose registers

    LDP B2, [R0], #16 !    ; load accumulators

    LDP B3, [R0], #56 !    ; load register/ROB/FIFO status

    LDP B4, [R0], #32 !    ; load ROB data

    LDP B5, [R0], #32 !    ; load output FIFO data

    LDP B6, [R0], #52 !    ; load loop hardware

The STC instruction is used to store Piccolo state when Piccolo is in state access mode. The BANK field specifies which bank is being stored.

Encoding (fields from bit 31 down to bit 0):

COND | 110 | P | U | 0 | W | 0 | BASE | BANK | PICCOLO1 | OFFSET

The following sequence stores all Piccolo state to the address in register R0.

    STP B0, [R0], #16 !    ; save special registers

    STP B1, [R0], #64 !    ; save general-purpose registers

    STP B2, [R0], #16 !    ; save accumulators

    STP B3, [R0], #56 !    ; save register/ROB/FIFO status

    STP B4, [R0], #32 !    ; save ROB data

    STP B5, [R0], #32 !    ; save output FIFO data

    STP B6, [R0], #52 !    ; save loop hardware
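The byte offsets used in the load and store sequences follow directly from the number of 32-bit words listed for each group of state. The Python sketch below recomputes them; the table and function names are hypothetical, and the word counts are simply the totals of the items listed for each group (for example 1 + 8 + 3 + 1 + 1 = 14 words for group 3).

```python
# Number of 32-bit words in each group of Piccolo state, indexed by group.
BANK_WORDS = [4, 16, 4, 14, 8, 8, 13]


def bank_layout():
    """Return ([(bank, byte_offset, byte_size), ...], total_bytes)."""
    layout = []
    offset = 0
    for bank, words in enumerate(BANK_WORDS):
        size = 4 * words              # each word is 4 bytes
        layout.append((bank, offset, size))
        offset += size
    return layout, offset


layout, total_bytes = bank_layout()
```

The sizes come out as 16, 64, 16, 56, 32, 32 and 52 bytes, matching the post-increment values in the sequences, so a complete state image occupies 268 bytes.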

Debug mode - Piccolo needs to respond to the same debug mechanisms as are supported by the ARM, i.e. software through Demon and Angel, and hardware with Embedded ICE. There are several mechanisms for debugging a Piccolo system:

-ARM instruction breakpoints.

-Data breakpoints (watchpoints).

-Piccolo instruction breakpoints.

-Piccolo software breakpoints.

ARM instruction and data breakpoints are handled by the ARM Embedded ICE module; Piccolo instruction breakpoints are handled by the Piccolo Embedded ICE module; Piccolo software breakpoints are handled by the Piccolo core.

The hardware breakpoint system can be configured so that both the ARM and Piccolo are breakpointed.

Software breakpoints are handled by a Piccolo instruction (Halt or Break), which causes Piccolo to halt execution, enter debug mode (the B bit in the status register is set) and disable itself (as if Piccolo had been disabled with a PDISABLE instruction). The program counter remains valid, allowing the address of the breakpoint to be recovered. Piccolo no longer executes instructions.

Single-stepping Piccolo can be done by setting breakpoint after breakpoint on the Piccolo instruction stream.

Software debug - The basic functionality provided by Piccolo is the ability to load and save all state to memory via coprocessor instructions when in state access mode. This allows a debugger to save all state to memory, read and/or update it, and restore it to Piccolo. The Piccolo store-state mechanism is non-destructive, i.e. storing the state of Piccolo does not corrupt any of Piccolo's internal state. This means that Piccolo can be restarted after dumping its state without first restoring it.

The mechanism for finding the state of the Piccolo cache is to be determined.

Hardware debug - Hardware debug is provided by a scan chain on Piccolo's coprocessor interface. Piccolo can then be put into state access mode and have its state examined/modified via the scan chain.

The Piccolo status register contains a single bit to indicate that it has executed a breakpointed instruction. When a breakpointed instruction is executed, Piccolo sets the B bit in the status register and halts execution. To be able to interrogate Piccolo, the debugger must enable Piccolo and put it into state access mode by writing to its control register before subsequent accesses can occur.

Fig. 4 illustrates a multiplexer arrangement, responsive to the Hi/Lo bit and the Size bit, that switches the appropriate half of the selected register onto the Piccolo data path. If the Size bit indicates 16 bits, a sign-extending circuit pads the high-order bits of the data path with 0s or 1s as appropriate.
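The selection performed by the arrangement of Fig. 4 can be sketched in Python. The function and parameter names are hypothetical, and for a 32-bit operand the sketch simply passes the whole register through; any other use of the high/low bit for full-width operands is not modelled.

```python
def select_operand(register, size_is_16, high_half):
    """Return the 32-bit value placed on the data path.

    register   -- 32-bit register contents
    size_is_16 -- True if the size bit indicates a 16-bit operand
    high_half  -- True if the high/low bit selects the upper 16 bits
    """
    register &= 0xFFFFFFFF
    if not size_is_16:
        return register                       # full-width operand
    half = (register >> 16) if high_half else register
    half &= 0xFFFF                            # multiplexer output
    if half & 0x8000:                         # sign bit of the half-word
        half |= 0xFFFF0000                    # pad the high bits with 1s
    return half                               # otherwise padded with 0s
```

With the register holding 0x80017FFF, selecting the high half yields 0xFFFF8001 (negative, padded with 1s) and selecting the low half yields 0x00007FFF (positive, padded with 0s).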

Claims

1. Apparatus for processing data, said apparatus comprising:

2. Apparatus as claimed in claim 1, further comprising an N-bit data bus for transferring data words between a data storage device and said plurality of registers.

3. Apparatus as claimed in claim 2, further comprising an input buffer for receiving data words from said N-bit data bus and supplying said N-bit data words to said plurality of registers.

4. Apparatus as claimed in any one of the preceding claims, wherein said arithmetic logic unit is responsive to at least one parallel operation program instruction word to perform separate arithmetic logic operations upon a first (N/2)-bit input operand data word and a second (N/2)-bit input operand data word stored respectively in the high order bit positions and the low order bit positions of a source register.

5. Apparatus as claimed in claim 4, wherein said arithmetic logic unit has a signal path that functions as a carry chain between bit positions in arithmetic logic operations and, when a parallel operation program instruction word is executed, said signal path is broken between said first (N/2)-bit input operand data word and said second (N/2)-bit input operand data word.

6. Apparatus as claimed in claim 4 or 5, wherein said parallel operation program instruction word performs one of the following arithmetic logic operations:

(i) a parallel add, in which two parallel (N/2)-bit additions are performed;

7. Apparatus as claimed in any one of the preceding claims, wherein, when said input operand size flag indicates an N-bit size, said high/low location flag indicates whether those bits stored in said high order bit positions are to be moved to said low order bit positions, and those bits stored in said low order bit positions are to be moved to said high order bit positions, before being used as an N-bit input operand data word.

8. Apparatus as claimed in any one of the preceding claims, wherein said arithmetic logic unit has an N-bit data path.

9. Apparatus as claimed in claim 8, further comprising at least one multiplexer responsive to said high/low location flag for selecting, for supply to the low (N/2) bits of said data path, an (N/2)-bit input operand data word stored in one of the high order bit positions of said source register and the low order bit positions of said source register.

10. Apparatus as claimed in claim 8 or 9, further comprising a circuit for sign extending an (N/2)-bit input operand data word before it is input to said N-bit data path.

11. A method of processing data, said method comprising the steps of:

At least one programmed instruction word comprises: