CN1135468C

CN1135468C - Digital signal processing integrated circuit architecture

Info

Publication number: CN1135468C
Application number: CNB971981442A
Authority: CN
Inventors: D. V. Jaggar; S. J. Glass
Original assignee: Advanced Risc Machines Ltd
Current assignee: ARM Ltd
Priority date: 1996-09-23
Filing date: 1997-08-22
Publication date: 2004-01-21
Anticipated expiration: 2017-08-22
Also published as: JP2001501330A; KR100500890B1; CN1231741A; RU2223535C2; KR20000048533A; JP3756195B2; GB2317468A; WO1998012629A1; TW318915B; IL128321A; EP0927393B1; GB9619833D0; EP0927393A1; MY115104A; DE69707486T2; DE69707486D1; IL128321A0; GB2317468B

Abstract

A digital signal processing system has a microprocessor unit 2 operating under control of microprocessor program instruction words which controls data transfer to and from a data storage device 8 and the supply and fetching of data to and from a digital signal processing unit 4.

Description

Method and apparatus for performing digital signal processing on signal data words stored in a data storage device

The present invention relates to the field of digital signal processing. More particularly, it relates to integrated circuit architectures for use in digital signal processing.

Digital signal processing systems are characterised by the need to perform relatively complex arithmetic and logical operations on high volumes of data to produce a real-time output data stream. Typical applications of digital signal processing include mobile telephones, which must convert in real time between analogue audio signals and coded digital signals for transmission.

In view of the specialised and demanding nature of digital signal processing applications, it is usual to provide application specific integrated circuits with an architecture adapted to the particular use.

As an example, a typical digital signal processing operation might require three input operand data words (typically of 16 or 32 bits) upon which multiplication and/or addition operations are performed to yield one output data word. The input operands are located within large volumes of input data and are fetched separately from memory each time they are required. To provide sufficient memory access bandwidth, multiple physical memories and memory buses/ports are used.

Whilst the above integrated circuit architecture allows the high volumes of data involved to be handled, it has the disadvantage that complex and expensive memory structures are needed, and the architecture of the integrated circuit is so application specific that considerable hardware changes are required for different end uses. Furthermore, keeping the memory buses active on virtually every cycle consumes a large amount of power, which is a significant disadvantage, especially in battery powered mobile devices.

European published application EP-A-0,442,041 discloses a system having a DSP unit and a general purpose processor (GPP). The GPP fetches some operands from main memory and stores them in a region of memory shared with the DSP unit. The GPP then starts a DSP operation on the DSP unit and enters a wait state. When the DSP operation has completed, the GPP resumes other operations.

Viewed from one aspect, the present invention provides a method of performing digital signal processing, using a digital signal processing apparatus, upon signal data words stored in a data storage device, said method comprising the steps of:

generating, with a microprocessor unit operating under control of microprocessor unit program instruction words, address words for addressing storage locations storing said signal data words in said data storage device;

reading, under control of said microprocessor unit, said signal data words from said addressed storage locations in said data storage device storing said signal data words;

supplying, under control of said microprocessor unit, said signal data words to a digital signal processing unit operating under control of digital signal processing unit program instruction words;

performing, with said digital signal processing unit operating under control of digital signal processing unit program instruction words, arithmetic logic operations including at least one of a convolution operation, a correlation operation and a transform operation upon said signal data words to generate result data words;

fetching, with said microprocessor unit operating under control of microprocessor unit program instruction words, said result data words from said digital signal processing unit; characterised in that:

said digital signal processing unit performs said arithmetic logic operations in parallel with said supplying and fetching operations performed by said microprocessor unit.

The invention recognises that the tasks of managing and driving memory accesses to the data storage device can be separated from the digital signal processing operations, such as convolution, correlation and transform operations. The result is a system whose overall complexity is reduced through the use of a simpler memory structure, yet which can still handle the large volumes of data involved in digital signal processing and operate in real time.

The invention uses a microprocessor to generate the appropriate address words to access the data storage device, to read the data storage device and to supply the data words to the digital signal processing unit. The microprocessor is also responsible for fetching the result data words from the digital signal processing unit. In this way, the digital signal processing unit is able to operate independently of the mass data storage device to which it is coupled and is decoupled from the work of data transfer and management. Furthermore, the sophisticated manner in which a microprocessor can be programmed to control and manage memory operations allows a single data storage device to hold data from multiple sources and for multiple uses.

In preferred embodiments, the method further comprises generating, under control of said microprocessor unit, address words for addressing storage locations for storing said result data words in said data storage device;

and writing, under control of said microprocessor unit, said result data words to said addressed storage locations for storing said result data words in said data storage device.

As well as controlling the transfer of signal data words from the data storage device to the digital signal processing unit, the microprocessor also operates, if required, to control the writing back into the data storage device of the result data words produced by the digital signal processing unit.

Viewed from another aspect, the present invention provides apparatus for performing digital signal processing upon signal data words stored in a data storage device, said apparatus comprising:

a microprocessor unit operating under control of microprocessor unit program instruction words to generate address words for addressing storage locations in said data storage device, and to control the transfer of said signal data words between said apparatus for performing digital signal processing and said data storage device; and

a digital signal processing unit operating under control of digital signal processing unit program instruction words to perform arithmetic logic operations, including at least one of a convolution operation, a correlation operation and a transform operation, upon said signal data words fetched from said data storage device by said microprocessor unit, to generate result data words; characterised in that:

said microprocessor unit and said digital signal processing unit operate in parallel.

In preferred embodiments of the invention, said microprocessor unit is responsive to a multiple supply instruction word to supply a plurality of sequentially addressed signal data words to said digital signal processing unit.

The ability of the microprocessor to control burst mode transfers allows more efficient use of the memory bus. The microprocessor's ability to respond in a more sophisticated way to the state of the overall system also allows such burst mode transfers to be used to best effect.

Although the digital signal processing unit could accept one signal data word at a time, in preferred embodiments of the invention said digital signal processing unit includes a multi-word input buffer.

Providing a multi-word input buffer in the digital signal processing unit allows burst mode transfers to be used between the microprocessor and the digital signal processing unit as well. This further enhances data transfer efficiency within the system. It also improves the ability of the digital signal processing unit to operate independently of the microprocessor, because the buffered input signal data words can be processed without interruption while the microprocessor performs transfers from the data storage device.

Corresponding considerations apply on the output side of the digital signal processing unit.

The flexibility with which the microprocessor can control the data storage device is improved in systems in which a multiplexed data and instruction bus links said data storage device and said digital signal processing apparatus, transferring said signal data words, said microprocessor unit program instruction words and said digital signal processing unit program instruction words to said digital signal processing apparatus.

In preferred embodiments of the invention, said digital signal processing unit includes a bank of digital signal processing unit registers for holding data words upon which arithmetic logic operations are to be performed, said digital signal processing unit program instruction words including register specifying fields.

The use of a bank of registers in the digital signal processing unit, with the register to be used for a particular operation specified within the digital signal processing unit program instruction word, brings considerable flexibility to the way in which the digital signal processing unit can operate. Furthermore, a signal data word can be loaded into a register in the digital signal processing unit and used several times before being replaced by another signal data word. Such reuse of signal data words within the digital signal processing unit reduces the amount of data traffic, easing the power consumption and bandwidth problems associated with existing systems.

In preferred embodiments of the invention, said input buffer stores, for each data word stored in said input buffer, destination data identifying a destination digital signal processing unit register.

Providing destination data that identifies a digital signal processing unit register makes better use of the capabilities of the microprocessor: the microprocessor can do the work of identifying the destination within the register bank for a particular signal data word, relieving the digital signal processing unit of this task.

The digital signal processing unit could fetch new signal data words from the input buffer in a number of ways. However, in preferred embodiments of the invention, a digital signal processing unit program instruction word that reads a digital signal processing unit register includes a flag indicating that the data word stored in that register may be replaced by a data word stored in said input buffer having matching destination data.

By merely marking its own registers as requiring a refill operation, the digital signal processing unit frees itself from managing the data transfers between the input buffer and itself. Other circuitry can then take responsibility for satisfying the refill request, and the refill can take place at any time before the new data in the register concerned is required for use.

To further enhance the independence of the microprocessor unit and the digital signal processing unit, if said input buffer contains a plurality of data words with matching destination data, then said digital signal processing unit register is refilled with the data word, among those with matching destination data, that was stored first in said input buffer.

In the case of a burst mode transfer from the microprocessor to the input buffer, the destination data may be incremented for each word, optionally with a limiting destination data value after which a wrap occurs.

The microprocessor unit and the digital signal processing unit may be interlocked such that if one is waiting for an operation to be completed by the other, the waiting unit is stalled. This feature is further enhanced if the digital signal processing unit reduces its power consumption while stalled.

It will be appreciated that the microprocessor unit and the digital signal processing unit could be fabricated on separate integrated circuits, but there are considerable benefits in space, speed, power consumption and cost if they are fabricated on a single integrated circuit.

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Fig. 1 illustrates the high level configuration of a digital signal processing apparatus;

Fig. 2 illustrates the input buffer and register configuration of the coprocessor;

Fig. 3 illustrates the datapath through the coprocessor;

Fig. 4 illustrates multiplexing circuitry for reading the high or low order bits from a register;

Fig. 5 is a block diagram illustrating the register remapping logic used by the coprocessor in preferred embodiments;

Fig. 6 illustrates in more detail the register remapping logic shown in Fig. 5; and

Fig. 7 is a table illustrating a block filter algorithm.

The system described below is concerned with digital signal processing (DSP). DSP can take many forms, but may generally be regarded as processing that requires the high speed (real time) handling of large volumes of data. This data typically represents some analogue physical signal. A good example of DSP is its use in digital mobile telephones, in which radio signals are received and transmitted that require decoding from and encoding to an analogue sound signal (typically using convolution, transform and correlation operations). Another example is disk drive controllers, in which the signals recovered from the disk heads are processed to yield head tracking control.

In the context of the above, there follows a description of a digital signal processing system based upon a microprocessor core (in this case an ARM core from the range of microprocessors designed by Advanced RISC Machines Limited of Cambridge, United Kingdom) cooperating with a coprocessor. The interface between the microprocessor and the coprocessor, and the coprocessor architecture itself, are specifically configured to provide DSP functionality. The microprocessor core will be referred to as ARM and the coprocessor as Piccolo. The ARM and the Piccolo will typically be fabricated as a single integrated circuit that often includes other elements (e.g. on-chip DRAM, ROM, D/A and A/D converters etc.) as part of an ASIC.

Piccolo is an ARM coprocessor and therefore executes part of the ARM instruction set. The ARM coprocessor instructions allow ARM to transfer data between Piccolo and memory (using Load Coprocessor, LDC, and Store Coprocessor, STC, instructions), and to transfer ARM registers to and from Piccolo (using move to coprocessor, MCR, and move from coprocessor, MRC, instructions). One way of viewing the cooperative interaction of ARM and Piccolo is that ARM acts as a powerful address generator for Piccolo data, leaving Piccolo free to perform the DSP operations that require real time handling of large volumes of data to produce corresponding real time results.

Fig. 1 illustrates the ARM 2 and Piccolo 4, with the ARM 2 issuing control signals to the Piccolo 4 to control the transfer of data words to and from the Piccolo 4. An instruction cache 6 stores the Piccolo program instruction words required by the Piccolo 4. A single DRAM memory 8 stores all the data and instruction words required by both the ARM 2 and the Piccolo 4. The ARM 2 is responsible for addressing the memory 8 and controlling all data transfers. This arrangement, with only a single memory 8 and one set of data and address buses, is simpler and cheaper than the typical DSP approach, which requires multiple memories and buses with high bus bandwidths.

Piccolo executes a second instruction stream (the digital signal processing program instruction words) from the instruction cache 6, which controls the Piccolo datapath. These instructions include digital signal processing type operations, for example multiply-accumulate, and control flow instructions, for example zero overhead loop instructions. The instructions operate on data held in the Piccolo registers 10 (see Fig. 2). This data was transferred earlier from the memory 8 by the ARM 2. The instructions are streamed from the instruction cache 6; the instruction cache 6 drives the data bus as a full bus master. A small Piccolo instruction cache 6 is a direct mapped cache of 4 lines with 16 words per line (64 instructions). In some implementations it may be worthwhile to make the instruction cache larger.

Thus two tasks run independently: ARM loads the data and Piccolo processes it. This allows sustained single cycle data processing on 16 bit data. Piccolo has a data input mechanism (illustrated in Fig. 2) that allows the ARM to prefetch sequential data, loading the data before Piccolo requires it. Piccolo can access the loaded data in any order, automatically refilling its registers as the old data is used for the last time (every instruction has, for each source operand, a flag indicating that the source register should be refilled). This input mechanism is termed the reorder buffer and comprises an input buffer 12. Every value loaded into Piccolo (via an LDC or MCR, see below) carries a tag Rn specifying the register for which the value is destined. The tag Rn is stored alongside the data word in the input buffer. When a register is accessed via a register selecting circuit 14 and the instruction specifies that the data register is to be refilled, the register is marked as empty by asserting a signal E. The register is then refilled automatically by a refill control circuit 16 with the oldest loaded value in the input buffer 12 that is destined for that register. The reorder buffer holds 8 tagged values. The input buffer 12 has a form similar to a FIFO, except that data words can be extracted from the middle of the queue, after which later stored words are passed along to fill the space. Accordingly, the data words furthest from the input are the oldest, and this is used to decide which data word should refill a register when the input buffer 12 holds two data words with the correct tag Rn.
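The refill behaviour of the reorder buffer can be sketched in a few lines of Python. This is a toy model only: the class and method names (`ReorderBuffer`, `load`, `refill`) are hypothetical and chosen for illustration, and the model ignores half-register entries and cycle timing.

```python
from collections import namedtuple

# One tagged value: `tag` names the destination register, `value` is the data word.
Entry = namedtuple("Entry", ["tag", "value"])


class ReorderBuffer:
    """Toy model of a tagged input buffer: FIFO order, but entries may leave from the middle."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = []  # index 0 is the oldest entry

    def load(self, tag, value):
        """Append a tagged value; return False (the sender would stall) if the buffer is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(Entry(tag, value))
        return True

    def refill(self, tag):
        """Remove and return the oldest value tagged for `tag`, or None if there is none."""
        for i, entry in enumerate(self.entries):
            if entry.tag == tag:
                # Later entries move forward to close the gap, as list.pop does here.
                return self.entries.pop(i).value
        return None


rob = ReorderBuffer()
rob.load("X0", 11)
rob.load("X1", 22)
rob.load("X0", 33)
first = rob.refill("X0")   # oldest value tagged X0
second = rob.refill("X0")  # next value tagged X0
```

Here two values are tagged for the same register, and the older one is delivered first while the entry for the other register stays queued.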

Piccolo outputs data by storing it in an output buffer 18 (a FIFO), as shown in Fig. 3. Data is written to the FIFO sequentially and read out to the memory 8 in the same order by the ARM. The output buffer 18 holds 8 32-bit values.

Piccolo connects to the ARM via the coprocessor interface (the CP control signals of Fig. 1). On execution of an ARM coprocessor instruction, Piccolo can execute the instruction, cause the ARM to wait until Piccolo is ready before executing the instruction, or refuse to execute the instruction. In the last case the ARM will take an undefined instruction exception.

The most common coprocessor instructions that Piccolo executes are LDC and STC, which respectively load and store data words from and to the memory 8 via the data bus, with the ARM generating all addresses. It is these instructions that load data into the reorder buffer and store data from the output buffer 18. Piccolo will stall the ARM on an LDC if there is not enough room in the input reorder buffer to load the data, and on an STC if there is insufficient data in the output buffer to store, i.e. the data the ARM is expecting is not yet in the output buffer 18. Piccolo also executes ARM/coprocessor register transfers to allow the ARM to access Piccolo's special registers.
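The two stall conditions can be written as simple predicates. The sketch below is illustrative only; the function and parameter names are hypothetical and do not correspond to any signal or register in the coprocessor interface.

```python
def ldc_must_stall(rob_free_slots, words_to_load):
    """A load into the reorder buffer stalls when the buffer cannot take all the words."""
    return words_to_load > rob_free_slots


def stc_must_stall(output_words_available, words_to_store):
    """A store from the output buffer stalls when the buffer holds too few result words."""
    return words_to_store > output_words_available
```

In both cases the ARM simply waits, so software on the ARM never observes a partially completed transfer.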

Piccolo fetches its own instructions from memory to control the Piccolo datapath illustrated in Fig. 3 and to transfer data from the reorder buffer to the registers and from the registers to the output buffer 18. The arithmetic logic unit of Piccolo that executes these instructions has a multiplier/adder circuit 20 that performs multiplies, adds, subtracts, multiply-accumulates, logical operations, shifts and rotates. The datapath also includes an accumulate/decumulate circuit 22 and a scale/saturate circuit 24.

The Piccolo instructions are initially loaded from memory into the instruction cache 6, where Piccolo can access them without needing to go back to main memory.

Piccolo cannot recover from memory aborts. Therefore, if Piccolo is used in a virtual memory system, all Piccolo data must be in physical memory throughout the Piccolo task. Given the real time nature of Piccolo tasks, e.g. real time DSP, this is not a significant limitation. If a memory abort occurs, Piccolo will stop and set a flag in a status register S2.

Fig. 3 shows the overall datapath functionality of Piccolo. The register bank 10 uses 3 read ports and 2 write ports. One write port (the L port) is used to refill registers from the reorder buffer. The output buffer 18 is updated directly from the ALU result bus 26, and output from the output buffer 18 is under ARM program control. The ARM coprocessor interface performs LDC (Load Coprocessor) instructions into the reorder buffer, STC (Store Coprocessor) instructions from the output buffer 18, and MCR and MRC (move ARM register to/from CP register) instructions on the register bank 10.

The remaining register ports are used for the ALU. Two read ports (A and B) drive the inputs to the multiplier/adder circuit 20, and the C read port drives the input of the accumulate/decumulate circuit 22. The remaining write port W is used to return results to the register bank 10.

The multiplier 20 performs a 16 x 16 signed or unsigned multiply, with an optional 48 bit accumulate. The scaler unit 24 can provide an immediate arithmetic or logical shift right of 0 to 31 places, followed by an optional saturate. The shifter and logic unit 20 can perform either a shift or a logical operation every cycle.
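The arithmetic of a signed multiply-accumulate followed by scaling can be sketched as follows. This is a behavioural model under stated assumptions, not a description of the circuit: the helper names (`wrap_signed`, `mac16`, `scale`) are hypothetical, only the signed case and the arithmetic shift are modelled, and the accumulator is assumed to wrap modulo 2^48.

```python
ACC_BITS = 48


def wrap_signed(value, bits):
    """Interpret `value` modulo 2**bits as a signed two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def mac16(acc, a, b):
    """Multiply two signed 16-bit values and add the product to a 48-bit accumulator."""
    assert -(1 << 15) <= a < (1 << 15) and -(1 << 15) <= b < (1 << 15)
    return wrap_signed(acc + a * b, ACC_BITS)


def scale(value, shift, saturate=False):
    """Arithmetic shift right by 0..31 places, optionally saturating to signed 32 bits."""
    assert 0 <= shift <= 31
    result = value >> shift  # Python's >> on ints is an arithmetic shift
    if saturate:
        result = max(-(1 << 31), min((1 << 31) - 1, result))
    return result


# Three taps of a filter: acc = 1000*3 + (-2000)*2 + 3000*1
samples = [1000, -2000, 3000]
coeffs = [3, 2, 1]
acc = 0
for x, c in zip(samples, coeffs):
    acc = mac16(acc, x, c)
scaled = scale(acc, 2)
```

The loop is the inner step of a convolution: one multiply-accumulate per sample, with a single scaling step applied to the finished sum.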

Piccolo has 16 general purpose registers, named D0-D15 or A0-A3, X0-X3, Y0-Y3, Z0-Z3. The first four registers (A0-A3) are intended as accumulators and are 48 bits wide, the extra 16 bits providing a guard against overflow during many successive calculations. The remaining registers are 32 bits wide.

Each Piccolo register can be treated as containing two independent 16 bit values. Bits 0 to 15 contain the low half, and bits 16 to 31 contain the high half. An instruction can specify a particular 16 bit half of each register as a source operand, or can specify the entire 32 bit register.
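Selecting a half of a register amounts to a mask and a shift. The sketch below shows this for a 32 bit value; the function names are hypothetical, and treating a half as signed is only one possible interpretation of the 16 bit field.

```python
def low_half(reg):
    """Bits 0..15 of a 32-bit register value."""
    return reg & 0xFFFF


def high_half(reg):
    """Bits 16..31 of a 32-bit register value."""
    return (reg >> 16) & 0xFFFF


def as_signed16(half):
    """Reinterpret a 16-bit field as a signed two's-complement value."""
    return half - 0x10000 if half & 0x8000 else half


reg = 0x80017FFF
lo = low_half(reg)    # 0x7FFF
hi = high_half(reg)   # 0x8001
```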

Piccolo also provides saturated arithmetic. Variants of the multiply, add and subtract instructions provide a saturated result if the result is greater than the size of the destination register. Where the destination register is a 48 bit accumulator, the value is saturated to 32 bits (i.e. there is no way to saturate a 48 bit value). There is no overflow detection on 48 bit registers. This is a reasonable restriction, since it would take at least 65536 multiply-accumulate instructions to cause an overflow.
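The figure of 65536 can be checked with a short calculation. The reading assumed here is that each 16 x 16 product fits in 32 bits, so the 16 guard bits of a 48 bit accumulator leave room for 2^16 worst-case products; that interpretation, and the helper name `saturate32`, are assumptions made for illustration.

```python
ACC_BITS = 48

max_unsigned_product = (2**16 - 1) ** 2   # largest 16 x 16 unsigned product
max_signed_product = (-(2**15)) ** 2      # largest-magnitude 16 x 16 signed product

# Sum of 65536 worst-case products, compared with the accumulator range.
unsigned_total = 65536 * max_unsigned_product
signed_total = 65536 * max_signed_product

fits_unsigned = unsigned_total < 2**ACC_BITS
fits_signed = signed_total < 2 ** (ACC_BITS - 1)


def saturate32(value):
    """Clamp a wider result to the signed 32-bit range."""
    return max(-(2**31), min(2**31 - 1, value))
```

Both totals stay inside the accumulator range, so no sequence of 65536 or fewer multiply-accumulates starting from zero can overflow under this reading.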

Each Piccolo register is either marked as "empty" (the E flag, see Fig. 2) or contains a value (it is not possible to have half of a register empty). Initially, all registers are marked as empty. On each cycle Piccolo attempts, using the refill control circuit 16, to fill one of the empty registers with a value from the input reorder buffer. Alternatively, if a register is written with a value from the ALU it is no longer marked as "empty". If a register is written from the ALU while a value is waiting in the reorder buffer to be placed in that register, the result is undefined. Piccolo's execution unit will stall if a read is made to an empty register.

The input reorder buffer (ROB) sits between the coprocessor interface and Piccolo's register bank. Data is loaded into the ROB by ARM coprocessor transfers. The ROB contains a number of 32-bit values, each with a tag indicating the Piccolo register for which the value is destined. The tag also indicates whether the data should be transferred to a whole 32-bit register or only to the bottom 16 bits of a 32-bit register. If the data is destined for a whole register, the bottom 16 bits of the entry are transferred to the bottom half of the target register and the top 16 bits are transferred to the top half of the register (sign extended if the target register is a 48-bit accumulator). If the data is destined for only the bottom half of a register (a so-called "half register"), the bottom 16 bits are transferred first.

The register tag always refers to a physical destination register; no register remapping is performed (register remapping is described below).

On every cycle Piccolo attempts to transfer a data entry from the ROB to the register bank as follows:

- Each entry in the ROB is examined and its tag is compared with the registers that are empty, to determine whether a transfer can be made from part or all of the entry to a register.

- From the set of entries that can make a transfer, the oldest entry is selected and its data is transferred to the register bank.

- The tag of this entry is updated to mark the entry as empty. If only part of the entry was transferred, only the part transferred is marked empty.

For example, if the target register is completely empty and the selected ROB entry contains data destined for a whole register, the whole 32 bits are transferred and the entry is marked empty. If the bottom half of the target register is empty and the ROB entry contains data destined for the bottom half of a register, the bottom 16 bits of the ROB entry are transferred to the bottom half of the target register and the bottom half of the ROB entry is marked empty.

The high and low 16 bits of data in any entry can be transferred independently. If no entry contains data that can be transferred to the register bank, no transfer is made in that cycle. The following table describes all possible combinations of target ROB entry status and target register status.

Target ROB entry status	Target Rn empty	Target Rn low half empty	Target Rn high half empty
Full register, both halves valid	Rn.h <- entry.h; Rn.l <- entry.l; entry marked empty	Rn.l <- entry.l; entry.l marked empty	Rn.h <- entry.h; entry.h marked empty
Full register, high half valid	Rn.h <- entry.h; entry marked empty	-	Rn.h <- entry.h; entry marked empty
Full register, low half valid	Rn.l <- entry.l; entry marked empty	Rn.l <- entry.l; entry marked empty	-
Half register, both halves valid	Rn.l <- entry.l; entry.l marked empty	Rn.l <- entry.l; entry.l marked empty	-
Half register, high half valid	Rn.l <- entry.h; entry marked empty	Rn.l <- entry.h; entry marked empty	-

To summarise, the two halves of a register can be refilled independently from the ROB. Data in the ROB is marked either as destined for a whole register or as two 16-bit values destined for the bottom half of a register.
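The combinations of ROB entry status and register status can be expressed as one small function. This is a sketch under assumptions: the function name `transfer` and its boolean parameters are hypothetical, and a whole-register entry is assumed to move its high half into Rn.h and its low half into Rn.l.

```python
def transfer(kind, lo_valid, hi_valid, lo_empty, hi_empty):
    """Return (moves, lo_valid, hi_valid) after one transfer attempt.

    kind     -- "full" (entry targets the whole register) or "half" (targets Rn.l only)
    *_valid  -- which 16-bit halves of the ROB entry still hold data
    *_empty  -- which halves of the destination register Rn are empty
    """
    moves = []
    if kind == "full":
        if hi_valid and hi_empty:
            moves.append("Rn.h <- entry.h")
            hi_valid = False
        if lo_valid and lo_empty:
            moves.append("Rn.l <- entry.l")
            lo_valid = False
    elif lo_empty:  # half-register entry: both halves are destined for Rn.l
        if lo_valid:
            moves.append("Rn.l <- entry.l")  # the bottom 16 bits go first
            lo_valid = False
        elif hi_valid:
            moves.append("Rn.l <- entry.h")
            hi_valid = False
    return moves, lo_valid, hi_valid
```

An entry is fully consumed, and can be marked empty as a whole, once both returned valid flags are False.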

Data is loaded into the ROB using ARM coprocessor instructions. How the data is marked in the ROB depends on which ARM coprocessor instruction was used to perform the transfer. The following ARM instructions are available for filling the ROB with data:

LDP{<cond>}<16/32> <dest>, [Rn]{!}, #<size>

LDP{<cond>}<16/32>W <dest>, <wrap>, [Rn]{!}, #<size>

LDP{<cond>}16U <bank>, [Rn]{!}

MPR{<cond>} <dest>, Rn

MRP{<cond>} <dest>, Rn

The following ARM instruction is provided for configuring the ROB:

LDPA <bank list>

The first three are assembled as LDCs, MPR and MRP are assembled as MCRs, and LDPA is assembled as a CDP instruction.

In the above, <dest> stands for a Piccolo register (A0-Z3), Rn for an ARM register, <size> for a constant number of bytes which must be a non-zero multiple of 4, and <wrap> for a constant (1, 2, 4, 8). Fields enclosed in {} are optional. For a transfer to fit into the reorder buffer, <size> must be at most 32. In many circumstances <size> will be smaller than this limit to avoid deadlock. The <16/32> field indicates whether the data being loaded should be treated as 16-bit data, with endian specific action taken (see below), or as 32-bit data.

Note 1: In the following text, a reference to LDP or LDPW covers both the 16-bit and 32-bit variants of the instruction.

Note 2: A 'word' is a 32-bit chunk from memory, which may consist of two 16-bit data items or one 32-bit data item.

The LDP instruction transfers a number of data items, marking them as destined for a full register. The instruction loads <size>/4 words from address Rn in memory and inserts them into the ROB. The number of words that can be transferred is limited as follows:

- The quantity <size> must be a non-zero multiple of 4;

- <size> must be less than or equal to the size of the ROB for the particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions).

The first data item transferred is tagged as destined for <dest>, the second as destined for <dest>+1, and so on (wrapping from Z3 to A0). If the ! is specified, the register Rn is incremented by <size> afterwards.
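The tagging rule for an LDP transfer can be sketched as follows. The function name `ldp_tags` is hypothetical, and the register order A0-A3, X0-X3, Y0-Y3, Z0-Z3 is taken from the register naming given earlier in the description.

```python
# Register order assumed for tagging: A0..A3, X0..X3, Y0..Y3, Z0..Z3
REGISTERS = [f"{bank}{i}" for bank in "AXYZ" for i in range(4)]


def ldp_tags(dest, size):
    """Destination tags for an LDP-style transfer of `size` bytes starting at `dest`."""
    if size <= 0 or size % 4 != 0 or size > 32:
        raise ValueError("size must be a non-zero multiple of 4, at most 32")
    start = REGISTERS.index(dest)
    # One tag per 32-bit word; the register number wraps from Z3 back to A0.
    return [REGISTERS[(start + i) % len(REGISTERS)] for i in range(size // 4)]
```

For instance, `ldp_tags("Z3", 12)` tags three words for Z3, A0 and A1 in that order.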

If the LDP16 variant is used, endian specific action is performed on the two 16-bit halfwords forming each 32-bit data item as they are returned from the memory system. Big endian and little endian support is described in more detail below.

The LDPW instruction transfers a number of data items to a set of registers. The first data item transferred is tagged as destined for <dest>, the next for <dest>+1, and so on. When <wrap> transfers have occurred, the next item transferred is tagged as destined for <dest>, and so on. The <wrap> quantity is specified in halfwords.

The following restrictions apply to LDPW:

- The quantity <size> must be a non-zero multiple of 4;

- <size> must be less than or equal to the size of the ROB for the particular implementation (8 words in the first version, and guaranteed to be no less than this in future versions);

- <dest> may be one of {A0, X0, Y0, Z0};

- <wrap> may be one of {2, 4, 8} halfwords for LDP32W and one of {1, 2, 4, 8} halfwords for LDP16W;

- The quantity <size> must be greater than 2*<wrap>, otherwise no wrapping occurs and the LDP instruction should be used instead.

For example, the instruction

LDP32W X0, 2, [R0]!, #8

loads two words into the ROB, marking them as destined for the full register X0. R0 is incremented by 8. The instruction

LDP32W X0, 4, [R0], #16

loads four words into the ROB, marking them as destined for X0, X1, X0, X1 (in that order). R0 is not affected.
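The wrapping behaviour of LDPW can be sketched as follows. This is a hedged illustration: the function name `ldpw_tags`, the register ordering and the reading that a wrap of N halfwords covers N/2 full registers are assumptions chosen to reproduce the examples in the text.

```python
# Assumed register order: A0..A3, X0..X3, Y0..Y3, Z0..Z3.
REGISTERS = [f"{bank}{i}" for bank in "AXYZ" for i in range(4)]

def ldpw_tags(dest: str, wrap: int, size: int, bits: int = 32) -> list:
    """Return the destination tag of each word loaded by LDP32W/LDP16W."""
    if dest not in ("A0", "X0", "Y0", "Z0"):
        raise ValueError("<dest> must be one of A0, X0, Y0, Z0")
    allowed = (2, 4, 8) if bits == 32 else (1, 2, 4, 8)
    if wrap not in allowed:
        raise ValueError("illegal <wrap> for this variant")
    if size <= 0 or size % 4 != 0 or size > 32:
        raise ValueError("<size> must be a non-zero multiple of 4, at most 32")
    if size <= 2 * wrap:
        raise ValueError("no wrapping would occur; use LDP instead")
    words = size // 4
    if wrap == 1:
        # 'Half register' case: every item targets the bottom half of dest.
        return [dest + ".l"] * words
    regs_per_wrap = wrap // 2  # assumption: one full register per two halfwords
    start = REGISTERS.index(dest)
    return [REGISTERS[start + (i % regs_per_wrap)] for i in range(words)]
```

With these assumptions, `ldpw_tags("X0", 4, 16)` gives X0, X1, X0, X1, matching the second example.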

For LDP16W, <wrap> may be specified as 1, 2, 4 or 8. A wrap of 1 causes all data to be tagged as destined for the bottom half of the destination register, <dest>.l. This is the 'half register' case.

For example, the instruction

LDP16W X0, 1, [R0]!, #8

loads two words into the ROB, marking them as 16-bit data destined for X0.l. R0 is incremented by 8. The instruction

LDP16W X0, 4, [R0], #16

behaves in a similar way to the LDP32W examples, except that an endian-specific operation may be performed on the data as it is returned from memory.

All unused encodings of the LDP instruction are reserved for future expansion.

The LDP16U instruction is provided to support the efficient transfer of 16-bit data that is not word aligned. LDP16U support is provided for registers D4 to D15 (the X, Y and Z banks). The LDP16U instruction transfers one 32-bit word of data (containing two 16-bit data items) from memory into Piccolo. Piccolo discards the bottom 16 bits of this data and stores the top 16 bits in a holding register. There is one holding register for each of the X, Y and Z banks. Once the holding register of a bank has been primed, the behaviour of LDP{W} instructions is modified if the data is destined for a register in that bank. The data loaded into the ROB is formed by concatenating the holding register with the bottom 16 bits of the data being transferred by the LDP instruction. The upper 16 bits of the data being transferred are placed in the holding register:

entry <- data.l | holding_register

holding_register <- data.h

This mode of operation persists until it is turned off by an LDPA instruction. The holding register does not record the destination register tag or size. These characteristics are obtained from the instruction that provides the next value of data.l.

Endian-specific behaviour may always occur on the data returned by the memory system. There is no non-16-bit equivalent of LDP16U, since all 32-bit data items are assumed to be word aligned in memory.
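The holding register mechanism can be sketched in Python as below. The class and method names are hypothetical, and the sketch assumes that data.l occupies the upper half of the ROB entry while the held value occupies the lower half; endian handling is deliberately left out.

```python
class UnalignedBank:
    """Toy model of one bank's holding register (illustrative only)."""

    def __init__(self):
        self.holding = None  # None means the bank is in aligned mode

    def ldp16u(self, word: int) -> None:
        # Discard the bottom 16 bits and keep the top 16 bits.
        self.holding = (word >> 16) & 0xFFFF

    def ldp(self, word: int) -> int:
        """Return the 32-bit entry that would be placed in the ROB."""
        if self.holding is None:
            return word & 0xFFFFFFFF
        # entry <- data.l | holding_register
        entry = ((word & 0xFFFF) << 16) | self.holding
        # holding_register <- data.h
        self.holding = (word >> 16) & 0xFFFF
        return entry

    def ldpa(self) -> None:
        # Leave unaligned mode; held data is discarded.
        self.holding = None
```

A short usage example: after `ldp16u(0x1111AAAA)`, a following `ldp(0x33332222)` would produce the entry 0x22221111 and leave 0x3333 in the holding register.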

The LDPA instruction is used to turn off the unaligned mode of operation started by an LDP16U instruction. The unaligned mode can be turned off independently on banks X, Y and Z. For example, the instruction

LDPA {X, Y}

turns off the unaligned mode on banks X and Y. Data in the holding registers of these banks is discarded.

Executing LDPA on a bank that is not in unaligned mode is allowed, and leaves that bank in aligned mode.

The MPR instruction places the contents of ARM register Rn into the ROB, destined for Piccolo register <dest>. The destination register <dest> may be any full register in the range A0-Z3. For example, the instruction

MPR X0, R3

transfers the contents of R3 into the ROB, marking the data as destined for the full register X0.

No endian-specific behaviour occurs on the data as it is transferred from the ARM to Piccolo, because the ARM is internally little-endian.

The MPRW instruction places the contents of ARM register Rn into the ROB, marking it as two 16-bit data items destined for the 16-bit Piccolo register <dest>.l. The restrictions on <dest> are the same as for the LDPW instruction (that is, A0, X0, Y0, Z0). For example, the instruction

MPRW X0, R3

transfers the contents of R3 into the ROB, marking the data as two 16-bit quantities destined for X0.l. Note that, as for LDP16W with a wrap of 1, only the bottom half of a 32-bit register can be targeted.

As with MPR, no endian-specific operation is applied to the data.

LDP is encoded as follows (bits 31 to 0, most significant field first):

COND	110	P	U	N	W	1	Rn	DEST	PICCOLO1	SIZE/4

where PICCOLO1 is Piccolo's first coprocessor number (currently 8). The N bit selects between LDP32 (1) and LDP16 (0).
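As an illustration of the field order given for LDP, the following Python sketch packs the fields into a 32-bit word. The bit positions are an assumption: they follow the standard ARM coprocessor load (LDC) layout, which this field order appears to mirror. The register number used for DEST (X0 = 4) and the function name `encode_ldp` are also assumptions.

```python
PICCOLO1 = 8  # first Piccolo coprocessor number, per the text

def encode_ldp(cond: int, p: int, u: int, n: int, w: int,
               rn: int, dest: int, size: int) -> int:
    """Pack LDP fields into a word, assuming the ARM LDC bit layout."""
    if size <= 0 or size % 4 != 0 or size > 32:
        raise ValueError("<size> must be a non-zero multiple of 4, at most 32")
    return (cond << 28 | 0b110 << 25 | p << 24 | u << 23 | n << 22
            | w << 21 | 1 << 20 | rn << 16 | dest << 12
            | PICCOLO1 << 8 | size // 4)
```

Under these assumptions, an always-executed LDP32 of 8 bytes to register number 4 with base R0 (U, N and W set) would pack to 0xECF04802.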

LDPW is encoded as follows (bits 31 to 0, most significant field first):

COND	110	P	U	N	W	1	Rn	DEST	WRAP	PICCOLO2	SIZE/4

where DEST is 0-3 for destination registers A0, X0, Y0, Z0 and WRAP is 0-3 for wrap values 1, 2, 4, 8. PICCOLO2 is Piccolo's second coprocessor number (currently 9). The N bit selects between LDP32 (1) and LDP16 (0).

LDP16U is encoded as follows (bits 31 to 0, most significant field first):

COND	110	P	U	0	W	1	Rn	DEST	01	PICCOLO2	00000001

where DEST is 1-3 for destination banks X, Y, Z.

LDPA is encoded as follows (bits 31 to 0, most significant field first):

COND	1110	0000	PICCOLO1	000	0	BANK

where BANK[3:0] is used to turn off the unaligned mode on a per-bank basis. If BANK[1] is set, the unaligned mode on bank X is turned off. BANK[2] and BANK[3], if set, turn off the unaligned mode on banks Y and Z respectively. Note that this is a CDP operation.

MPR is encoded as follows (bits 31 to 0, most significant field first):

COND	1110	0	1	0	DEST	Rn	PICCOLO1	000	1	0000

MPRW is encoded as follows (bits 31 to 0, most significant field first):

COND	1110	0	1	0	DEST	00	Rn	PICCOLO2	000	1	0000

where DEST is 1-3 for destination registers X0, Y0, Z0.

The output FIFO can hold up to eight 32-bit values. They are transferred from Piccolo using one of the following (ARM) opcodes:

STP{<cond>}<16/32> [Rn]{!}, #<size>

MRP Rn

The first saves <size>/4 words from the output FIFO to the address given by ARM register Rn, indexing Rn if ! is present. To prevent deadlock, <size> must not be greater than the size of the output FIFO (8 entries in this implementation). If the STP16 variant is used, endian-specific behaviour may occur on the data returned from the memory system.

The MRP instruction removes one word from the output FIFO and places it in ARM register Rn. As with MPR, no endian-specific operation is applied to the data.

The ARM encoding of STP is (bits 31 to 0, most significant field first):

COND	110	P	U	N	W	0	Rn	0000	PICCOLO1	SIZE/4

where N selects between STP32 (1) and STP16 (0). For the definition of the P, U and W bits, refer to an ARM data sheet.

The ARM encoding of MRP is (bits 31 to 0, most significant field first):

COND	1110	0	1	0	1	0000	Rn	PICCOLO1	000	1	0000

The Piccolo instruction set assumes little-endian operation internally. For example, when a 32-bit register is accessed as two 16-bit halves, the lower half is assumed to occupy bits 15 to 0. Piccolo may be operating in a system with big-endian memory or peripherals, so care must be taken to load 16-bit packed data in the correct manner.

Piccolo, like the ARM (for example the ARM7 microprocessor produced by Advanced RISC Machines Limited of Cambridge, United Kingdom), has a 'BIGEND' configuration pin which the programmer can control, perhaps by means of a programmable peripheral. Piccolo uses this pin to configure the input reorder buffer and the output FIFO.

When the ARM loads packed 16-bit data into the reorder buffer, it must indicate this by using the 16-bit form of the LDP instruction. This information is combined with the state of the 'BIGEND' configuration input to place data into the holding latches and the reorder buffer in the appropriate order. In particular, in big-endian mode the holding register stores the bottom 16 bits of the loaded word, and these are paired with the top 16 bits of the next load. The contents of the holding register always end up in the bottom 16 bits of the word transferred into the reorder buffer.

The output FIFO may contain either packed 16-bit data or 32-bit data. The programmer must use the correct form of the STP instruction so that Piccolo can ensure that 16-bit data is presented on the correct halves of the data bus. When Piccolo is configured as big-endian, the top and bottom 16-bit halves are swapped when the 16-bit form of STP is used.
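The halfword swap applied on the store path can be sketched as follows. This is a minimal model under stated assumptions: the function name `stp_bus_word` and its boolean parameters are hypothetical, and only the rule "swap the halves for 16-bit STP when configured big-endian" is taken from the text.

```python
def stp_bus_word(value: int, is_16bit: bool, bigend: bool) -> int:
    """Return the word driven onto the data bus for one FIFO entry."""
    value &= 0xFFFFFFFF
    if is_16bit and bigend:
        # Swap the top and bottom 16-bit halves for packed 16-bit data.
        return ((value & 0xFFFF) << 16) | (value >> 16)
    return value
```

In this model 32-bit entries pass through unchanged in either configuration, which is why the programmer has to choose the matching form of STP.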

Piccolo has 4 private registers that can be accessed only from the ARM. They are called S0-S2. They can be accessed only with MRC and MCR instructions. The opcodes are:

MPSR Sn, Rm

MRPS Rm, Sn

These opcodes transfer a 32-bit value between ARM register Rm and private register Sn. They are encoded in the ARM as a coprocessor register transfer (bits 31 to 0, most significant field first):

COND	1110	001	L	Sn	Rm	PICCOLO	000	1	0000

where L is 0 for MPSR and 1 for MRPS.

Register S0 contains the Piccolo unique ID and revision code:

[31:24] Implementor	[23:16] Architecture	[15:4] Part number	[3:0] Revision

Bits [3:0] contain the revision number of the processor.

Bits [15:4] contain a 3-digit part number in binary-coded decimal format: 0x500 for Piccolo.

Bits [23:16] contain the architecture version: 0x00 = version 1.

Bits [31:24] contain the ASCII code of the implementor's trademark: 0x41 = A = ARM Ltd.
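A reader on the ARM side could split the ID register into its fields as in the sketch below. The function name `decode_id` and the sample value used in the usage note are illustrative assumptions; the bit ranges are those listed above.

```python
def decode_id(s0: int) -> dict:
    """Split the S0 ID register into its four fields."""
    return {
        "implementor": chr((s0 >> 24) & 0xFF),  # ASCII trademark code
        "architecture": (s0 >> 16) & 0xFF,
        "part_number": (s0 >> 4) & 0xFFF,       # 3 BCD digits
        "revision": s0 & 0xF,
    }
```

For a hypothetical register value 0x41005001 this yields implementor 'A', architecture 0, part number 0x500 and revision 1.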

Register S1 is the Piccolo status register. From bit 31 downwards its fields are:

N	Z	C	V	SN	SZ	SC	SV	Reserved	D	A	H	B	U	E

Primary condition code flags (N, Z, C, V).

Secondary condition code flags (SN, SZ, SC, SV).

E bit: Piccolo has been disabled by the ARM and has halted.

U bit: Piccolo encountered an undefined instruction and has halted.

B bit: Piccolo encountered a breakpoint and has halted.

H bit: Piccolo encountered a halt instruction and has halted.

A bit: Piccolo suffered a memory abort (load, store or Piccolo instruction) and has halted.

D bit: Piccolo has detected a deadlock condition and has halted (see below).

Register S2 is the Piccolo program counter:

Program counter	0

Writing to the program counter starts Piccolo executing a program at that address (leaving the halted state if it was halted). On reset the program counter is undefined, since Piccolo is always started by writing to the program counter.

During execution, Piccolo monitors the execution of instructions and the state of the coprocessor interface. It checks for the following two conditions:

-Piccolo has stalled, waiting either for a register to be refilled or for the output FIFO to have a free entry;

-the coprocessor interface is busy-waiting, because there is insufficient space in the ROB or there are insufficient items in the output FIFO.

If both conditions are detected, Piccolo sets the D bit in its status register, halts, and rejects the ARM coprocessor instruction, causing the ARM to take the undefined instruction trap.

Detecting the deadlock condition allows a system to be built that can at least warn the programmer that the condition has occurred and report the exact point of failure, by reading the ARM and Piccolo program counters and registers. It should be stressed that deadlock can arise only from an incorrect program, or from another part of the system corrupting Piccolo's state. Deadlock cannot be caused by data starvation or 'overload'.
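The deadlock rule and the stop-reason bits of the status register can be sketched as follows. The bit positions (E in bit 0 up to D in bit 5) are an assumption read off the order in which the status register fields are listed, and the function names are hypothetical.

```python
# Assumed positions of the low status bits: E=0, U=1, B=2, H=3, A=4, D=5.
E_BIT, U_BIT, B_BIT, H_BIT, A_BIT, D_BIT = (1 << i for i in range(6))

def check_deadlock(status: int, piccolo_stalled: bool,
                   interface_busy_waiting: bool):
    """Return (new_status, halted) after applying the deadlock rule."""
    # Deadlock is flagged only when BOTH conditions hold at once.
    if piccolo_stalled and interface_busy_waiting:
        return status | D_BIT, True
    return status, False

def stop_reasons(status: int) -> list:
    """List the reasons for halting recorded in the status word."""
    names = {E_BIT: "disabled", U_BIT: "undefined instruction",
             B_BIT: "breakpoint", H_BIT: "halt instruction",
             A_BIT: "memory abort", D_BIT: "deadlock"}
    return [name for bit, name in names.items() if status & bit]
```

A debugger built on this idea would read the status register after an undefined instruction trap and report, for example, "deadlock" together with both program counters.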

Several operations are available for controlling Piccolo from the ARM; they are provided by CDP instructions. These CDP instructions are accepted only when the ARM is in a privileged mode. If it is not, Piccolo rejects the CDP instruction and the ARM takes the undefined instruction trap. The following operations are available:

-Reset

-Enter state access mode

-Enable

-Disable

Piccolo can be reset in software with the PRESET instruction.

PRESET ; Clear Piccolo's state

This instruction is encoded as (bits 31 to 0, most significant field first):

COND	1110	0000	PICCOLO1	000	0	0000

When this instruction is executed, the following occurs:

-all registers are marked as empty (ready for refill);

-the input ROB is cleared;

-the output FIFO is cleared;

-the loop counters are reset;

-Piccolo is put into the halted state (and the H bit of S2 is set).

Executing the PRESET instruction may take several cycles to complete (2-3 for this embodiment). While it is executing, subsequent ARM coprocessor instructions to be executed on Piccolo are busy-waited.

In state access mode, the state of Piccolo can be saved and restored using STC and LDC instructions (see below regarding access to Piccolo state from the ARM). To enter state access mode, the PSTATE instruction must first be executed:

PSTATE ; Enter state access mode

This instruction is encoded as (bits 31 to 0, most significant field first):

COND	1110	0001	0000	PICCOLO1	000	0	0000

When executed, the PSTATE instruction will:

-halt Piccolo (if it is not already halted), setting the E bit in Piccolo's status register;

-configure Piccolo into its state access mode.

Executing the PSTATE instruction may take several cycles to complete, because Piccolo's instruction pipeline must drain before it can halt. While it is executing, subsequent ARM coprocessor instructions to be executed on Piccolo are busy-waited.

The PENABLE and PDISABLE instructions are used for fast context switching. When Piccolo is disabled, only private registers 0 and 1 (the ID and status registers) are accessible, and then only from a privileged mode. Access to any other state, or any access from user mode, causes an ARM undefined instruction exception. Disabling Piccolo causes it to halt execution. When Piccolo has halted execution, it acknowledges the fact by setting the E bit in the status register.

Piccolo is enabled by executing the PENABLE instruction:

PENABLE ; Enable Piccolo

This instruction is encoded as (bits 31 to 0, most significant field first):

COND	1110	0010	0000	PICCOLO1	000	0	0000

Piccolo is disabled by executing the PDISABLE instruction:

PDISABLE ; Disable Piccolo

This instruction is encoded as (bits 31 to 0, most significant field first):

COND	1110	0011	0000	PICCOLO1	000	0	0000

When this instruction is executed, the following occurs:

-Piccolo's instruction pipeline drains;

-Piccolo halts and the H bit in the status register is set.

The Piccolo instruction cache holds the Piccolo instructions that control the Piccolo data path. If present, it is guaranteed to hold at least 64 instructions, starting on a 16-word boundary. The following ARM opcode assembles into an MCR. Its action is to force the cache to fetch a line of (16) instructions starting at the specified address (which must be on a 16-word boundary). The fetch takes place even if the cache already holds data related to this address.

PMIR Rm

Piccolo must be halted before a PMIR is executed.

The MCR encoding of this opcode is (bits 31 to 0, most significant field first):

COND	1110	011	L	0000	Rm	PICCOLO1	000	1	0000

This section discusses the Piccolo instruction set, which controls the Piccolo data path. Each instruction is 32 bits long. Instructions are read from the Piccolo instruction cache.

Decoding the instruction set is quite straightforward. The top 6 bits (26 to 31) give a major opcode, and bits 22 to 25 provide a minor opcode for a few specific instructions. Bits shaded grey are currently unused and reserved for expansion (at present they must contain the indicated value).

There are eleven major instruction classes. These do not correspond exactly to the major opcode field of the instruction, in order to ease the decoding of some sub-classes. The encodings are as follows (bits 31 to 0, most significant field first):

0	OPC	F	SD	DEST	S1	R1	SRC1	SRC2
1	000	OPC	F	SD	DEST	S1	R1	SRC1	SRC2
1	001	0	OP	F	SD	DEST	S1	R1	SRC1	SRC2
1	0011
1	010	OPC	F	SD	DEST	S1	R1	SRC1	SRC2_SHIFT
1	011	00	F	SD	DEST	S1	R1	SRC1	SRC2_SEL	COND
1	011	01
1	011	1	OP	F	SD	DEST	S1	R1	SRC1	SRC2_SEL	COND
1	10	0	OP	Sa	F	SD	DEST	A1	R1	SRC1	SRC2_MULA
1	10	1	0
1	10	1	OP	1	F	SD	DEST	A1	R1	SRC1	0	A0	R2	SRC2_REG	SCALE
1	110
1	11100	F	SD	DEST	IMMEDIATE_15	R2
1	11101
1	11110	0	RFIELD_4	0	R1	SRC1	#INSTRUCTIONS_8
1	11110	1	RFIELD_4	#LOOPS_13	#INSTRUCTIONS_8
1	11111	0	OPC	REGISTER_LIST_16	SCALE
1	11111	100	IMMEDIATE_16	COND
1	11111	101	PARAMETERS_21
1	11111	11	OP

The instructions in the above table have the following names:

Standard data operation

Logical operation

Conditional add/subtract

Undefined

Shift

Select

Undefined

Parallel select

Multiply accumulate

Undefined

Multiply double

Undefined

Move signed immediate

Undefined

Repeat

Register list operations

Branch

Renaming parameter move

Halt/break

The format of each class of instruction is described in detail in the following sections. The source and destination operand fields are common to most instructions and are described in detail in separate sections, as is register remapping.

Most instructions require two source operands, source 1 and source 2. Some exceptions are the saturating absolute value instructions.

The source 1 (SRC1) operand has the following 7-bit format (bits 18 to 12):

Size	Refill	Register number	Hi/Lo

The elements of the field have the following meanings:

-Size: indicates the size of the operand to be read (1 = 32-bit, 0 = 16-bit).

-Refill: specifies that, after being read, the register should be marked as empty and refilled from the ROB.

-Register number: encodes which of the 16 32-bit registers is to be read.

-Hi/Lo: for 16-bit reads, indicates which half of the 32-bit register is read. For 32-bit operands, when set it indicates that the two 16-bit halves should be swapped.

Size	Hi/Lo	Portion of register accessed
0	0	Low 16 bits
0	1	High 16 bits
1	0	Full 32 bits
1	1	Full 32 bits, halves swapped

The register size is specified in the assembler by adding a suffix to the register number: .l for the low 16 bits, .h for the high 16 bits, or .x for 32 bits with the upper and lower 16 bits interchanged.

The general source 2 (SRC2) operand has one of the following three 12-bit formats (bits 11 to 0):

0	S2	R2	Register number	Hi/Lo	Scale
1	0	ROT	IMMED_8
1	1	IMMED_6	Scale

Fig. 4 illustrates a multiplexer arrangement which, in response to the Hi/Lo bit and the Size bit, switches the appropriate half of the selected register onto the Piccolo data path. If the Size bit indicates 16 bits, a sign-extension circuit pads the high-order bits of the data path with 0s or 1s as appropriate.
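The selection and sign extension performed on a source operand can be modelled in a few lines of Python. This is a sketch only: the function name `read_operand` is hypothetical, and the result is returned as an unsigned 32-bit pattern as it would appear on the data path.

```python
def read_operand(reg32: int, size: int, hi_lo: int) -> int:
    """Return the 32-bit value placed on the data path for a source read."""
    reg32 &= 0xFFFFFFFF
    low, high = reg32 & 0xFFFF, reg32 >> 16
    if size == 0:
        # 16-bit read: pick a half, then sign-extend to 32 bits.
        half = high if hi_lo else low
        return half | 0xFFFF0000 if half & 0x8000 else half
    if hi_lo:
        # 32-bit read with the two halves swapped (.x suffix).
        return (low << 16) | high
    return reg32
```

For a register holding 0x80017FFF, the .l read gives 0x00007FFF, the .h read gives 0xFFFF8001 and the .x read gives 0x7FFF8001.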

The first encoding specifies the source as a register, and its fields have the same encoding as the SRC1 specifier. The SCALE field specifies a scale to be applied to the result of the ALU.

SCALE (bits 3 2 1 0)	Action
0 0 0 0	ASR #0
0 0 0 1	ASR #1
0 0 1 0	ASR #2
0 0 1 1	ASR #3
0 1 0 0	ASR #4
0 1 0 1	Reserved
0 1 1 0	ASR #6
0 1 1 1	ASL #1
1 0 0 0	ASR #8
1 0 0 1	ASR #16
1 0 1 0	ASR #10
1 0 1 1	Reserved
1 1 0 0	ASR #12
1 1 0 1	ASR #13
1 1 1 0	ASR #14
1 1 1 1	ASR #15
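The scale table lends itself to a lookup, sketched below in Python. The names `SCALE_TABLE` and `apply_scale` are illustrative; Python integers are unbounded, so the sketch shows only the shift itself and not any later truncation or saturation of the result.

```python
# Scale code -> (operation, shift amount); reserved codes are absent.
SCALE_TABLE = {
    0b0000: ("ASR", 0),  0b0001: ("ASR", 1),  0b0010: ("ASR", 2),
    0b0011: ("ASR", 3),  0b0100: ("ASR", 4),  0b0110: ("ASR", 6),
    0b0111: ("ASL", 1),  0b1000: ("ASR", 8),  0b1001: ("ASR", 16),
    0b1010: ("ASR", 10), 0b1100: ("ASR", 12), 0b1101: ("ASR", 13),
    0b1110: ("ASR", 14), 0b1111: ("ASR", 15),
}

def apply_scale(value: int, code: int) -> int:
    """Apply the scale selected by a 4-bit code to a signed ALU result."""
    if code not in SCALE_TABLE:
        raise ValueError("reserved scale encoding")
    op, amount = SCALE_TABLE[code]
    # Python's >> on a negative int is an arithmetic shift.
    return value >> amount if op == "ASR" else value << amount
```

The table holds fourteen entries, one for each defined scale type.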

The 8-bit immediate with rotate encoding allows the generation of any 32-bit immediate that can be expressed as an 8-bit value and a 2-bit rotate. The following table shows the immediate values that can be generated from the 8-bit value XY:

ROT	Immediate
00	0x000000XY
01	0x0000XY00
10	0x00XY0000
11	0xXY000000
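The rotate table amounts to placing the 8-bit value in one of the four byte lanes of the word, as this small Python sketch shows (the function name `rotated_immediate` is hypothetical).

```python
def rotated_immediate(imm8: int, rot: int) -> int:
    """Expand an 8-bit immediate and 2-bit rotate into a 32-bit value."""
    if not 0 <= imm8 <= 0xFF or not 0 <= rot <= 3:
        raise ValueError("imm8 must fit in 8 bits and rot in 2 bits")
    # ROT selects the byte lane: 00 -> bits 7:0, ..., 11 -> bits 31:24.
    return imm8 << (8 * rot)
```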

The 6-bit immediate encoding allows the use of a 6-bit unsigned immediate (in the range 0 to 63), together with a scale applied to the output of the ALU.

The general source 2 encoding is common to most instruction variants. There are some exceptions to this rule, which support a limited subset of the source 2 encoding or modify it slightly:

-select instructions;

-shift instructions;

-parallel operations;

-multiply accumulate instructions;

-multiply double instructions.

Select instructions support only an operand that is either a register or a 6-bit unsigned immediate. The scale is not available, because these bits are used by the condition field of the instruction. The format is (bits 11 to 0):

0	S2	R2	Register number	Hi/Lo	COND
1	1	IMMED_6	COND

SRC2_SEL

Shift instructions support only an operand that is either a 16-bit register or a 5-bit unsigned immediate between 1 and 31. No scale of the result is available. The format is (bits 11 to 0):

0	0	R2	Register number	Hi/Lo	0 0 0 0
1	0 0 0 0 0 0	IMMED_5

SRC2_SHIFT

In the case of parallel operations, if a register is specified as the source of the operand, a 32-bit read must be performed. The immediate encoding has a slightly different meaning for parallel operations: it allows an immediate to be duplicated into both 16-bit halves of a 32-bit operand. A slightly restricted range of scales is available for parallel operations. The format is (bits 11 to 0):

0	1	R2	Register number	Hi/Lo	SCALE_PAR
1	0	ROT	IMMED_8
1	1	IMMED_6	SCALE_PAR

SRC2_PARALLEL

If the 6-bit immediate is used, it is always duplicated into both halves of the 32-bit quantity. If the 8-bit immediate is used, it is duplicated only when the rotate indicates that the 8-bit immediate should be rotated into the top half of the 32-bit quantity.

ROT	Immediate
00	0x000000XY
01	0x0000XY00
10	0x00XY00XY
11	0xXY00XY00
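The duplication rule for parallel operations can be sketched as follows. The function names are hypothetical; the behaviour is meant to reproduce the table above and the stated rule for the 6-bit immediate.

```python
def parallel_immediate8(imm8: int, rot: int) -> int:
    """8-bit rotated immediate as seen by a parallel operation."""
    if not 0 <= imm8 <= 0xFF or not 0 <= rot <= 3:
        raise ValueError("imm8 must fit in 8 bits and rot in 2 bits")
    value = imm8 << (8 * rot)
    if rot >= 2:
        # Rotated into the top half: copy it into the bottom half too.
        value |= value >> 16
    return value

def parallel_immediate6(imm6: int) -> int:
    """6-bit immediate, always duplicated into both 16-bit halves."""
    if not 0 <= imm6 <= 63:
        raise ValueError("imm6 must be in the range 0 to 63")
    return (imm6 << 16) | imm6
```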

No scale is available for parallel select operations; the scale field of these instructions must be set to 0.

The multiply accumulate instructions do not allow an 8-bit rotated immediate to be specified. Bit 10 of the field is used to partly specify which accumulator to use. Source 2 is implied to be a 16-bit operand. The format is (bits 11 to 0):

0	A0	R2	Register number	Hi/Lo	Scale
1	A0	IMMED_6	Scale

SRC2_MULA

Multiply double instructions do not allow the use of a constant. Only a 16-bit register can be specified. Bit 10 of the field is used to partly specify which accumulator to use. The format is (bits 11 to 0):

0	A0	R2	Register number	Hi/Lo	Scale

SRC2_MULD

Some instructions always imply a 32-bit operation (for example ADDADD). In these cases the Size bit should be set to 1, and the Hi/Lo bit is used to optionally swap the two 16-bit halves of the 32-bit operand. Some instructions always imply a 16-bit operation (for example MUL), and the Size bit should be set to 0. The Hi/Lo bit then selects which half of the register is used (the missing Size bit is assumed to be clear). Multiply accumulate instructions allow the source accumulator and the destination register to be specified independently. For these instructions the Size bits are used to indicate the source accumulator, and the size is implied as 0 by the instruction type.

When a 16-bit value is read (via the A or B bus), it is automatically sign-extended to a 32-bit quantity. If a 48-bit register is read (via the A or B bus), only the bottom 32 bits appear on the bus. Hence in all cases source 1 and source 2 are converted to 32-bit values. Only accumulate instructions using bus C can access the full 48 bits of an accumulator register.

If the refill bit is set, the register is marked as empty after use and is refilled from the ROB by the usual refill mechanism (see the section on the ROB). Piccolo does not stall unless the register is used again as a source operand before the refill has taken place. The minimum number of cycles before the refilled data is valid (best case: the data is waiting at the head of the ROB) is 1 or 2. It is therefore advisable not to use the refilled data in the instruction following the refill request. If use of the operand in the next two instructions can be avoided, it should be, since this prevents performance loss on implementations with deeper pipelines.

The refill bit is specified in the assembler by adding the suffix "^" to the register number. The portion of the register marked as empty depends on the register operand. The two halves of each register can be marked for refill independently (for example, X0.l^ marks only the bottom half of X0 for refill, whereas X0^ marks the whole of X0 for refill). When the top 'half' (bits 47:16) of a 48-bit register is refilled, the 16 bits of data are written to bits 31:16 and sign-extended up to bit 47.

If an attempt is made to refill the same register twice (for example ADD X0, X0^, X0^), only one refill takes place. The assembler should allow only the syntax ADD X1, X0, X0^.

If a read of a register is attempted before that register has been refilled, Piccolo stalls waiting for the register to be refilled. If a register is marked for refill and is then updated before the refilled value has been read, the result is unpredictable (for example, ADD X0, X0^, X1 is unpredictable, because it marks X0 for refill and then refills it by placing the sum of X0 and X1 into it).

The 4-bit scale field encodes fourteen scale types:

-ASR #0, 1, 2, 3, 4, 6, 8, 10

-ASR #12 to 16

-LSL #1

Parallel max/min instructions do not provide a scale, so the 6-bit constant variant of source 2 is unused (it is set to 0 by the assembler).

Register remapping is supported within a REPEAT instruction, allowing a REPEAT to access a moving 'window' of registers without unrolling the loop. This is described in more detail below.

The destination operand has the following 7-bit format (bits 25 to 19):

F	SD	HL	DEST

There are ten variants of this basic encoding:

Variant	F	SD	HL	DEST	Assembler mnemonic
1	0	1	0	Dx	Dx
2	1	1	0	Dx	Dx^
3	0	0	0	Dx	Dx.l
4	1	0	0	Dx	Dx.l^
5	0	0	1	Dx	Dx.h
6	1	0	1	Dx	Dx.h^
	0	1	1	0000	Undefined
7	1	1	1	0 0 00	.l (16 bits, no register writeback)
8	1	1	1	0 1 00	"" (32 bits, no register writeback)
9	1	1	1	1 0 00	.l^ (16-bit output)
10	1	1	1	1 1 00	^ (32-bit output)

The register number (Dx) indicates which of the 16 registers is being addressed. The Hi/Lo bit and the Size bit work together to address each 32-bit register as a pair of 16-bit registers. The Size bit defines how the appropriate flags, as defined for the instruction type, are set, irrespective of whether the result is written to the register bank and/or the output FIFO. This allows compares and similar instructions to be constructed. The add-with-accumulate class of instruction must write its result back to a register.

The following table shows the behaviour of each encoding:

Encoding	Register write	FIFO write	V flag
1	Write whole register	No write	32-bit overflow
2	Write whole register	Write 32 bits	32-bit overflow
3	Write low 16 bits to Dx.l	No write	16-bit overflow
4	Write low 16 bits to Dx.l	Write low 16 bits	16-bit overflow
5	Write low 16 bits to Dx.h	No write	16-bit overflow
6	Write low 16 bits to Dx.h	Write low 16 bits	16-bit overflow
7	No write	No write	16-bit overflow
8	No write	No write	32-bit overflow
9	No write	Write low 16 bits	16-bit overflow
10	No write	Write 32 bits	32-bit overflow

In all cases, the result of any operation before it is written back to a register or inserted into the output FIFO is a 48-bit quantity. There are two cases:

If the write is of 16 bits, the 48-bit quantity is reduced to a 16-bit quantity by selecting the bottom 16 bits [15:0]. If the instruction saturates, the value is saturated into the range -2^15 to 2^15-1. The 16-bit value is then written back to the indicated register and, if the write-FIFO bit is set, to the output FIFO. If it is written to the output FIFO, it is held until the next 16-bit value is written, at which point the two values are paired and placed into the output FIFO as a single 32-bit value.

For 32-bit writes, the 48-bit quantity is reduced to a 32-bit quantity by selecting the bottom 32 bits [31:0].

For both 32-bit and 48-bit writes, if the instruction saturates, the 48-bit value is converted to a 32-bit value in the range -2^31 to 2^31-1. Following the saturation:

-if writeback to an accumulator is performed, the full 48 bits are written;

-if writeback to a 32-bit register is performed, bits [31:0] are written;

-if writeback to the output FIFO is indicated, bits [31:0] are again written.
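The reduction of a wide result to a 16-bit or 32-bit write can be sketched as below. The function name `reduce_result` is hypothetical, a signed Python integer stands in for the 48-bit intermediate, and the saturation bounds used are the usual two's complement ranges.

```python
def reduce_result(value: int, bits: int, saturate: bool) -> int:
    """Reduce a signed wide result to an unsigned `bits`-wide pattern."""
    if bits not in (16, 32):
        raise ValueError("only 16-bit and 32-bit writes are modelled")
    if saturate:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        value = max(lo, min(hi, value))  # clamp into the signed range
    # Select the bottom `bits` bits.
    return value & ((1 << bits) - 1)
```

Without saturation the upper bits are simply dropped, so 0x12345 written as 16 bits becomes 0x2345; with saturation the same value becomes 0x7FFF.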

The destination size is specified in the assembler by .l or .h after the register number. If no register writeback is performed, the register number is unimportant, so the destination register is omitted to indicate that no register is written, or ^ is used to indicate a write to the output FIFO only. For example, SUB , X0, Y0 is equivalent to CMP X0, Y0, and ADD ^, X0, Y0 places the value of X0+Y0 into the output FIFO.

If there is no room in the output FIFO for a value, Piccolo stalls waiting for room to become available.

If a 16-bit value is written out, for example ADD X0.h^, X1, X2, the value is latched until a second 16-bit value is written. The two values are then combined and placed into the output FIFO as one 32-bit number. The first 16-bit value written always appears in the least significant half of the 32-bit word. Data entering the output FIFO is marked as either 16-bit or 32-bit data, so that endianness can be corrected on big-endian systems.

If a 32-bit value is written between two 16-bit writes, the behaviour is undefined.
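The pairing of 16-bit writes in the output FIFO can be illustrated with a toy Python model. The class name, the use of exceptions to represent a stall or an undefined sequence, and the tuple layout of the entries are all assumptions made for the sketch.

```python
class OutputFifo:
    """Toy model of the output FIFO with 16-bit write pairing."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries = []    # list of (value, width) with width 16 or 32
        self.pending = None  # latched first 16-bit value, if any

    def _push(self, value: int, width: int) -> None:
        if len(self.entries) >= self.capacity:
            # The real hardware would stall here; the sketch just raises.
            raise BufferError("output FIFO full")
        self.entries.append((value, width))

    def write16(self, value: int) -> None:
        value &= 0xFFFF
        if self.pending is None:
            self.pending = value  # hold until the second half arrives
        else:
            # The first value written lands in the least significant half.
            self._push((value << 16) | self.pending, 16)
            self.pending = None

    def write32(self, value: int) -> None:
        if self.pending is not None:
            # Undefined in the text; the sketch rejects it outright.
            raise RuntimeError("32-bit write between two 16-bit writes")
        self._push(value & 0xFFFFFFFF, 32)
```

Tagging each entry with its width mirrors the marking that lets a big-endian system correct the halfword order on the way out.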

Register remapping is supported in repeat instructions, allowing a repeat instruction to access a moving 'window' of registers without unrolling the loop. This is described in more detail below.

In preferred embodiments of the present invention, the repeat instruction provides a mechanism for modifying the way in which register operands are specified within a loop. Under this mechanism, the register to be accessed is determined by a function of the register operand in the instruction and an offset into the register bank. The offset is changed in a programmable manner, preferably at the end of each instruction loop. The mechanism can operate independently on registers residing in the X, Y and Z banks. In preferred embodiments, this facility is not available for registers in the A bank.

The notion of logical and physical registers can be used. Instruction operands are logical register references, which are then mapped to physical register references identifying specific Piccolo registers 10. All operations, including refilling, operate on physical registers. Register remapping occurs only on the Piccolo instruction stream side; data loaded into Piccolo is always destined for a physical register and no remapping is performed.

The remapping mechanism is discussed further with reference to Fig. 5, a block diagram showing some of the internal components of the Piccolo coprocessor 4. Data items retrieved from memory by the ARM core 2 are placed in the reorder buffer 12, and the Piccolo registers 10 are refilled from the reorder buffer 12 in the manner described earlier with reference to Fig. 2. Piccolo instructions stored in the cache 6 are passed to an instruction decoder 50 within Piccolo 4, where they are decoded before being passed to the Piccolo processor core 54. The Piccolo processor core 54 includes the multiplier/adder circuit 20, the accumulate/decumulate circuit 22 and the scale/saturate circuit 24 discussed earlier with reference to Fig. 3.

If the instruction decoder 50 is handling an instruction that forms part of an instruction loop identified by a repeat instruction, and the repeat instruction has indicated that a number of registers should be remapped, the register remapping logic 52 is employed to perform the necessary remapping. The register remapping logic 52 can be considered part of the instruction decoder 50, although it will be apparent to those skilled in the art that it may be provided as an entity entirely separate from the instruction decoder 50.

An instruction typically includes one or more operands identifying registers that contain the data items required by the instruction. For example, a typical instruction may include two source operands and one destination operand, identifying two registers containing the data items required by the instruction and a register into which the result should be placed. The register remapping logic 52 receives the operands of an instruction from the instruction decoder 50, these operands identifying logical register references. Based on the logical register references, the register remapping logic determines whether remapping should be applied, and then applies a remapping to produce physical register references as required. If it determines that remapping should not be applied, the logical register references are provided as the physical register references. The preferred manner of performing the remapping is discussed in detail later.

Each physical register reference output by the register remapping logic is passed to the Piccolo processor core 54, so that the processor core can apply the instruction to the data item in the particular register 10 identified by the physical register reference.

The remapping mechanism of the preferred embodiment allows each register bank to be split into two sections: a section within which registers may be remapped, and a section in which registers retain their original register references without remapping. In preferred embodiments, the remapped section starts at the bottom of the register bank being remapped.

The remapping mechanism employs a number of parameters, which are discussed in detail with reference to Fig. 6, a block diagram illustrating how the various parameters are used by the register remapping logic 52. It should be noted that these parameters are given values relative to a point within the bank being remapped, this point being, for example, the bottom of the bank.

The register remapping logic 52 can be considered as comprising two main logical blocks, namely the remap block 56 and the base update block 58. The register remapping logic 52 employs a base pointer that provides an offset value to be added to the logical register reference, this base pointer value being supplied to the remap block 56 by the base update block 58.

A BASESTART signal can be used to define the initial value of the base pointer. This is typically zero, although some other value may be specified. The BASESTART signal is passed to a multiplexer 60 within the base update block 58. During the first iteration of the instruction loop, the multiplexer 60 passes the BASESTART signal to the storage element 66, whereas for subsequent iterations of the loop the multiplexer 60 supplies the next base pointer value to the storage element 66.

The output of the storage element 66 is passed as the current base pointer value to the remap logic 56, and is also passed to one of the inputs of an adder 62 within the base update logic 58. The adder 62 also receives a BASEINC signal that provides a base increment value. The adder 62 is arranged to increment the current base pointer value supplied by the storage element 66 by the BASEINC value, and to pass the result to a modulo circuit 64.

The modulo circuit also receives a BASEWRAP value, and compares it with the base pointer signal output by the adder 62. If the incremented base pointer value equals or exceeds the BASEWRAP value, the new base pointer is wrapped round to a new offset value. The output of the modulo circuit 64 is then the next base pointer value to be stored in the storage element 66. This output is provided to the multiplexer 60, and from there to the storage element 66.

However, this next base pointer value cannot be stored in the storage element 66 until the storage element 66 receives a BASEUPDATE signal from the loop hardware managing the repeat instruction. The loop hardware produces the BASEUPDATE signal periodically, for example each time the instruction loop is to be repeated. When the storage element 66 receives a BASEUPDATE signal, it overwrites the previous base pointer value with the next base pointer value provided by the multiplexer 60. In this manner, the base pointer value supplied to the remap logic 56 changes to the new base pointer value.

The physical register to be accessed within the remapped section of a register bank is determined by adding the logical register reference contained in an operand of the instruction to the base pointer value provided by the base update logic 58. This addition is performed by an adder 68, and the output is passed to a modulo circuit 70. In preferred embodiments, the modulo circuit 70 also receives a register wrap value, and if the output signal from the adder 68 (the sum of the logical register reference and the base pointer value) exceeds the register wrap value, the result wraps round to the bottom of the remapped region. The output of the modulo circuit 70 is then provided to a multiplexer 72.

A REGCOUNT value is provided to logic 74 within the remap block 56, identifying the number of registers within the bank that are to be remapped. The logic 74 compares this REGCOUNT value with the logical register reference, and passes a control signal to the multiplexer 72 according to the result of the comparison. The multiplexer 72 receives as its two inputs the logical register reference and the output of the modulo circuit 70 (the remapped register reference). In preferred embodiments of the present invention, if the logical register reference is less than the REGCOUNT value, the logic 74 instructs the multiplexer 72 to output the remapped register reference as the physical register reference. If, however, the logical register reference is greater than or equal to the REGCOUNT value, the logic 74 instructs the multiplexer 72 to output the logical register reference directly as the physical register reference.

As mentioned above, in preferred embodiments it is the repeat instruction that invokes the remapping mechanism. As will be discussed in more detail later, repeat instructions provide four zero-cycle loops in hardware. These hardware loops are illustrated in Fig. 5 as part of the instruction decoder 50. Each time the instruction decoder 50 requests an instruction from the cache 6, the cache returns that instruction to the instruction decoder, whereupon the instruction decoder determines whether the returned instruction is a repeat instruction. If so, one of the hardware loops is configured to handle that repeat instruction.

Each repeat instruction specifies the number of instructions in the loop and the number of times to go around the loop (which is either a constant or read from a Piccolo register). Two opcodes, REPEAT and NEXT, are provided for defining a hardware loop; the NEXT opcode is used merely as a delimiter and is not assembled as an instruction. REPEAT goes at the start of the loop and NEXT delimits the end of the loop, allowing the assembler to calculate the number of instructions in the loop body. In preferred embodiments, the repeat instruction can include remapping parameters such as the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters to be used by the register remapping logic 52.

A number of registers can be provided to store the remapping parameters used by the register remapping logic. Within these registers, a number of sets of predefined remapping parameters can be provided, while some registers are left for storing user-defined remapping parameters. If the remapping parameters specified with the repeat instruction equal one of the sets of predefined remapping parameters, the appropriate REPEAT encoding is used, this encoding causing a multiplexer or the like to supply the appropriate remapping parameters from the registers directly to the register remapping logic. If, on the other hand, the remapping parameters differ from every set of predefined remapping parameters, the assembler generates a Remapping Parameter Move instruction (RMOV), which allows the user-defined remapping parameter registers to be configured, the RMOV instruction being followed by the REPEAT instruction. Preferably, the RMOV instruction places the user-defined remapping parameters in the registers set aside for storing them, and the multiplexer is then programmed to pass the contents of those registers to the register remapping logic.

In preferred embodiments, the REGCOUNT, BASEINC, BASEWRAP and REGWRAP parameters take one of the values identified in the following table:

Parameter	Description
REGCOUNT	Identifies the number of 16-bit registers on which remapping is performed; it may take the values 0, 2, 4 or 8. Registers below REGCOUNT are remapped; those at or above REGCOUNT are accessed directly.
BASEINC	Defines by how many 16-bit registers the base pointer is incremented at the end of each loop iteration. In preferred embodiments it may take the values 1, 2 or 4, although in practice it can take other values if desired, including negative values where appropriate.
BASEWRAP	Determines the ceiling of the base calculation. The base wrap modulus may take the values 2, 4 or 8.
REGWRAP	Determines the ceiling of the remap calculation. The register wrap modulus may take the values 2, 4 or 8. REGWRAP may be chosen to be equal to REGCOUNT.

Referring to Fig. 6, an example of how the various parameters are used by the remap block 56 is as follows (in this example, the logical and physical register values are relative to the particular bank):

if (Logical Register < REGCOUNT)

Physical Register = (Logical Register + Base) MOD REGCOUNT

else

Physical Register = Logical Register

end if

At the end of the loop, before the next iteration of the loop begins, the base update logic 58 performs the following update of the base pointer:

Base = (Base + BASEINC) MOD BASEWRAP
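The remap and base-update rules can be modelled in a few lines of Python. This is a minimal sketch, not the hardware: the class name `RemapState` and its method names are hypothetical, and the modulus used for the remap calculation is taken to be the REGWRAP parameter described for modulo circuit 70 (the listing above writes it as REGCOUNT, which coincides whenever REGWRAP is chosen equal to REGCOUNT).

```python
from dataclasses import dataclass


@dataclass
class RemapState:
    """Per-bank remapping parameters plus the current base pointer."""
    regcount: int   # number of 16-bit registers that are remapped
    baseinc: int    # base pointer increment per loop iteration
    basewrap: int   # modulus applied to the base pointer
    regwrap: int    # modulus applied to (logical register + base)
    base: int = 0   # initial value, i.e. BASESTART

    def physical(self, logical: int) -> int:
        """Map a logical 16-bit register number to a physical one."""
        if logical < self.regcount:
            return (logical + self.base) % self.regwrap
        return logical  # registers at or above REGCOUNT are accessed directly

    def update_base(self) -> None:
        """Applied at the end of each loop iteration (BASEUPDATE)."""
        self.base = (self.base + self.baseinc) % self.basewrap


state = RemapState(regcount=4, baseinc=1, basewrap=4, regwrap=4)
first_pass = [state.physical(r) for r in range(6)]
state.update_base()
second_pass = [state.physical(r) for r in range(6)]
```

With these parameters the first pass leaves every register unchanged, and on the second pass logical registers 0 to 3 rotate by one place while registers 4 and 5 stay fixed.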

At the end of a remapping loop, register remapping is switched off and all registers are then accessed as physical registers. In preferred embodiments, only one remapping REPEAT is active at any one time. Loops may still be nested, but only one loop may update the remapping variables at any particular time. If desired, however, remapping repeats could be nested.

To illustrate the code-density benefit obtained by employing the remapping mechanism according to the preferred embodiment of the present invention, a typical block filter algorithm is discussed below. The principles of the block filter algorithm are first discussed with reference to Fig. 7. As shown in Fig. 7, accumulator register A0 is arranged to accumulate the results of a number of multiplication operations, these being the multiplication of coefficient c0 by data item d0, of coefficient c1 by data item d1, of coefficient c2 by data item d2, and so on. Register A1 accumulates the results of a similar set of multiplication operations, but this time the set of coefficients has been shifted so that c0 is multiplied by d1, c1 by d2, c2 by d3, and so on. Likewise, register A2 accumulates the results of multiplying the data values by the coefficient values shifted a further step to the right, so that c0 is multiplied by d2, c1 by d3, c2 by d4, and so on. This shift, multiply and accumulate process is then repeated, with the result placed in register A3.
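As a plain reference for the arithmetic described with Fig. 7, the four accumulations can be written as a short Python function. This is only a sketch of the mathematics; the function name `block_filter` and its argument names are hypothetical, and register allocation and refill are ignored here.

```python
def block_filter(coeffs, data, n_acc=4):
    """Return n_acc accumulators where acc[j] = sum over k of coeffs[k] * data[j + k].

    `data` must hold at least len(coeffs) + n_acc - 1 items.
    """
    acc = [0] * n_acc
    for k, c in enumerate(coeffs):
        for j in range(n_acc):
            # Accumulator j sees the data window shifted j places to the right.
            acc[j] += c * data[j + k]
    return acc


example = block_filter([1, 0, 0, 0], [5, 6, 7, 8, 9, 10, 11])
```

With a single non-zero coefficient c0 = 1, each accumulator simply picks out the data item it starts on, which makes the one-step shift between A0, A1, A2 and A3 easy to see.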

If register remapping according to the preferred embodiment of the present invention is not employed, the following instruction loop is required to perform the block filter algorithm:

; start with 4 new data values

ZERO {A0-A3} ; zero the accumulators

REPEAT Z1 ; Z1 = (number of coefficients / 4)

; do the next four coefficients; on the first time around:

；a0+＝d0*c0+d1*c1+d2*c2+d3*c3

；a1+＝d1*c0+d2*c1+d3*c2+d4*c3

；a2+＝d2*c0+d3*c1+d4*c2+d5*c3

；a3+＝d3*c0+d4*c1+d5*c2+d6*c3

MULA A0, X0.l^, Y0.l, A0; A0+=d0*c0, and load d4

MULA A1, X0.h, Y0.l, A1 ; a1 += d1*c0

MULA A2, X1.l, Y0.l, A2 ; a2 += d2*c0

MULA A3, X1.h, Y0.l^, A3 ; a3 += d3*c0, and load c4

MULA A0, X0.h^, Y0.h, A0 ; a0 += d1*c1, and load d5

MULA A1, X1.l, Y0.h, A1 ; a1 += d2*c1

MULA A2, X1.h, Y0.h, A2 ; a2 += d3*c1

MULA A3, X0.l, Y0.h^, A3 ; a3 += d4*c1, and load c5

MULA A0, X1.l^, Y1.l, A0 ; a0 += d2*c2, and load d6

MULA A1, X1.h, Y1.l, A1 ; a1 += d3*c2

MULA A2, X0.l, Y1.l, A2 ; a2 += d4*c2

MULA A3, X0.h, Y1.l^, A3 ; a3 += d5*c2, and load c6

MULA A0, X1.h^, Y1.h, A0 ; a0 += d3*c3, and load d7

MULA A1, X0.l, Y1.h, A1 ; a1 += d4*c3

MULA A2, X0.h, Y1.h, A2 ; a2 += d5*c3

MULA A3, X1.l, Y1.h^, A3 ; a3 += d6*c3, and load c7

NEXT

In this example, the data values are placed in the X register bank and the coefficient values in the Y register bank. As a first step, the four accumulator registers A0, A1, A2 and A3 are set to zero. Once the accumulator registers have been reset, an instruction loop is entered, delimited by the REPEAT and NEXT instructions. The value Z1 determines the number of times the instruction loop should be repeated and, for reasons discussed below, is equal to the number of coefficients (c0, c1, c2, etc.) divided by 4.

The instruction loop comprises 16 multiply-accumulate instructions (MULA) which, after the first pass through the loop, leave registers A0, A1, A2 and A3 holding the results of the calculations shown in the code above between the REPEAT and the first MULA instruction. To illustrate how the multiply-accumulate instructions operate, consider the first four MULA instructions. The first instruction multiplies the first, or lower, 16 bits of X bank register 0 by the lower 16 bits of Y bank register 0, and adds the result to accumulator register A0. At the same time, the lower 16 bits of X bank register 0 are marked with a refill bit, indicating that this part of the register can now be refilled with a new data value. It is marked in this way because, as can be seen from Fig. 7, once data item d0 has been multiplied by coefficient c0 (as represented by the first MULA instruction), d0 is no longer needed for the rest of the block filter instructions and can therefore be replaced with a new data value.

The second MULA instruction then multiplies the second, or upper, 16 bits of X bank register 0 by the lower 16 bits of Y bank register 0 (this represents the multiplication d1 x c0 shown in Fig. 7). Similarly, the third and fourth MULA instructions represent the multiplications d2 x c0 and d3 x c0 respectively. As can be seen from Fig. 7, once these four calculations have been performed, coefficient c0 is no longer needed, so register Y0.l is marked with a refill bit to allow it to be overwritten with another coefficient (c4).

The next four MULA instructions represent the calculations d1 x c1, d2 x c1, d3 x c1 and d4 x c1 respectively. Once the calculation d1 x c1 has been performed, register X0.h is marked with a refill bit, since d1 is no longer needed. Similarly, once all four instructions have been performed, register Y0.h is marked for refilling, since coefficient c1 is no longer needed. In the same way, the next four MULA instructions correspond to the calculations d2 x c2, d3 x c2, d4 x c2 and d5 x c2, and the final four instructions correspond to the calculations d3 x c3, d4 x c3, d5 x c3 and d6 x c3.

In the embodiment described above, since the registers cannot be remapped, each multiplication operation has to be written out explicitly, with the particular register required specified in the operands. Once the 16 MULA instructions have been performed, the instruction loop can be repeated for coefficients c4 to c7 and data items d4 to d10. Furthermore, because the loop acts on four coefficient values per iteration, the number of coefficient values must be a multiple of four, and Z1 must be calculated as the number of coefficients divided by 4.

By employing the remapping mechanism according to the preferred embodiment of the present invention, the instruction loop can be dramatically reduced, so that it contains only 4 multiply-accumulate instructions rather than the 16 that were otherwise required. Using the remapping mechanism, the code can be written as follows:

; start with 4 new data values

ZERO {A0-A3} ; zero the accumulators

REPEAT Z1, X++ n4 w4 r4, Y++ n4 w4 r4 ; Z1 = (number of coefficients)

; remapping is applied to the X and Y banks

; four 16-bit registers in these banks are remapped

; the base pointer for both banks is incremented by one on each iteration of the loop

; the base pointer wraps when it reaches the fourth register in the bank

MULA A0, X0.l^, Y0.l, A0 ; a0 += d0*c0, and load d4

MULA A1, X0.h, Y0.l, A1 ; a1 += d1*c0

MULA A2, X1.l, Y0.l, A2 ; a2 += d2*c0

MULA A3, X1.h, Y0.l^, A3 ; a3 += d3*c0, and load c4

NEXT ; go around the loop and advance the remapping

As before, the first step is to set the four accumulator registers A0-A3 to zero. The instruction loop delimited by the REPEAT and NEXT opcodes is then entered. The repeat instruction has a number of parameters associated with it, as follows:

X++: indicates that BASEINC is '1' for the X register bank.

n4: indicates that REGCOUNT is '4', so the first four X bank registers X0.l to X1.h are remapped.

w4: indicates that BASEWRAP is '4' for the X register bank.

r4: indicates that REGWRAP is '4' for the X register bank.

Y++: indicates that BASEINC is '1' for the Y register bank.

n4: indicates that REGCOUNT is '4', so the first four Y bank registers Y0.l to Y1.h are remapped.

w4: indicates that BASEWRAP is '4' for the Y register bank.

r4: indicates that REGWRAP is '4' for the Y register bank.

It should also be noted that the value Z1 is now equal to the number of coefficients, rather than to the number of coefficients divided by 4 as in the prior art example.

For the first iteration of the instruction loop, the base pointer value is zero, so no remapping takes place. On the next iteration of the loop, however, the base pointer value is '1' for both the X and Y banks, so the operands are remapped as follows:

X0.l becomes X0.h

X0.h becomes X1.l

X1.l becomes X1.h

X1.h becomes X0.l (since base wrap is '4')

Y0.l becomes Y0.h

Y0.h becomes Y1.l

Y1.l becomes Y1.h

Y1.h becomes Y0.l (since base wrap is '4')

It can therefore be seen that, on the second iteration, the four MULA instructions actually perform the calculations indicated by the fifth to eighth MULA instructions of the earlier example that does not use the remapping of the present invention. Similarly, the third and fourth iterations of the loop perform the calculations formerly performed by the ninth to twelfth and the thirteenth to sixteenth MULA instructions of the prior art code.

It can thus be seen that the above code performs exactly the same block filter algorithm as the prior art code, but improves the code density within the loop body by a factor of four, since only 4 instructions need to be provided rather than the 16 required by the prior art.
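The equivalence of the two loops can be checked with a small Python simulation. It is a sketch under simplifying assumptions: a register marked with ^ is refilled with the next data item or coefficient immediately after it is read (the real refill goes through the reorder buffer with some latency), the 16-bit registers X0.l, X0.h, X1.l, X1.h are numbered 0 to 3 (likewise for Y), and all names (`run`, `REMAPPED_BODY`, `UNROLLED_BODY`) are hypothetical.

```python
# Each tuple: (accumulator, X register, refill X?, Y register, refill Y?)
REMAPPED_BODY = [
    (0, 0, True, 0, False),
    (1, 1, False, 0, False),
    (2, 2, False, 0, False),
    (3, 3, False, 0, True),
]
# The 16-instruction loop body written out without remapping.
UNROLLED_BODY = REMAPPED_BODY + [
    (0, 1, True, 1, False), (1, 2, False, 1, False),
    (2, 3, False, 1, False), (3, 0, False, 1, True),
    (0, 2, True, 2, False), (1, 3, False, 2, False),
    (2, 0, False, 2, False), (3, 1, False, 2, True),
    (0, 3, True, 3, False), (1, 0, False, 3, False),
    (2, 1, False, 3, False), (3, 2, False, 3, True),
]


def take(seq, index):
    """Next item from the input stream, or 0 once the stream is exhausted."""
    return seq[index] if index < len(seq) else 0


def run(body, data, coeffs, iterations, remap):
    x, y = list(data[:4]), list(coeffs[:4])   # initial register contents
    next_d, next_c = 4, 4
    acc = [0, 0, 0, 0]
    base = 0
    for _ in range(iterations):
        for a, xr, x_refill, yr, y_refill in body:
            px = (xr + base) % 4 if remap else xr   # REGCOUNT = REGWRAP = 4
            py = (yr + base) % 4 if remap else yr
            acc[a] += x[px] * y[py]
            if x_refill:
                x[px] = take(data, next_d)
                next_d += 1
            if y_refill:
                y[py] = take(coeffs, next_c)
                next_c += 1
        if remap:
            base = (base + 1) % 4                   # BASEINC = 1, BASEWRAP = 4
    return acc


data = list(range(1, 12))                 # d0..d10
coeffs = [3, -1, 4, 1, -5, 9, 2, -6]      # c0..c7
expected = [sum(c * data[j + k] for k, c in enumerate(coeffs)) for j in range(4)]
unrolled = run(UNROLLED_BODY, data, coeffs, len(coeffs) // 4, remap=False)
remapped = run(REMAPPED_BODY, data, coeffs, len(coeffs), remap=True)
```

Both runs produce the same four accumulator values as the direct sum, with the remapped loop iterating once per coefficient and the unrolled loop once per four coefficients.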

By employing the register remapping technique according to the preferred embodiment of the present invention, the following advantages can be realised:

1. Improved code density.

2. In certain situations, the latency between marking a register as empty and that register being refilled from Piccolo's reorder buffer is hidden. This could otherwise be achieved by unrolling loops, at the cost of increased code size.

3. A variable number of registers can be accessed; by varying the number of loop iterations performed, the number of registers accessed can be varied.

4. Algorithm development is eased. For suitable algorithms, the programmer can write a piece of code for the nth stage of the algorithm, and then use register remapping to apply the formula to a sliding set of data.

It will be apparent that certain changes can be made to the register remapping mechanism described above without departing from the scope of the present invention. For example, the register bank 10 may provide more physical registers than the programmer can specify in an instruction operand. Although these extra registers cannot be accessed directly, the register remapping mechanism can make them available. For example, consider the X register bank discussed earlier, which has four 32-bit registers available to the programmer, so that eight 16-bit registers can be specified by logical register references. The X register bank could actually consist of, say, six 32-bit registers, in which case there would be four additional 16-bit registers not directly accessible to the programmer. These four extra registers could, however, be made available by the remapping mechanism, thereby providing additional registers for the storage of data items.

The following assembler syntax is used:

>> means logical shift right, or shift left if the shift operand is negative (see <lscale> below).

->> means arithmetic shift right, or shift left if the shift operand is negative (see <scale> below).

ROR means rotate right.

SAT(a) means the saturated value of a (saturated to 16 or 32 bits depending on the size of the destination register). Specifically, to saturate to 16 bits, any value greater than +0x7fff is replaced by +0x7fff and any value less than -0x8000 is replaced by -0x8000. Saturation to 32 bits is similar, with the limits +0x7fffffff and -0x80000000. If the destination register is 48 bits, the saturation is still at 32 bits.
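The shift and rotate operators can be modelled in Python as shown below. This is a sketch only: the function names are hypothetical, the 48-bit width used for the logical shift and the 32-bit width used for the rotate are assumptions made for illustration, and signed values are represented as ordinary Python integers.

```python
def asr(value: int, amount: int) -> int:
    """'->>': arithmetic shift right, or shift left when `amount` is negative."""
    return value >> amount if amount >= 0 else value << -amount


def lsr(value: int, amount: int, width: int = 48) -> int:
    """'>>': logical shift right on a `width`-bit pattern, or left when negative."""
    mask = (1 << width) - 1
    pattern = value & mask            # view the value as an unsigned bit pattern
    if amount >= 0:
        return pattern >> amount
    return (pattern << -amount) & mask


def ror(value: int, amount: int, width: int = 32) -> int:
    """ROR: rotate a `width`-bit pattern right by `amount` places."""
    mask = (1 << width) - 1
    amount %= width
    pattern = value & mask
    return ((pattern >> amount) | (pattern << (width - amount))) & mask
```

The difference between the two shifts shows up on negative inputs: `asr(-8, 1)` keeps the sign, while `lsr(-1, 44)` shifts zeros in from the top of the assumed 48-bit pattern.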

Source operand 1 can take one of the following forms:

<src1> is shorthand for [Rn | Rn.l | Rn.h | Rn.x][^]. In other words, all 7 bits of the source specifier are valid, and the register is read as an (optionally swapped) 32-bit value or as a sign-extended 16-bit value. For an accumulator, only the bottom 32 bits are read. The ^ specifies register refill.

<src1_16> is shorthand for [Rn.l | Rn.h][^]. Only 16-bit values can be read.

<src1_32> is shorthand for [Rn | Rn.x][^]. Only a 32-bit value can be read, with the upper and lower halves optionally swapped.

Source operand 2 (<src_2>) can take one of the following forms:

<src2> is shorthand for three options:

- a source register of the form [Rn | Rn.l | Rn.h | Rn.x][^], plus a scale (<scale>) of the final result.

- an optionally shifted 8-bit constant (<immed_8>), but no scale of the final result.

- a 6-bit constant (<immed_6>), plus a scale (<scale>) of the final result.

<src2_maxmin> is the same as <src2> but does not permit a scale.

<src2_shift> provides shift instructions with a limited subset of <src2>. See above for details.

<src2_par> is as for <src2_shift>.

For instructions that specify a third operand:

<acc> is shorthand for any one of the four accumulator registers [A0 | A1 | A2 | A3]. All 48 bits are read. A refill cannot be specified.

The destination register has the form:

<dest> is shorthand for [Rn | Rn.l | Rn.h | .l | ][^]. With no "." extension the whole register is written (48 bits in the case of an accumulator). Where no write-back to a register is required, the register used is unimportant. The assembler supports omitting the destination register to indicate that write-back is not required, or ".l" to indicate that no write-back is required but that the flags should be set as though the result were a 16-bit quantity. ^ denotes that the value is written to the output FIFO.

<scale> represents a number of arithmetic scales. There are fourteen available scales:

ASR #0, 1, 2, 3, 4, 6, 8, 10

ASR #12 to 16

LSL #1

<immed_8> stands for an unsigned 8-bit immediate value. It consists of a byte rotated left by 0, 8, 16 or 24 bits. Hence the values 0xYZ000000, 0x00YZ0000, 0x0000YZ00 and 0x000000YZ can be encoded for any YZ. The rotate is encoded as a 2-bit quantity.

<immed_6> stands for an unsigned 6-bit immediate.

<PARAMS> is used to specify register remapping and has the following form:

<BANK> can be [X | Y | Z]

<BASEINC> can be [++ | +1 | +2 | +4]

<RENUMBER> can be [0 | 2 | 4 | 8]

<BASEWRAP> can be [2 | 4 | 8]

The expression <cond> is shorthand for any one of the following condition codes. Note that the encoding differs slightly from that of the ARM, because the unsigned LS and HI codes have been replaced by more useful signed overflow/underflow tests. The V and N flags are set differently on Piccolo than on the ARM, so the translation from condition tests to flag checking also differs from the ARM.

0000 EQ Z=0 Last result was zero.

0001 NE Z=1 Last result was non-zero.

0010 CS C=1 Used after a shift/MAX operation.

0011 CC C=0

0100 MI/LT N=1 Last result was negative.

0101 PL/GE N=0 Last result was positive.

0110 VS V=1 Last result had signed overflow/saturation.

0111 VC V=0 Last result had no overflow/saturation.

1000 VP V=1 & N=0 Last result overflowed positive.

1001 VN V=1 & N=1 Last result overflowed negative.

1010 reserved

1011 reserved

1100 GT N=0 & Z=0

1101 LE N=1 | Z=1

1110 AL

1111 reserved

Because Piccolo deals with signed quantities, the unsigned LS and HI conditions have been dropped and replaced by VP and VN, which describe the direction of any overflow. Because the result of the ALU is 48 bits wide, MI and LT now perform the same function, as do PL and GE. This leaves three slots free for future expansion.

All operations are signed unless otherwise indicated.

The primary and secondary condition codes each consist of:

N - Negative.

Z - Zero.

C - Carry/unsigned overflow.

V - Signed overflow.

Arithmetic instructions can be divided into two types: parallel and 'full width'. The 'full width' instructions set only the primary flags, whereas the parallel operators set the primary and secondary flags according to the upper and lower 16-bit halves of the result.

The N, Z and V flags are calculated from the full ALU result, after the scale has been applied but before the result is written to the destination. An ASR always reduces the number of bits required to store the result, whereas an ASL increases it. To prevent Piccolo from truncating the 48-bit result when an ASL scale is applied, the number of bits over which zero detection and overflow detection are carried out is limited.

The N flag is calculated on the assumption that signed arithmetic is being performed. This is because, when overflow occurs, the most significant bit of the result is either the C flag or the N flag, depending on whether the input operands are signed or unsigned.

The V flag indicates whether any loss of precision occurs as a result of writing the result to the selected destination. If no write-back is selected, a 'size' is still implied and the overflow flag is set correctly. Overflow can occur in the following cases:

- Writing to a 16-bit register when the result is not in the range -2^15 to 2^15-1.

- Writing to a 32-bit register when the result is not in the range -2^31 to 2^31-1.

Parallel add/subtract instructions set the N, Z and V flags independently on the upper and lower halves of the result.

When writing to an accumulator, the V flag is set as if writing to a 32-bit register. This allows saturating instructions to use accumulators as 32-bit registers.

The saturating absolute value instruction (SABS) also sets the overflow flag if the absolute value of the input operand would not fit in the designated destination.

The carry flag is set by add and subtract instructions and is used as a 'binary' flag by the MAX/MIN, SABS and CLB instructions. All other instructions, including multiply operations, preserve the carry flag(s).

For add and subtract operations, the carry is that generated by bit 31 or bit 15 of the result, according to whether the destination is 32 or 16 bits wide.

The standard arithmetic instructions can be divided into a number of types, depending on how the flags are set:

In the case of add and subtract instructions, if the N bit is set then all flags are preserved. If the N bit is not set, the flags are updated as follows:

Z is set if the full 48-bit result is 0.

N is set if bit 47 of the full 48-bit result is set (the result is negative).

V is set if either of the following conditions holds:

The destination register is 16 bits and the signed result will not fit into a 16-bit register (it is not in the range -2^15 <= x < 2^15).

The destination register is a 32/48-bit register and the signed result will not fit into 32 bits.

If <dest> is a 32-bit or 48-bit register, the C flag is set if there is a carry out of bit 31 when summing <src1> and <src2>, or if no borrow from bit 31 occurs when subtracting <src2> from <src1> (the same carry value as would be expected on the ARM). If <dest> is a 16-bit register, the C flag is set if the sum carries out of bit 15.

The secondary flags (SZ, SN, SV, SC) are preserved.
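The flag rules for add and subtract instructions can be sketched in Python as follows. This is an illustrative model under stated assumptions, not a description of the actual ALU: the function name `add_sub_flags` is hypothetical, operands are signed Python integers, no scale is applied, and the carry for a 16-bit destination is taken from bit 15 as described for add and subtract operations generally.

```python
def add_sub_flags(src1: int, src2: int, dest_bits: int, subtract: bool = False) -> dict:
    """Return the primary flags for an unscaled add or subtract.

    Assumption: `dest_bits` is 16, 32 or 48 (48 meaning an accumulator).
    """
    result = src1 - src2 if subtract else src1 + src2
    result48 = result & ((1 << 48) - 1)        # full 48-bit ALU result

    # V and C use 16 bits for a 16-bit destination, otherwise 32 bits
    # (accumulators are treated like 32-bit registers for overflow).
    bits = 16 if dest_bits == 16 else 32
    mask = (1 << bits) - 1
    a, b = src1 & mask, src2 & mask
    if subtract:
        carry = a >= b                         # set when no borrow occurs
    else:
        carry = ((a + b) >> bits) == 1         # carry out of bit 31 or bit 15

    return {
        "Z": result48 == 0,
        "N": bool((result48 >> 47) & 1),
        "V": not (-(1 << (bits - 1)) <= result < (1 << (bits - 1))),
        "C": carry,
    }
```

For example, adding 1 to 0x7fffffff with a 32-bit destination sets V (the signed result no longer fits in 32 bits) but not N, because N is taken from bit 47 of the full 48-bit result.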

In the case of instructions that either perform a multiplication or accumulate from a 48-bit register:

Z is set if the full 48-bit result is 0.

N is set if bit 47 of the full 48-bit result is set (the result is negative).

V is set if either (1) the destination register is 16 bits and the signed result will not fit into a 16-bit register (it is not in the range -2^15 <= x < 2^15), or (2) the destination register is a 32/48-bit register and the signed result will not fit into 32 bits.

C is preserved.

The secondary flags (SZ, SN, SV, SC) are preserved.

Other instructions, including logical operations, parallel adds and subtracts, max and min, shifts and so on, are discussed below.

Add and subtract instructions add or subtract two registers, scale the result, and then store it back to a register. The operands are treated as signed values. For the non-saturating variants, flag updating is optional and can be suppressed by appending an N to the end of the instruction.

Encoding (fields listed from bit 31 down to bit 0): 0, OPC, F, SD, DEST, S1, R1, SRC1, SRC2.

OPC specifies the type of instruction.

Operation (OPC):

100N0 dest＝(src1+src2)(->>scale)(，N)

110N0 dest＝(src1-src2)(->>scale)(，N)

10001 dest＝SAT((src1+src2)(->>scale))

11001 dest＝SAT((src1-src2)(->>scale))

01110 dest＝(src2-src1)(->>scale)

01111 dest＝SAT((src2-src1)(->>scale))

101N0 dest＝(src1+src2-Carry)(->>scale)(，N)

111N0 dest＝(src1-src2+Carry-1)(->>scale)(，N)

Mnemonics:

100N0 ADD{N} <dest>，<src1>，<src2>{，<scale>}

110N0 SUB{N} <dest>，<src1>，<src2>{，<scale>}

10001 SADD <dest>，<src1>，<src2>{，<scale>}

11001 SSUB <dest>，<src1>，<src2>{，<scale>}

01110 RSB <dest>，<src1>，<src2>{，<scale>}

01111 SRSB <dest>，<src1>，<src2>{，<scale>}

101N0 ADC{N} <dest>，<src1>，<src2>{，<scale>}

111N0 SBC{N} <dest>，<src1>，<src2>{，<scale>}

The assembler supports the following opcodes:

CMP <src1>，<src2>

CMN <src1>，<src2>

CMP is a subtract that sets the flags with the register write disabled. CMN is an add that sets the flags with the register write disabled.

Flags: as discussed above.

Reasons for inclusion:

ADC is useful for inserting the carry into the bottom of a register following a shift/MAX/MIN operation. It is also used to perform a 32/32-bit divide, and it provides extended-precision adds. The N bit gives finer control of the flags, in particular the carry. This enables a 32/32-bit division at 2 cycles per bit.

Saturated adds and subtracts are needed for G.729 and similar algorithms.

Incrementing/decrementing counters. RSB is useful for calculating shifts (x=32-x is a common operation). A saturated RSB is needed for saturated negation (used in G.729).

Add/subtract accumulate instructions perform addition and subtraction with accumulation and scaling/saturation. Unlike the multiply accumulate instructions, the accumulator number cannot be specified independently of the destination register. The bottom two bits of the destination register give the number, acc, of the 48-bit accumulator to accumulate into. Hence ADDA X0, X1, X2, A0 and ADDA A3, X1, X2, A3 are valid, but ADDA X1, X1, X2, A0 is not. With this class of instruction, the result must be written back to a register; the no-writeback encodings of the destination field are not allowed.

Encoding (fields listed from bit 31 down to bit 0): 0, OPC, 1, 0, Sa, F, SD, DEST, S1, R1, SRC1, SRC2.

OPC specifies the type of instruction. In the following, acc is (DEST[1:0]). The Sa bit indicates saturation.

Operation (OPC):

0 dest＝{SAT}(acc+(src1+src2)){->>scale}

1 dest＝{SAT}(acc+(src1-src2)){->>scale}

Mnemonics:

0 {S}ADDA <dest>，<src1>，<src2>，<acc>{，<scale>}

1 {S}SUBA <dest>，<src1>，<src2>，<acc>{，<scale>}

An S before the command indicates saturation.

Flags: see above.

Reasons for inclusion:

The ADDA (add accumulate) instruction is useful for summing two words of an array of integers with an accumulator per cycle (for instance, to find their average). The SUBA (subtract accumulate) instruction is useful in calculating the sum of differences (for correlation); it subtracts two separate values and adds the difference to a third register.

Addition with rounding can be done by using a <dest> different from <acc>. For instance, X0=(X1+X2+16384)>>15 can be done in one cycle by keeping 16384 in A0. Addition with a rounding constant can be done with ADDA X0, X1, #16384, A0.

For a bit-exact implementation of the sum of ((a_i*b_j)>>k) (quite common in TrueSpeech):

The standard Piccolo code would be:

MUL t1，a_0，b_0，ASR#k

ADD ans，ans，t1

MUL t2，a_1，b_1，ASR#k

ADD ans，ans，t2

There are two problems with this code: it is too long, and the adds are not performed to 48-bit precision, so guard bits cannot be used. A better solution is to use ADDA:

MUL t1，a_0，b_0，ASR#k

MUL t2，a_1，b_1，ASR#k

ADDA ans，t1，t2，ans

This gives a 25% speed increase and retains 48-bit precision.
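The arithmetic of the two ADDA uses described above (rounded addition and accumulation of two scaled products) can be checked with a short Python sketch. The helper names `asr`, `mul` and `adda` are hypothetical, saturation is left out, and Python integers stand in for the 48-bit accumulator, so this is only an illustration of the operation dest=(acc+(src1+src2))->>scale.

```python
def asr(value, shift):
    """Arithmetic shift right; Python's >> on ints already keeps the sign."""
    return value >> shift

def mul(a, b, k=0):
    """Model of MUL dest, a, b, ASR#k (no saturation)."""
    return asr(a * b, k)

def adda(acc, src1, src2, scale=0):
    """Model of ADDA dest, src1, src2, acc{, scale} (no saturation)."""
    return asr(acc + src1 + src2, scale)

# Rounded addition: X0 = (X1 + X2 + 16384) >> 15 with 16384 held in A0.
rounded = adda(16384, 40000, 10000, 15)

# Sum of two scaled products folded into the running total in one ADDA.
total = adda(100, mul(3, 7, 1), mul(5, 11, 1))
```

With these inputs the rounded addition gives 2, where a plain (40000+10000)>>15 would truncate to 1, and the product sum is 100+10+27=137.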

Add/subtract in parallel instructions perform addition and subtraction on two signed 16-bit quantities held in pairs in 32-bit registers. The primary condition code flags are set from the result of the most significant 16 bits, and the secondary flags are updated from the least significant half. Only 32-bit registers can be specified as the source for these instructions, although the values can be halfword swapped. The individual halves of each register are treated as signed values. The calculations and scaling are done with no loss of precision. Hence ADDADD X0, X1, X2, ASR#1 will produce the correct averages in the upper and lower halves of X0. Optional saturation is provided for each instruction, for which the Sa bit must be set.

Encoding (fields listed from bit 31 down to bit 0): 0, OPC, Sa, F, SD, DEST, S1, R1, SRC1, SRC2.

OPC defines the operation.

Operation (OPC):

000 dest.h=(src1.h+src2.h)->>{scale},

dest.l=(src1.l+src2.l)->>{scale}

001 dest.h=(src1.h+src2.h)->>{scale},

dest.l=(src1.l-src2.l)->>{scale}

100 dest.h=(src1.h-src2.h)->>{scale},

dest.l=(src1.l+src2.l)->>{scale}

101 dest.h=(src1.h-src2.h)->>{scale},

dest.l=(src1.l-src2.l)->>{scale}

If the Sa bit is set, each sum/difference is independently saturated.

Mnemonics:

000 {S}ADDADD <dest>, <src1_32>, <src2_32>{, <scale>}

001 {S}ADDSUB <dest>, <src1_32>, <src2_32>{, <scale>}

100 {S}SUBADD <dest>, <src1_32>, <src2_32>{, <scale>}

101 {S}SUBSUB <dest>, <src1_32>, <src2_32>{, <scale>}

An S before the command indicates saturation. The assembler also supports

CMNCMN <dest>, <src1_32>, <src2_32>{, <scale>}

CMNCMP <dest>, <src1_32>, <src2_32>{, <scale>}

CMPCMN <dest>, <src1_32>, <src2_32>{, <scale>}

CMPCMP <dest>, <src1_32>, <src2_32>{, <scale>}

These are generated by the standard instructions with no write-back.

Flags:

C is set if there is a carry out of bit 15 when adding the two upper 16-bit halves.

Z is set if the sum of the upper 16-bit halves is 0.

N is set if the sum of the upper 16-bit halves is negative.

V is set if the signed 17-bit sum of the upper 16-bit halves will not fit into 16 bits (after scaling).

SZ, SN, SV and SC are set similarly for the lower 16-bit halves.

Reasons for inclusion:

The parallel add and subtract instructions are useful for performing operations on complex numbers held in a single 32-bit register. They are used in the FFT (Fast Fourier Transform) kernel. They are also useful for simple addition/subtraction of vectors of 16-bit data, allowing two elements to be processed per cycle.
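The averaging property claimed for ADDADD with ASR#1 can be seen in a small Python sketch. The helper names `to_signed16`, `pack` and `addadd` are hypothetical; the model covers only the non-saturating form and simply truncates each half to 16 bits when writing the result, which is an assumption about cases where the scaled sum does not fit.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack(high, low):
    """Pack two signed 16-bit values into one 32-bit word."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)

def addadd(src1, src2, scale=0):
    """Model of ADDADD dest, src1, src2, ASR#scale (non-saturating)."""
    # Each half is summed at full precision before scaling, so the 17th
    # bit of the intermediate sum is not lost.
    high = (to_signed16(src1 >> 16) + to_signed16(src2 >> 16)) >> scale
    low = (to_signed16(src1) + to_signed16(src2)) >> scale
    return pack(high, low)

# Average of (32767, -2) and (32767, -4) in one operation.
average = addadd(pack(32767, -2), pack(32767, -4), scale=1)
```

The upper half averages to 32767 even though the intermediate sum 65534 exceeds the 16-bit range, and the lower half averages to -3.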

The branch (conditional) instruction allows conditional changes in control flow. Piccolo may take three cycles to execute a taken branch.

Encoding (fields listed from bit 31 down to bit 0): 0, 11111, 100, 000, IMMEDIATE_16, COND.

Operation:

Branch by the offset if <cond> holds according to the primary flags.

The offset is a signed 16-bit number of words. At present the range of the offset is restricted to -32768 to +32767 words.

The address calculation performed is

target address = branch instruction address + 4 + OFFSET

Mnemonics:

B<cond> <destination_label>

Flags: unaffected.

Reasons for inclusion:

Highly useful in most routines.

The conditional add or subtract instruction conditionally adds src2 to src1 or subtracts src2 from src1.

Encoding (fields listed from bit 31 down to bit 0): 1, 0010, OPC, F, SD, DEST, S1, R1, SRC1, SRC2.

OPC specifies the type of instruction.

Operation (OPC):

0 if (carry set) temp=src1-src2 else temp=src1+src2

dest=temp{->>scale}

1 if (carry set) temp=src1-src2 else temp=src1+src2

dest=temp{->>scale}, but if the scale is a shift left then the new value of the carry (from src1-src2 or src1+src2) is shifted into the bottom.

Mnemonics:

0 CAS <dest>, <src1>, <src2>{, <scale>}

1 CASC <dest>, <src1>, <src2>{, <scale>}

Flags: see above.

Reasons for inclusion:

The conditional add or subtract instruction enables efficient divide code to be constructed.

Example 1: Divide the 32-bit unsigned value in X0 by the 16-bit unsigned value in X1 (with the assumption that X0 < (X1 << 16) and X1.h = 0).

LSL X1, X1, #15 ; shift up the divisor

SUB X1, X1, #0 ; set the carry flag

REPEAT #16

CASC X0, X0, X1, LSL#1

Example 2: Divide the 32-bit positive value in X0 by the 32-bit positive value in X1, with early termination.

MOV X2, #0 ; clear the quotient

LOG Z0, X0 ; number of bits X0 can be shifted

LOG Z1, X1 ; number of bits X1 can be shifted

SUBS Z0, Z1, Z0 ; X1 shift up so the leading 1s match

BLT div_end ; X1 > X0 so the answer is 0

LSL X1, X1, Z0 ; match the leading ones

ADD Z0, Z0, #1 ; number of tests to do

SUBS Z0, Z0, #0 ; set the carry

REPEAT Z0

CAS X0, X0, X1, LSL#1

ADCN X2, X2, X2

At the end, X2 holds the quotient and the remainder can be recovered from X0.
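Example 1 can be followed step by step with a Python model of CASC. This is a sketch under assumptions: the function names `casc_step` and `divide_32_by_16` are hypothetical, registers are modelled as 32-bit values, the carry is taken as the carry out of bit 31 (no borrow for a subtract), and the quotient is read from the bottom 16 bits of X0 after the 16 iterations, which is how the non-restoring scheme is interpreted here.

```python
MASK32 = 0xFFFFFFFF

def casc_step(x0, x1, carry):
    """One CASC X0, X0, X1, LSL#1: conditional add/subtract, then shift the
    new carry into the bottom of the result."""
    if carry:
        total = x0 + ((~x1) & MASK32) + 1   # x0 - x1 as a two's complement add
    else:
        total = x0 + x1
    new_carry = (total >> 32) & 1            # carry out of bit 31
    result = ((total << 1) | new_carry) & MASK32
    return result, new_carry

def divide_32_by_16(dividend, divisor):
    """Unsigned divide, assuming dividend < (divisor << 16) and divisor < 2**16."""
    x0 = dividend & MASK32
    x1 = (divisor << 15) & MASK32            # LSL X1, X1, #15
    carry = 1                                # SUB X1, X1, #0 sets the carry
    for _ in range(16):                      # REPEAT #16
        x0, carry = casc_step(x0, x1, carry)
    return x0 & 0xFFFF                       # quotient bits collected at the bottom

quotient = divide_32_by_16(7, 2)
```

Each iteration produces one quotient bit: the carry records whether the partial remainder stayed non-negative, and it both selects the next operation and becomes the next bit shifted into X0.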

The count leading bits instruction allows data to be normalised.

Encoding (fields listed from bit 31 down to bit 0): 011011, F, SD, DEST, S1, R1, SRC1, 101110000000.

Operation:

dest is set to the number of places by which the value in src1 must be shifted left in order for bit 31 to differ from bit 30. This is a value in the range 0-30, except in the special cases where src1 is either -1 or 0, when 31 is returned.
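The CLB result can be modelled directly from this definition. The following Python sketch uses a hypothetical function name `clb` and treats src1 as a 32-bit value; it illustrates the definition only and does not model the flags.

```python
MASK32 = 0xFFFFFFFF

def clb(src1):
    """Number of left shifts needed before bit 31 differs from bit 30."""
    value = src1 & MASK32
    if value in (0, MASK32):                 # src1 is 0 or -1: special case
        return 31
    count = 0
    while ((value >> 31) & 1) == ((value >> 30) & 1):
        value = (value << 1) & MASK32
        count += 1
    return count

shift_for_one = clb(1)
```

Shifting a value left by the returned count normalises it, leaving exactly one sign bit at the top of the register.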

Mnemonics:

CLB <dest>，<src1>

Flags:

Z is set if the result is 0.

N is cleared.

C is set if src1 is either -1 or 0.

V is preserved.

Reasons for inclusion:

A step needed for normalisation.

Halt and breakpoint instructions are provided for stopping Piccolo execution.

Encoding (fields listed from bit 31 down to bit 0): 1, 11111, 11, OP, 00000000000000000000000.

OPC specifies the type of instruction.

Operation (OPC):

0 Piccolo execution is stopped and the Halt bit is set in the Piccolo status register.

1 Piccolo execution is stopped, the Break bit is set in the Piccolo status register, and the ARM is interrupted to report that a breakpoint has been reached.

Mnemonics:

0 HALT

1 BREAK

Flags: unaffected.

Logical operation instructions perform a logical operation on a 32- or 16-bit register. The operands are treated as unsigned values.

Encoding (fields listed from bit 31 down to bit 0): 1, 000, OPC, F, SD, DEST, S1, R1, SRC1, SRC2.

OPC encodes the logical operation to perform.

Operation (OPC):

00 dest=(src1&src2){->>scale}

01 dest=(src1|src2){->>scale}

10 dest=(src1&~src2){->>scale}

11 dest=(src1^src2){->>scale}

Mnemonics:

00 AND <dest>, <src1>, <src2>{, <scale>}

01 ORR <dest>, <src1>, <src2>{, <scale>}

10 BIC <dest>, <src1>, <src2>{, <scale>}

11 EOR <dest>, <src1>, <src2>{, <scale>}

The assembler supports the following opcodes:

TST <src1>, <src2>

TEQ <src1>, <src2>

TST is an AND with the register write disabled. TEQ is an EOR with the register write disabled.

Flags:

Z is set if the result is all zeros.

N, C, V are preserved.

SZ, SN, SC, SV are preserved.

Reasons for inclusion:

Speech compression algorithms use packed bit fields to encode information. The bit-masking instructions help with extracting/packing these fields.

Max and Min operation instructions perform maximum and minimum operations.

Encoding (fields listed from bit 31 down to bit 0): 0, 101, OPC, 1, F, SD, DEST, S1, R1, SRC1, SRC2.

OPC specifies the type of instruction.

Operation (OPC):

0 dest=(src1<=src2)?src1:src2

1 dest=(src1>src2)?src1:src2

Mnemonics:

0 MIN <dest>, <src1>, <src2>

1 MAX <dest>, <src1>, <src2>

Flags:

Z is set if the result is 0.

N is set if the result is negative.

C For Max: C is set if src2>=src1 (dest=src1 case).

For Min: C is set if src2>=src1 (dest=src2 case).

V is preserved.

Reasons for inclusion:

In order to find the strength of a signal, many algorithms scan a sample to find the maximum/minimum of the absolute value of the samples. MAX and MIN are invaluable for this. Depending on whether the first or the last maximum in the signal is to be found, the operands src1 and src2 can be swapped.

MAX X0, X0, #0 converts X0 to a positive number, clipping from below.

MIN X0, X0, #255 clips X0 from above. This is useful for graphics processing.

Max and Min operations in parallel instructions perform maximum and minimum operations on parallel 16-bit data.

Encoding (fields listed from bit 31 down to bit 0): 0, 111, OPC, 1, F, SD, DEST, S1, R1, SRC1, SRC2_PARALLEL.

OPC specifies the type of instruction.

Operation (OPC):

0 dest.l=(src1.l<=src2.l)?src1.l:src2.l

dest.h=(src1.h<=src2.h)?src1.h:src2.h

1 dest.l=(src1.l>src2.l)?src1.l:src2.l

dest.h=(src1.h>src2.h)?src1.h:src2.h

Mnemonics:

0 MINMIN <dest>, <src1>, <src2>

1 MAXMAX <dest>, <src1>, <src2>

Flags:

Z is set if the upper 16 bits of the result are 0.

N is set if the upper 16 bits of the result are negative.

C For Max: C is set if src2.h>=src1.h (dest=src1 case).

For Min: C is set if src2.h>=src1.h (dest=src2 case).

V is preserved.

SZ, SN, SC, SV are set similarly for the lower 16-bit halves.

Reasons for inclusion:

As for the 32-bit Max and Min.

Move long immediate operation instructions allow a register to be set to any signed 16-bit, sign-extended value. Two of these instructions can set a 32-bit register to any value (by accessing the high and low halves in sequence). For moves between registers, see the select operations.

Encoding (fields listed from bit 31 down to bit 0): 1, 11100, F, SD, DEST, IMMEDIATE_15, +/-, 000.

Mnemonics:

MOV <dest>, #<imm_16>

The assembler uses this MOV instruction to provide a non-interlocking NOP (no operation), i.e. NOP is equivalent to MOV , #0.

Flags: the flags are unaffected.

Reasons for inclusion:

Initialising registers/counters.

Multiply accumulate operation instructions perform signed multiplication with accumulation or de-accumulation, scaling and saturation.

Encoding (fields listed from bit 31 down to bit 0): 1, 10, OPC, Sa, F, SD, DEST, A1, R1, SRC1, SRC2_MULA.

The OPC field specifies the type of instruction.

Operation (OPC):

00 dest=(acc+(src1*src2)){->>scale}

01 dest=(acc-(src1*src2)){->>scale}

In each case, the result is saturated before being written to the destination if the Sa bit is set.

Mnemonics:

00 {S}MULA <dest>, <src1_16>, <src2_16>, <acc>{, <scale>}

01 {S}MULS <dest>, <src1_16>, <src2_16>, <acc>{, <scale>}

An S before the command indicates saturation.

Flags: see the section above.

Reasons for inclusion:

A one-cycle sustained MULA is required for FIR code. MULS is used in the FFT butterfly. A MULA is also useful for multiplication with rounding. For example, A0=(X0*X1+16384)>>15 can be done in one cycle by holding 16384 in another accumulator (A1, for instance). A <dest> different from <acc> is also required for the FFT kernel.

Multiply double operation instructions perform signed multiplication, doubling the result prior to accumulation or de-accumulation, scaling and saturation.

Encoding (fields listed from bit 31 down to bit 0): 1, 10, 1, OPC, 1, F, SD, DEST, A1, R1, SRC1, 0, A0, R2, SRC2, SCALE.

OPC specifies the type of instruction.

Operation (OPC):

0 dest=SAT((acc+SAT(2*src1*src2)){->>scale})

1 dest=SAT((acc-SAT(2*src1*src2)){->>scale})

Mnemonics:

0 SMLDA <dest>, <src1_16>, <src2_16>, <acc>{, <scale>}

1 SMLDS <dest>, <src1_16>, <src2_16>, <acc>{, <scale>}

Flags: see the section above.

Reasons for inclusion:

The MLD instruction is required for G.729 and other algorithms that use fractional arithmetic. Most DSPs provide a fractional mode that enables a left shift of one bit at the output of the multiplier, prior to accumulation or write-back. Supporting this as a specific instruction provides more programming flexibility. The names equivalent to some of the G-series basic operations are:

L_msu => SMLDS

L_mac => SMLDA

These make use of the saturation of the multiplier when shifting left by one bit. If a sequence of fractional multiply accumulates is required with no loss of precision, MULA can be used, with the sum maintained in 33.14 format. A left shift and saturate can be used at the end to convert to 1.15 format, if required.
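The operation formulas for SMLDA and SMLDS can be exercised with a brief Python sketch. The names `sat32`, `smlda` and `smlds` are hypothetical, and SAT is assumed here to clamp to the signed 32-bit range; the text does not state the saturation width for every destination size, so this is an illustration only.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def sat32(value):
    """Assumed meaning of SAT: clamp to the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, value))

def smlda(acc, src1, src2, scale=0):
    """dest = SAT((acc + SAT(2*src1*src2)) ->> scale)"""
    return sat32((acc + sat32(2 * src1 * src2)) >> scale)

def smlds(acc, src1, src2, scale=0):
    """dest = SAT((acc - SAT(2*src1*src2)) ->> scale)"""
    return sat32((acc - sat32(2 * src1 * src2)) >> scale)

# The one product that overflows when doubled: (-32768) * (-32768) * 2 = 2**31.
corner_case = smlda(0, -32768, -32768)
```

The corner case shows why the inner saturation matters: doubling the product of two 16-bit minimum values would otherwise wrap to a negative number.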

Multiply operation instructions perform signed multiplication, with optional scaling/saturation. The source registers (16-bit only) are treated as signed numbers.

Encoding (fields listed from bit 31 down to bit 0): 00011, OPC, F, SD, DEST, S1, R1, SRC1, SRC2.

OPC specifies the type of instruction.

Operation (OPC):

0 dest=(src1*src2){->>scale}

1 dest=SAT((src1*src2){->>scale})

Mnemonics:

0 MUL <dest>, <src1_16>, <src2>{, <scale>}

1 SMUL <dest>, <src1_16>, <src2>{, <scale>}

Flags: see the section above.

Reasons for inclusion:

Signed and saturated multiplies are required by many processes.

Register list operations are used to perform actions on a set of registers. The EMPTY and ZERO instructions are provided for resetting a selection of registers prior to, or in between, routines. The OUTPUT instruction is provided to store the contents of a list of registers to the output FIFO.

Encoding (fields listed from bit 31 down to bit 0): 1, 11111, 0, OPC, 00, REGISTER_LIST_16, SCALE.

OPC specifies the type of instruction.

Operation (OPC):

000 for (k=0; k<16; k++) if bit k of the register list is set, then register k is marked as being empty.

001 for (k=0; k<16; k++) if bit k of the register list is set, then register k is set to contain 0.

010 Undefined

011 Undefined

100 for (k=0; k<16; k++) if bit k of the register list is set, then (register k ->> scale) is written to the output FIFO.

101 for (k=0; k<16; k++) if bit k of the register list is set, then (register k ->> scale) is written to the output FIFO and register k is marked as being empty.

110 for (k=0; k<16; k++) if bit k of the register list is set, then SAT(register k ->> scale) is written to the output FIFO.

111 for (k=0; k<16; k++) if bit k of the register list is set, then SAT(register k ->> scale) is written to the output FIFO and register k is marked as being empty.

Mnemonics:

000 EMPTY <register_list>

001 ZERO <register_list>

010 Unused

011 Unused

100 OUTPUT <register_list>{, <scale>}

101 OUTPUT <register_list>^{, <scale>}

110 SOUTPUT <register_list>{, <scale>}

111 SOUTPUT <register_list>^{, <scale>}

Flags:

Unaffected

Examples:

EMPTY {A0, A1, X0-X3}

ZERO {Y0-Y3}

OUTPUT {X0-Y1}^

The assembler also supports the syntax

OUTPUT Rn

in which case it outputs one register using a MOV ^, Rn instruction.

The EMPTY instruction will stall until all registers to be emptied contain valid data (i.e. are not empty).

Register list operations must not be used within remapping REPEAT loops.

The OUTPUT instruction can only specify up to eight registers to output.

Reasons for inclusion:

After a routine has finished, the next routine expects all registers to be empty so that it can receive data from the ARM. An EMPTY instruction is needed to accomplish this. Before performing a FIR or other filter, all accumulators and partial results need to be zeroed. The ZERO instruction helps with this. Both are designed to improve code density by replacing a series of single register moves. The OUTPUT instruction is included to improve code density by replacing a series of MOV ^, Rn instructions.

A remapping parameter move instruction, RMOV, is provided to allow the configuration of the user-defined register remapping parameters.

The instruction encoding is as follows (fields listed from bit 31 down to bit 0): 1, 11111, 101, 00, ZPARAMS, YPARAMS, XPARAMS.

Each PARAMS field comprises the following entries (bits 6 down to 0): BASEWRAP, BASEINC, 0, RENUMBER.

The meaning of these entries is as follows:

Parameter	Description
RENUMBER	Number of 16-bit registers to perform remapping on; may take the values 0, 2, 4, 8. Registers below RENUMBER are remapped, those above are accessed directly.
BASEINC	The amount by which the base pointer is incremented at the end of each loop. May take the values 1, 2 or 4.
BASEWRAP	The base wrapping modulus; may take the values 2, 4, 8.

Mnemonics:

RMOV <PARAMS>, [<PARAMS>]

The <PARAMS> field has the following format:

w<BASEWRAP>

<BANK> ::= [X|Y|Z]

<BASEINC> ::= [++|+1|+2|+4]

<RENUMBER> ::= [0|2|4|8]

<BASEWRAP> ::= [2|4|8]

If the RMOV instruction is used while remapping is active, the behaviour is UNPREDICTABLE.

Flags: unaffected

Repeat instructions provide four zero-cycle loops in hardware. The REPEAT instruction defines a new hardware loop. Piccolo uses hardware loop 0 for the first REPEAT instruction, hardware loop 1 for a REPEAT instruction nested within the first, and so on. The REPEAT instruction does not need to specify which loop is being used. REPEAT loops must be strictly nested. If an attempt is made to nest loops to a depth greater than 4, the behaviour is unpredictable.

Each REPEAT instruction specifies the number of instructions in the loop (which immediately follows the REPEAT instruction) and the number of times to go around the loop (which is either a constant or read from a Piccolo register).

If the number of instructions in the loop is small (1 or 2), Piccolo may take extra cycles to set the loop up.

If the loop count is register-specified, a 32-bit access is implied (S1=1), though only the bottom 16 bits are significant and the number is considered to be unsigned. If the loop count is zero, the action of the loop is undefined. A copy of the loop count is taken, so the register can be reused immediately (or even refilled) without affecting the loop.

The REPEAT instruction provides a mechanism to modify the way in which register operands are specified within a loop. The details are described above.

Encoding of a REPEAT with a register-specified number of loops (fields listed from bit 31 down to bit 0): 1, 11110, 0, RFIELD_4, 00, 0, R1, SRC1, 0000, #INSTRUCTIONS_8.

Encoding of a REPEAT with a fixed number of loops (fields listed from bit 31 down to bit 0): 1, 11110, 1, RFIELD_4, #LOOPS_13, #INSTRUCTIONS_8.

The RFIELD operand specifies which of 16 remapping parameter configurations to use inside the loop.

RFIELD	Remapping operation
0	No remapping performed
1	User-defined remapping
2..15	Preset remapping configurations TBD

The assembler provides two opcodes, REPEAT and NEXT, for defining a hardware loop. The REPEAT goes at the start of the loop and the NEXT delimits the end of the loop, allowing the assembler to calculate the number of instructions in the loop body. As regards the REPEAT, it is only necessary to specify the number of loops, either as a constant or as a register. For example:

REPEAT X0

MULA A0，Y0.l，Z0.l，A0

MULA A0，Y0.h^，Z0.h^，A0

REPEAT #10

MULA A0，X0^，Y0^，A0

The assembler supports the syntax:

REPEAT #iterations [, <PARAMS>]

to specify the remapping parameters to use for the REPEAT. If the required remapping parameters are equal to one of the predefined sets of parameters, the appropriate REPEAT encoding is used. If they are not, the assembler generates an RMOV to load the user-defined parameters, followed by a REPEAT instruction. See the section above for details of the RMOV instruction and the remapping parameter format.

If the number of iterations for a loop is 0, the action of REPEAT is UNPREDICTABLE.

If the number-of-instructions field is set to 0, the action of REPEAT is UNPREDICTABLE.

If a loop consists of only one instruction and that instruction is a branch, the behaviour is UNPREDICTABLE.

Branches from within the bounds of a REPEAT loop to outside the bounds of that loop are UNPREDICTABLE.

The saturating absolute instruction calculates the saturated absolute value of source 1.

Encoding (fields listed from bit 31 down to bit 0): 0, 10011, F, SD, DEST, S1, R1, SRC1, 10000000000.

Operation:

dest=SAT((src1>=0)?src1:-src1). The value is always saturated.
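The effect of always saturating can be shown with a minimal Python sketch. The function name `sabs` is hypothetical and a 32-bit source and destination are assumed; the flags returned follow the Z, C and V rules listed for this instruction, with N omitted because it is preserved.

```python
INT32_MAX = 2**31 - 1

def sabs(src1):
    """Return (dest, flags) for a saturating absolute value of a signed 32-bit input."""
    magnitude = src1 if src1 >= 0 else -src1
    dest = min(magnitude, INT32_MAX)         # assumed 32-bit saturation
    flags = {
        "Z": dest == 0,                      # result is zero
        "C": src1 < 0,                       # dest = -src1 case
        "V": dest != magnitude,              # saturation occurred
    }
    return dest, flags

most_negative = sabs(-2**31)
```

The only input that saturates under this assumption is the most negative 32-bit value, whose magnitude does not fit in a signed 32-bit register.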

Mnemonics:

SABS <dest>，<src1>

Flags:

Z is set if the result is 0.

N is preserved.

C is set if src1<0 (dest=-src1 case).

V is set if saturation occurred.

Reasons for inclusion:

Useful in many DSP applications.

Select operations (conditional moves) serve to conditionally move either source 1 or source 2 into the destination register. A select is always equivalent to a move. There are also parallel operations for use after parallel adds/subtracts.

Note that, for implementation reasons, both source operands may be read by the instruction, so if either one is empty the instruction will stall, irrespective of whether the operand is strictly required.

Encoding (fields listed from bit 31 down to bit 0): 1, 011, OPC, F, SD, DEST, S1, R1, SRC1, SRC2_SEL.

OPC specifies the type of instruction.

Operation (OPC):

00 If <cond> holds for the primary flags then dest=src1 else dest=src2.

01 If <cond> holds for the primary flags then dest.h=src1.h else dest.h=src2.h.

If <cond> holds for the secondary flags then dest.l=src1.l else dest.l=src2.l.

10 If <cond> holds for the primary flags then dest.h=src1.h else dest.h=src2.h.

If <cond> fails for the secondary flags then dest.l=src1.l else dest.l=src2.l.

11 Reserved

Mnemonics:

00 SEL<cond> <dest>, <src1>, <src2>

01 SELTT<cond> <dest>, <src1>, <src2>

10 SELTF<cond> <dest>, <src1>, <src2>

11 Unused

If a register is marked for refill, it is unconditionally refilled. The assembler also provides the following mnemonics:

MOV<cond> <dest>, <src1>

SELFT<cond> <dest>, <src1>, <src2>

SELFF<cond> <dest>, <src1>, <src2>

MOV<cond> A, B is equivalent to SEL<cond> A, B, A. SELFT and SELFF are obtained by swapping src1 and src2 and using SELTF and SELTT.

Flags: all flags are preserved so that a sequence of selects may be performed.

Reasons for inclusion:

Used for making simple decisions inline without having to resort to a branch. Used by Viterbi algorithms and when scanning a sample or vector for the largest element.

Shift operation instructions provide left and right logical shifts, right arithmetic shifts, and rotates by a specified amount. The shift amount is considered to be a signed integer between -128 and +127 taken from the bottom 8 bits of the register contents, or an immediate in the range +1 to +31. A shift by a negative amount causes a shift in the opposite direction by ABS(shift amount).

The input operands are sign extended to 32 bits; the resulting 32-bit output is sign extended to 48 bits before write-back, so that a write to a 48-bit register behaves sensibly.

Encoding (fields listed from bit 31 down to bit 0): 1, 010, OPC, F, SD, DEST, S1, R1, SRC1, SRC2_SEL.

OPC specifies the type of instruction.

Operation (OPC):

00 dest=(src2>=0)?src1<<src2:src1>>-src2

01 dest=(src2>=0)?src1>>src2:src1<<-src2

10 dest=(src2>=0)?src1->>src2:src1<<-src2

11 dest=(src2>=0)?src1 ROR src2:src1 ROL -src2

Mnemonics:

00 ASL <dest>, <src1>, <src2_16>

01 LSR <dest>, <src1>, <src2_16>

10 ASR <dest>, <src1>, <src2_16>

11 ROR <dest>, <src1>, <src2_16>

Flags:

Z is set if the result is 0.

N is set if the result is negative.

V is preserved.

C is set to the value of the last bit shifted out (as on the ARM).

The behaviour of register-specified shifts is:

- LSL by 32 has result zero, with C set to bit 0 of src1.

- LSL by more than 32 has result zero, with C set to zero.

- LSR by 32 has result zero, with C set to bit 31 of src1.

- LSR by more than 32 has result zero, with C set to zero.

- ASR by 32 or more has the result filled with, and C equal to, bit 31 of src1.

- ROR by 32 has the result equal to src1 and C set to bit 31 of src1.

- ROR by n, where n is greater than 32, gives the same result and carry out as ROR by n-32; therefore repeatedly subtract 32 from n until the amount is in the range 1 to 32 and see above.
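The boundary cases for register-specified LSR and ROR amounts can be captured in a short Python model. The function names `lsr32` and `ror32` are hypothetical, only positive amounts are handled (the behaviour for a zero amount is not described here), and each function returns the 32-bit result together with the carry.

```python
MASK32 = 0xFFFFFFFF

def lsr32(value, amount):
    """Logical shift right by amount >= 1; returns (result, carry)."""
    if amount < 1:
        raise ValueError("only positive shift amounts are modelled")
    value &= MASK32
    if amount > 32:
        return 0, 0                           # result zero, C set to zero
    carry = (value >> (amount - 1)) & 1       # last bit shifted out
    return value >> amount, carry             # amount == 32 gives result zero

def ror32(value, amount):
    """Rotate right by amount >= 1; returns (result, carry)."""
    if amount < 1:
        raise ValueError("only positive rotate amounts are modelled")
    value &= MASK32
    while amount > 32:                        # ROR by n behaves as ROR by n-32
        amount -= 32
    result = ((value >> amount) | (value << (32 - amount))) & MASK32
    carry = (result >> 31) & 1                # last bit rotated out lands in bit 31
    return result, carry

rotated = ror32(0x00000001, 33)
```

For a rotate, the last bit rotated out of the bottom is the bit that arrives at bit 31 of the result, which is why a rotate by 32 leaves src1 unchanged while still setting C to bit 31 of src1.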

Reasons for inclusion:

Multiplication/division by a power of 2. Bit and field extraction. Serial registers.

Undefined instructions are set out above in the instruction set listing. Their execution will cause Piccolo to halt execution, set the U bit in the status register, and disable itself (as if the E bit in the control register had been cleared). This allows any future extensions of the instruction set to be trapped and optionally emulated on existing implementations.

Accessing Piccolo state from the ARM is as follows. State access mode is used to observe/modify the state of Piccolo. This mechanism is provided for two purposes:

- Context switching.

- Debugging.

Piccolo is put into state access mode by executing the PSTATE instruction. This mode allows all Piccolo state to be saved and restored with a sequence of STC and LDC instructions. When state access mode is entered, the use of the Piccolo coprocessor ID PICCOLO1 is modified to allow the state of Piccolo to be accessed. There are 7 banks of Piccolo state. All the data in a particular bank can be loaded and stored with a single LDC or STC.

Bank 0: Private registers.

- One 32-bit word containing the value of the Piccolo ID register (read only).

- One 32-bit word containing the state of the control register.

- One 32-bit word containing the state of the status register.

- One 32-bit word containing the state of the program counter.

Bank 1: General purpose registers (GPR).

- 16 32-bit words containing the general purpose register state.

Bank 2: Accumulators.

- 4 32-bit words containing the top 32 bits of the accumulator registers (note that duplication with the GPR state is necessary for restoration purposes; it would otherwise imply another write enable on the register bank).

Bank 3: Register/Piccolo ROB/output FIFO status.

- One 32-bit word indicating which registers are marked for refill (2 bits for each 32-bit register).

- 8 32-bit words containing the state of the ROB tags (8 7-bit items stored in bits 7 to 0).

- 3 32-bit words containing the state of the unaligned ROB latches (bits 17 to 0).

- One 32-bit word indicating which slots in the output shift register contain valid data (bit 4 indicates empty, bits 3 to 0 encode the number of used entries).

- One 32-bit word containing the state of the output FIFO holding latch (bits 17 to 0).

Bank 4: ROB input data.

- 8 32-bit data values.

Bank 5: Output FIFO data.

- 8 32-bit data values.

Bank 6: Loop hardware.

- 4 32-bit words containing the loop start addresses.

- 4 32-bit words containing the loop end addresses.

- 4 32-bit words containing the loop count (bits 15 to 0).

- One 32-bit word containing the user-defined remapping parameters and other remapping state.

The LDC instruction is used to load Piccolo state while Piccolo is in state access mode. The BANK field indicates which group is being loaded.

Bits:   31-28  27-25  24  23  22  21  20  19-16  15-12  11-8      7-0
Field:  COND   110    P   U   0   W   1   BASE   BANK   PICCOLO1  OFFSET
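As a non-authoritative sketch, the LDC word for a state load can be assembled from its fields as follows. The field positions assume the usual coprocessor load layout (COND in bits 31-28, 110 in bits 27-25, P, U, 0, W and 1 in bits 24-20, BASE in bits 19-16, BANK in bits 15-12, the coprocessor number in bits 11-8 and OFFSET in bits 7-0). The function name is hypothetical, the numeric value of PICCOLO1 is not given in this description (the value 8 below is only a placeholder), and the raw OFFSET field is passed through because its scaling is not stated here.

```python
def encode_piccolo_ldc(cond, p, u, w, base, bank, cp_num, offset):
    """Pack the LDC fields into a 32-bit instruction word (illustrative)."""
    fields = (("cond", cond, 4), ("p", p, 1), ("u", u, 1), ("w", w, 1),
              ("base", base, 4), ("bank", bank, 4),
              ("cp_num", cp_num, 4), ("offset", offset, 8))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bit(s)")
    return ((cond << 28) | (0b110 << 25) | (p << 24) | (u << 23)
            | (0 << 22)      # bit 22 is fixed at 0
            | (w << 21)
            | (1 << 20)      # bit 20 is 1 for a load
            | (base << 16) | (bank << 12) | (cp_num << 8) | offset)


# Placeholder values: group 0, base register R0, hypothetical coprocessor number 8.
word = encode_piccolo_ldc(cond=0xE, p=0, u=1, w=1, base=0, bank=0,
                          cp_num=8, offset=4)
print(hex(word))
```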

The following sequence loads all Piccolo state from the address in register R0.

LDP B0, [R0], #16! Load the special registers

LDP B1, [R0], #64! Load the general-purpose registers

LDP B2, [R0], #16! Load the accumulators

LDP B3, [R0], #56! Load the register/ROB/FIFO status

LDP B4, [R0], #32! Load the ROB data

LDP B5, [R0], #32! Load the output FIFO data

LDP B6, [R0], #52! Load the loop hardware
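The memory layout implied by the sequence above can be sketched as follows. The sketch assumes post-indexed addressing, in which each transfer uses the current value of R0 and R0 is then advanced by the stated offset; the start address 0x1000 and the names used are illustrative only.

```python
# (group, post-index offset in bytes) pairs from the load sequence above.
RESTORE_SEQUENCE = [(0, 16), (1, 64), (2, 16), (3, 56),
                    (4, 32), (5, 32), (6, 52)]


def walk_post_indexed(start, sequence):
    """Return the address used for each group and the final value of R0.

    Assumes the access happens at the current address and the offset is
    added afterwards (post-indexing).
    """
    addr = start
    layout = []
    for bank, step in sequence:
        layout.append((bank, addr))
        addr += step
    return layout, addr


layout, end = walk_post_indexed(0x1000, RESTORE_SEQUENCE)
for bank, addr in layout:
    print(f"B{bank} at {addr:#06x}")
print(f"R0 after the sequence: {end:#06x}")
```

Under these assumptions the complete state image occupies 268 contiguous bytes starting at the initial value of R0.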

The STC instruction is used to store Piccolo state while Piccolo is in state access mode. The BANK field specifies which group is being stored.

Bits:   31-28  27-25  24  23  22  21  20  19-16  15-12  11-8      7-0
Field:  COND   110    P   U   0   W   0   BASE   BANK   PICCOLO1  OFFSET
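The LDC and STC formats differ only in bit 20. As a hedged sketch, a word in either format can be split back into its fields as shown below; the function name and the dictionary keys are hypothetical, and the field positions assume the usual coprocessor load/store layout.

```python
def decode_state_transfer(word):
    """Split a 32-bit LDC/STC state-transfer word into its fields (illustrative)."""
    if (word >> 25) & 0b111 != 0b110:
        raise ValueError("bits 27-25 are not 110")
    if (word >> 22) & 1 != 0:
        raise ValueError("bit 22 is not 0")
    return {
        "cond": (word >> 28) & 0xF,
        "p": (word >> 24) & 1,
        "u": (word >> 23) & 1,
        "w": (word >> 21) & 1,
        "load": bool((word >> 20) & 1),  # 1 selects LDC, 0 selects STC
        "base": (word >> 16) & 0xF,
        "bank": (word >> 12) & 0xF,
        "cp_num": (word >> 8) & 0xF,
        "offset": word & 0xFF,
    }


# Example word with placeholder field values (coprocessor number 8 is hypothetical).
fields = decode_state_transfer(0xECA03804)
print(fields)
```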

The following sequence stores all Piccolo state to memory at the address in register R0.

STP B0, [R0], #16! Save the special registers

STP B1, [R0], #64! Save the general-purpose registers

STP B2, [R0], #16! Save the accumulators

STP B3, [R0], #56! Save the register/ROB/FIFO status

STP B4, [R0], #32! Save the ROB data

STP B5, [R0], #32! Save the output FIFO data

STP B6, [R0], #52! Save the loop hardware

Debug mode - Piccolo needs to respond to the same debug mechanisms as those supported by ARM, namely software debugging through Demon and Angel, and hardware debugging with Embedded ICE. The following mechanisms are available for debugging a Piccolo system:

- ARM instruction breakpoints.

- Data breakpoints (watchpoints).

- Piccolo instruction breakpoints.

- Piccolo software breakpoints.

ARM instruction and data breakpoints are handled by the ARM Embedded ICE module; Piccolo instruction breakpoints are handled by the Piccolo Embedded ICE module; Piccolo software breakpoints are handled by the Piccolo core.

The hardware breakpoint system can be configured so that both ARM and Piccolo are breakpointed.

Software breakpoints are handled by a Piccolo instruction (halt or break), which causes Piccolo to stop execution, enter debug mode (the B bit in the status register is set) and disable itself (as if Piccolo had been disabled with a PDISABLE instruction). The program counter remains valid, allowing the breakpoint address to be recovered. Piccolo executes no further instructions.

Single-stepping Piccolo is done by setting one breakpoint after another on the Piccolo instruction stream.

Software debug - The basic functionality provided by Piccolo is the ability to load and save all state to memory by means of coprocessor instructions while in state access mode. This allows a debugger to save all state to memory, read and/or update it, and restore it to Piccolo. The Piccolo store-state mechanism is non-destructive, i.e. storing the state of Piccolo does not corrupt any of Piccolo's internal state. This means that Piccolo can be restarted after dumping its state without first restoring it.
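The save, update and restore cycle described above can be modelled in software as a simple round trip. The following Python sketch is a toy model only: the class and method names are hypothetical, little-endian word order is an assumption, and hardware behaviour such as the read-only ID register is not modelled.

```python
import copy
import struct

# Number of 32-bit words in groups 0 to 6, per the state list of this description.
BANK_WORDS = [4, 16, 4, 14, 8, 8, 13]


class PiccoloStateModel:
    """Toy model of the seven state groups (illustrative, not the hardware)."""

    def __init__(self):
        self.banks = [[0] * n for n in BANK_WORDS]

    def store_all(self, memory, addr):
        """Copy every group to memory without altering the model state."""
        for words in self.banks:
            struct.pack_into(f"<{len(words)}I", memory, addr, *words)
            addr += 4 * len(words)
        return addr

    def load_all(self, memory, addr):
        """Overwrite every group with the image found in memory."""
        for i, n in enumerate(BANK_WORDS):
            self.banks[i] = list(struct.unpack_from(f"<{n}I", memory, addr))
            addr += 4 * n
        return addr


model = PiccoloStateModel()
model.banks[1][3] = 0x1234             # some existing GPR content
before = copy.deepcopy(model.banks)

memory = bytearray(4 * sum(BANK_WORDS))
model.store_all(memory, 0)             # debugger dumps the state
unchanged = model.banks == before      # storing is non-destructive

# Debugger patches general-purpose register 5 in the memory image
# (group 1 starts 16 bytes into the image), then restores the state.
struct.pack_into("<I", memory, 16 + 4 * 5, 0xDEADBEEF)
model.load_all(memory, 0)
print(unchanged, hex(model.banks[1][5]))
```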

The mechanism for finding the state of the Piccolo cache is yet to be determined.

Hardware debug - Hardware debug is facilitated by a scan chain on Piccolo's coprocessor interface. Piccolo can then be placed in state access mode and its state examined or modified through the scan chain.

The Piccolo status register contains a single bit indicating that a breakpointed instruction has been executed. When a breakpointed instruction is executed, Piccolo sets the B bit in the status register and stops execution. To interrogate Piccolo, the debugger must enable Piccolo and place it in state access mode by writing to its control register before subsequent accesses can occur.

Fig. 4 illustrates a multiplexer arrangement that responds to the Hi/Lo bit and the Size bit to switch the appropriate half of the selected register onto the Piccolo datapath. If the Size bit indicates 16 bits, a sign-extension circuit pads the high-order bits of the datapath with 0s or 1s as appropriate.
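The behaviour of the arrangement of Fig. 4 can be sketched in software as follows. This is an illustrative model only: a 32-bit datapath is assumed, and the function and parameter names are hypothetical.

```python
def select_operand(reg32, size16, hi):
    """Model of the half-select multiplexer with sign extension.

    reg32  : 32-bit register value
    size16 : True if the Size bit indicates a 16-bit operand
    hi     : True to select the upper half, False for the lower half
    """
    reg32 &= 0xFFFFFFFF
    if not size16:
        return reg32                      # full 32-bit operand, no padding
    half = (reg32 >> 16) & 0xFFFF if hi else reg32 & 0xFFFF
    if half & 0x8000:                     # negative: pad upper bits with 1s
        return half | 0xFFFF0000
    return half                           # non-negative: pad upper bits with 0s


print(hex(select_operand(0x80017FFF, size16=True, hi=False)))
print(hex(select_operand(0x80017FFF, size16=True, hi=True)))
```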

Claims

1. A method of performing digital signal processing, using a digital signal processing device, upon signal data words stored in a data storage device (8), said method comprising the steps of:

generating, with a microprocessor unit (2) operating under the control of microprocessor unit program instruction words, address words for addressing storage locations storing said signal data words within said data storage device;

supplying, under the control of said microprocessor unit, said signal data words to a digital signal processing unit (4) operating under the control of digital signal processing unit program instruction words;

performing, with said digital signal processing unit operating under the control of digital signal processing unit program instruction words, arithmetic logic operations including at least one of a convolution operation, a correlation operation and a transform operation upon said signal data words to generate result data words; and

2. according to the method for claim 1, also be included under the control of described microprocessor, the data word that described microprocessor is produced offers the digital signal processing unit of operating under the control of digital signal processing unit programmed instruction word.

3. according to the method for claim 1, further comprising the steps of:

Under the control of described microprocessor unit, be created in the address word of the storage unit of the described result data word of addressable storage in the described data storage device;

4. The method according to any one of claims 1, 2 or 3, wherein said signal data words represent at least one input analog signal.

5. The method according to claim 4, wherein said at least one input analog signal is a continuously varying real-time input signal.

6. The method according to any one of claims 1, 2 or 3, wherein said result data words represent at least one output analog signal.

7. The method according to claim 6, wherein said at least one output signal is a continuously varying real-time output signal.

8. A device for performing digital signal processing upon signal data words stored in a data storage device, said device comprising:

a microprocessor unit operating under the control of microprocessor unit program instruction words to generate address words for addressing storage locations within said data storage device and to control the transfer of said signal data words between said device for performing digital signal processing and said data storage device; and

9. The device according to claim 8, wherein said result data words are written to said data storage device by said microprocessor unit.

10. The device according to claim 8 or 9, wherein said signal data words represent at least one input analog signal.

11. The device according to claim 10, wherein said at least one input analog signal is a continuously varying real-time input signal.

12. The device according to claim 8 or 9, wherein said result data words represent at least one output analog signal.

13. The device according to claim 12, wherein said at least one output signal is a continuously varying real-time output signal.

14. The device according to claim 8, wherein said microprocessor unit is responsive to a multiple-supply instruction word to supply a plurality of sequentially addressed signal data words to said digital signal processing unit.

15. The device according to claim 8, wherein said digital signal processing unit includes a multi-word input buffer (12).

16. The device according to claim 8, wherein said microprocessor unit is responsive to a multiple-fetch instruction word to fetch a plurality of sequentially addressed result data words from said digital signal processing unit.

17. The device according to claim 8, wherein said digital signal processing unit includes a multi-word output buffer (18).

18. The device according to claim 8, wherein a multiplexed data and instruction bus links said data storage device and said digital signal processing device for transferring said signal data words, said microprocessor unit program instruction words and said digital signal processing unit program instruction words to said digital signal processing device.

19. The device according to claim 8, wherein said digital signal processing unit includes a bank of digital signal processing unit registers (10) for holding data words upon which arithmetic logic operations are to be performed, said digital signal processing unit program instruction words including register-specifying fields.

20. The device according to claim 15, wherein said input buffer stores, for each data word stored in said input buffer, destination data identifying a destination digital signal processing unit register.

21. The device according to claim 20, wherein a digital signal processing unit program instruction word that reads a digital signal processing unit register includes a flag indicating that a data word stored in said digital signal processing unit register may be replaced by a data word stored in said input buffer having matching destination data.

22. The device according to claim 21, wherein, if said input buffer contains a plurality of data words having matching destination data, said digital signal processing unit register is refilled with the data word having matching destination data that was first stored in said input buffer.

23. The device according to claim 14 or 20, wherein a multiple-supply instruction word specifies destination data for a first data word, said destination data being incremented for each subsequent data word stored in said input buffer as a result of said multiple-supply instruction.

24. The device according to claim 23, wherein said multiple-supply instruction word also specifies a limit destination data value, said destination data being incremented for each subsequent data word until said limit destination value is reached, whereupon said destination data is reset to the destination data of said first data word before being incremented further.

25. The device according to claim 8, wherein said microprocessor unit is stalled if said digital signal processing unit cannot accept a data word supplied by said microprocessor unit.

26. The device according to claim 8, wherein said microprocessor unit is stalled if said digital signal processing unit cannot supply a data word to be fetched by said microprocessor unit.

27. The device according to claim 15, wherein said digital signal processing unit is stalled if said digital signal processing unit attempts to read a data word that is not present in said input buffer.

28. The device according to claim 17, wherein said digital signal processing unit is stalled if said digital signal processing unit attempts to write a data word to said output buffer while said output buffer is full.

29. The device according to claim 27 or 28, wherein said digital signal processing unit enters a power-saving mode if said digital signal processing unit is stalled.

30. The device according to any one of claims 8 to 11 and 14 to 22, comprising a digital signal processing unit cache for storing digital signal processing unit instruction words.

31. The device according to claim 30, wherein digital signal processing unit instructions can be prefetched into said digital signal processing unit cache in response to a prefetch instruction.

32. The device according to claim 20, wherein said digital signal processing unit is responsive to an instruction that performs at least one of the following:

(i) marking as empty; and

(ii) outputting the contents of a plurality of registers of said digital signal processing unit.

33. The device according to claim 8, wherein said microprocessor and said digital signal processing unit are formed as a single integrated circuit.

34. The device according to claim 19, wherein said plurality of digital signal processing unit registers includes at least one X-bit data register and at least one Y-bit accumulator register, where Y is greater than X.