CN100351782C - Method and apparatus for jump delay slot control in pipelined processor

Info

Publication number: CN100351782C
Application number: CNB2005100535515A
Authority: CN
Inventors: P. Warnes; C. Graham
Original assignee: ARC INTERNAT U S HOLDINGS Inc
Current assignee: Synopsys Inc
Priority date: 1999-05-13
Filing date: 2000-05-12
Publication date: 2007-11-28
Anticipated expiration: 2020-05-12
Also published as: EP1194835A2; CN1384934A; CN1661547A; WO2000070446A2; WO2000070446A3; CN1198208C; AU4848100A; TW482978B

Abstract

A method of managing the configuration, design parameters, and functionality of an integrated circuit (IC) design using a hardware description language (HDL). Instructions can be added, subtracted, or generated by the designer interactively during the design process, and customized HDL descriptions of the IC design are generated through the use of scripts based on the user-edited instruction set and inputs. The customized HDL description can then be used as the basis for generating 'makefiles' for purposes of simulation and/or logic level synthesis. The method further affords the ability to generate an HDL model of a complete device, such as a microprocessor or DSP. A computer program implementing the aforementioned method and a hardware system for running the computer program are also disclosed.

Description

Method and apparatus for jump delay slot control in a pipelined processor

The present application is a divisional application. Its parent application has application number 00808462.9 and a filing date of May 12, 2000. The parent application is a PCT application that entered the national phase; the PCT application number is PCT/US00/13198 and the international filing date is May 12, 2000.

Technical field

The present invention relates to the design of digital processors, and in particular to the design of customized processors.

Background art

RISC (reduced instruction set computer) processors are well known in the computing arts. Compared with non-RISC (commonly called "CISC") processors, RISC processors generally have the fundamental characteristic of using a substantially reduced instruction set. Typically, RISC processor machine instructions are not all microcoded; they can be executed immediately without decoding, which affords significant economies in processing speed. This "streamlined" instruction handling capability also allows greater simplicity in the design of the processor (as compared with non-RISC devices), and thereby allows smaller silicon area and lower fabrication cost.

In addition, RISC processors are typically characterized by some or all of the following attributes: (i) a load/store memory architecture (that is, only load and store instructions access memory; other instructions operate on internal registers within the processor); (ii) single-cycle execution of most instructions; (iii) a fixed-length, easily decoded instruction format; (iv) unity of processor and compiler, together with a compiler that is simple and easier to write; (v) hardwired control; (vi) fewer addressing modes; (vii) a relatively static instruction format; and (viii) pipelining.

RISC load/store architecture

As noted above, the load/store architecture of a RISC processor significantly simplifies the operation of the device by restricting memory access to load and store instructions only; all other operations are "register to register". Accordingly, a typical RISC processor also uses a large number of internal registers to handle such operations. The following illustrates a simple load/store sequence that supports an addition operation:

Operation: e = a + b

Instructions:

Load r3, a ; load the value from source location a into register 3

Load r4, b ; load the value from source location b into register 4

Add r5, r3, r4 ; add registers 3 and 4 and store the result in register 5

Store e, r5 ; store the contents of register 5 at destination location e

As shown in the example above, prior art RISC processors typically use a temporary register (for example, r5) to hold the data destined for memory during the load/store sequence. Because most RISC processors rely on such a load/store mechanism to access and modify memory values, efficiency suffers when only a simple memory store is desired.
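By way of illustration only, the four-instruction sequence above can be traced with a small simulation. The following is a minimal sketch, not part of the disclosed embodiments; the function name `run`, the tuple-based instruction form, and the sample memory contents are assumptions introduced for the example.

```python
# Minimal load/store machine: only "load" and "store" touch memory;
# "add" works register-to-register. All names here are illustrative.

def run(program, memory):
    """Execute a list of (opcode, operands...) tuples and return the registers."""
    regs = {}
    for op, *args in program:
        if op == "load":            # load rd, location
            rd, loc = args
            regs[rd] = memory[loc]
        elif op == "add":           # add rd, rs1, rs2
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "store":         # store location, rs
            loc, rs = args
            memory[loc] = regs[rs]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs


memory = {"a": 2, "b": 3, "e": 0}   # hypothetical memory contents
program = [
    ("load", "r3", "a"),
    ("load", "r4", "b"),
    ("add", "r5", "r3", "r4"),
    ("store", "e", "r5"),
]
registers = run(program, memory)
print(memory["e"])  # 5
```

The sketch makes the cost visible: three registers and four instructions are consumed to move one computed value into memory.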

Addressing modes

An addressing mode is a method of accessing an operand wherever it is located. Generally, operands may reside in memory or in CPU registers, or they may be literal values defined within the code itself. The addressing modes that may be used in a microprocessor include, among others: "implied" addressing, in which the opcode specifies the operand; "immediate" addressing, in which the instruction itself contains the operand; "direct" addressing, in which the operand is a memory address or register designation; "indirect" addressing, in which the operand field gives the address at which the address of the desired operand is found; and "indexed" addressing, in which two or more values are added or otherwise operated on to obtain the address of the operand.
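By way of illustration only, the differences among several of these modes can be sketched as a single operand-fetch routine. This is a minimal sketch under stated assumptions: memory is modeled as a flat Python list, "direct" addressing is shown for memory only (not register designations), "implied" addressing is omitted because the opcode alone fixes the operand, and the function name `fetch_operand` is hypothetical.

```python
def fetch_operand(mode, field, memory, index=0):
    """Return the operand selected by `field` under the given addressing mode."""
    if mode == "immediate":
        return field                    # the instruction carries the operand itself
    if mode == "direct":
        return memory[field]            # the field is the operand's address
    if mode == "indirect":
        return memory[memory[field]]    # the field locates the operand's address
    if mode == "indexed":
        return memory[field + index]    # base and index are added to form the address
    raise ValueError(f"unsupported addressing mode: {mode}")


memory = [0] * 16       # hypothetical 16-word memory
memory[4] = 9
memory[6] = 55
memory[9] = 77

print(fetch_operand("immediate", 4, memory))          # 4
print(fetch_operand("direct", 4, memory))             # 9
print(fetch_operand("indirect", 4, memory))           # 77
print(fetch_operand("indexed", 4, memory, index=2))   # 55
```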

Of the addressing modes listed above, "immediate" addressing is often used in a RISC processor because the operand is contained directly in the instruction. An immediate instruction (commonly denoted by the syntax "imm" or a derivative of it) carries its operand within the instruction itself. Such an instruction typically has as its operand a literal value following a special character such as the "#" symbol. The form of the operand can vary. For example, an instruction may have one of the following operands:

$1234 ; the operand is the literal value $1234

Buffer ; the operand is the literal value attached to "Buffer"

'Y' ; the operand is the ASCII code for upper-case Y

Short immediate and long immediate

When immediate addressing/data is used, small data words (typically less than half the instruction word size) can often be encoded within the parent instruction word that specifies the operation. This approach is commonly called "short immediate" addressing/data; however, it significantly restricts the addresses/operands that can be used within a single instruction word. In contrast, long immediate addressing/data requires more than one instruction word, but removes many of the restrictions associated with the short immediate mode.

Register encoding

The registers in a RISC processor are closely tied to the instruction set, because instructions frequently specify these registers as operands or, alternatively, use them to generate the addresses of operands.

The prior art generally structures instruction encoding schemes so that one or two bits of the instruction word indicate the use of an immediate operand, or so that the use of an immediate operand is implied by an alternative instruction type, as shown below. Immediate data is generally encoded within the instruction word using a fixed set of bits, or using the bits that would otherwise describe a source data register.

Add r0, r1, r2 ; r0 = r1 + r2 ; register-register addition

Addi r0, r1, 10 ; r0 = r1 + 10 ; register-immediate addition

This approach is consistent with the trend of attempting to minimize instruction word and register length, thereby reducing the silicon required to implement the processor design. However, using the instruction opcode to infer that immediate data is held in the instruction reduces the flexibility with which instructions can be encoded. For example, with the prior art encoding approach described above, a single instruction cannot efficiently be used to generate a plurality of long immediate constants. Furthermore, if an operation (such as subtraction) is non-commutative, and immediate data must be allowed as either the minuend or the subtrahend, then two immediate versions of the instruction are needed in addition to the register-register version. Figs. 1a-1c illustrate typical prior art register encoding formats for register-register, register-immediate, and immediate-register instructions.

In general, prior art processors do not allow all combinations of immediate data to be used with all instructions. Typically, no choice is offered between encoding immediate data in part of the instruction word and devoting the entire following instruction word to immediate data, a choice that would allow a wider range of values to be encoded. In a processor with a user-extensible instruction set, this restricts how a programmer can use a new instruction, especially if the function of the new instruction is non-commutative.

Pipelining

Pipelining is a technique for increasing processor performance by dividing the sequence of processor operations into segments that, where possible, are effectively executed in parallel. In a typical pipelined processor, the arithmetic units associated with processor operations (such as addition, multiplication, and division) are usually "segmented", so that a specific portion of the operation is performed in a given segment of the unit during any clock cycle. In any given clock cycle, therefore, these units can be operating on the results of different calculations. As an example, in a first clock cycle two numbers A and B are fed to the multiplier unit 10 and partially processed by the first segment 12 of the unit. In a second clock cycle, the partial result of multiplying A and B is passed to the second segment 14, while the first segment 12 receives two new numbers (for example, C and D) to begin processing. The net result is that, after an initial startup period, one multiplication is completed by the arithmetic unit 10 every clock cycle.
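By way of illustration only, the two-segment multiplier described above can be modeled cycle by cycle. This is a minimal sketch: the function name `pipelined_multiply` and the operand values are assumptions, and the split of the multiplication into partial steps is abstracted away (the product is simply computed when an operand pair leaves the second segment).

```python
def pipelined_multiply(pairs):
    """Feed operand pairs into a two-segment multiplier, one pair per clock.

    Returns (clock, product) tuples in the order the products complete.
    """
    pending = list(pairs)
    seg1 = None            # operand pair held in the first segment
    results = []
    clock = 0
    while pending or seg1 is not None:
        clock += 1
        seg2 = seg1                                   # partial result moves to segment 2
        seg1 = pending.pop(0) if pending else None    # new operands enter segment 1
        if seg2 is not None:
            a, b = seg2
            results.append((clock, a * b))            # product completes this clock
    return results


results = pipelined_multiply([(2, 3), (4, 5), (6, 7)])
print(results)  # [(2, 6), (3, 20), (4, 42)]
```

The first product appears after a two-clock startup; thereafter one product completes on every clock, which is the behavior the paragraph describes.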

The depth of the pipeline may vary from one architecture to another. In the present context, the term "depth" refers to the number of discrete stages present in the pipeline. In general, a pipeline with more stages executes programs faster, but is also more difficult to program if the effects of the pipeline are visible to the programmer. Most pipelined processors have either three stages (instruction fetch, decode, and execute) or four stages (such as instruction fetch, decode, operand fetch, and execute, or alternatively instruction fetch, decode/operand fetch, execute, and writeback), although more or fewer stages may be used.

When developing the instruction set of a pipelined processor, several different types of "hazards" must be considered. For example, so-called "structural" or "resource contention" hazards arise when overlapping instructions compete for the same resource (such as a bus, a register, or another functional unit); these are typically resolved using one or more pipeline stalls. So-called "data" pipeline hazards occur in the case of read/write conflicts that may change the order of memory or register accesses. "Control" hazards generally result from branches or similar changes in program flow.

Pipelined architectures generally require interlocks to address these hazards. For example, consider the case where a following instruction (n+1) in an earlier pipeline stage needs the result of instruction n from a later stage. A simple solution is to delay the operand calculation in the instruction decode stage by one or more clock cycles. A consequence of such a delay, however, is that the execution time of a given instruction on the processor is partly determined by the instructions surrounding it in the pipeline. This complicates code optimization for the processor, since it is often difficult for the programmer to spot interlock situations within the code.

A "scoreboard" may be used within the processor to implement interlocks. In this approach, a bit is attached to each processor register to act as an indicator of the register's contents; specifically, whether (i) the contents of the register have been updated and are therefore ready for use, or (ii) the contents are undergoing modification, such as being written by another process. In some architectures the scoreboard is also used to generate interlocks that prevent the execution of instructions depending on the contents of a scoreboarded register until the scoreboard indicates that the register is ready. This type of approach is referred to as a "hardware" interlock, because the interlock is invoked purely by hardware within the processor, through inspection of the scoreboard. Such interlocks generate "stalls" that prevent data-dependent instructions from executing (thereby stalling the pipeline) until the register is ready.

Alternatively, NOPs (no-operation opcodes) may be inserted into the code to delay the appropriate pipeline stage when desired. This latter approach is referred to as a "software" interlock. It has the disadvantages of increasing code size and of increasing the complexity of programs that use instructions requiring interlocks. Furthermore, designs that rely heavily on software interlocks tend not to be fully optimized with respect to their code structure.
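By way of illustration only, the hardware interlock described above can be sketched as a per-register busy bit that is consulted before an instruction is allowed to proceed. This is a minimal sketch; the class name `Scoreboard`, its method names, and the register count of eight are assumptions made for the example and are not taken from any particular processor.

```python
class Scoreboard:
    """One bit per register: True while a write to that register is outstanding."""

    def __init__(self, num_registers):
        self.busy = [False] * num_registers

    def begin_write(self, reg):
        self.busy[reg] = True       # contents are being modified

    def complete_write(self, reg):
        self.busy[reg] = False      # contents are updated and ready for use

    def must_stall(self, source_regs):
        """An instruction stalls if any register it reads is still busy."""
        return any(self.busy[r] for r in source_regs)


scoreboard = Scoreboard(8)
scoreboard.begin_write(5)                           # instruction n will write r5
stalled_before = scoreboard.must_stall([3, 5])      # instruction n+1 reads r3 and r5
scoreboard.complete_write(5)
stalled_after = scoreboard.must_stall([3, 5])
print(stalled_before, stalled_after)  # True False
```

A software interlock would reach the same ordering by placing NOPs between instruction n and instruction n+1 instead of checking the bit in hardware.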

Branch and jump instructions

Another important consideration in processor design is program branching, or "jumps". All processors support some type of branch instruction. Simply stated, branching refers to the condition in which program flow is interrupted or altered. Other operations, such as loop setup and subroutine call instructions, also interrupt or alter program flow in a similar manner. The term "jump delay slot" generally refers to the slot within a pipeline that follows a branch or jump instruction being decoded. Instructions after the branch (or load) are executed while completion of the branch/load instruction is awaited. Branches may be conditional (that is, based on the truth or value of one or more parameters) or unconditional. They may also be absolute (for example, based on an absolute memory address) or relative (for example, based on a relative address and independent of any particular memory address).

Branching can have a profound effect on pipelined systems. By the time a branch instruction has been inserted into the pipeline and decoded by the processor's instruction decode stage (indicating that the processor must begin executing from a different address), the next instruction word in the instruction sequence has already been fetched and inserted into the pipeline. One solution to this problem is to purge the fetched instruction word and halt or stall further fetch operations until the branch instruction has completed. However, this approach means that the branch instruction takes several instruction cycles to execute, the number of cycles typically being equal to the depth of the pipeline used in the processor design. This result is detrimental to processor speed and efficiency, since the processor cannot perform other operations during this period.

Alternatively, a delayed branch approach may be used. In this approach, the pipeline is not purged when a branch instruction reaches the decode stage; instead, the subsequent instructions already present in the earlier stages of the pipeline are executed normally before the branch is executed. The branch therefore appears to be delayed by the number of instruction cycles needed to execute all of the subsequent instructions that are in the pipeline when the branch instruction is decoded. Compared with the multi-cycle branch described above, this approach increases the efficiency of the pipeline, but it also increases the complexity of the underlying code and makes that code harder for the programmer to understand.
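By way of illustration only, the difference between purging the pipeline and using a delayed branch can be seen in the order in which instructions take effect. This is a minimal sketch under stated assumptions: the function name `execute` and the instruction tuples are hypothetical, every branch is treated as taken, a branch appearing inside a delay slot is treated as an ordinary instruction, and the stall cycles incurred by purging are not modeled.

```python
def execute(program, delay_slots):
    """Return the names of the instructions executed, in order.

    program: list of (name, target) tuples; target is the index of the branch
             destination for a taken branch, or None for an ordinary instruction.
    delay_slots: number of instructions after a branch that still execute
                 before control transfers (0 models a purged pipeline).
    """
    executed = []
    pc = 0
    pending_target = None   # branch destination waiting to take effect
    remaining = 0           # delay-slot instructions still to execute
    while pc < len(program):
        name, target = program[pc]
        executed.append(name)
        pc += 1
        if pending_target is not None:
            remaining -= 1
            if remaining == 0:          # delay slots exhausted: branch takes effect
                pc = pending_target
                pending_target = None
        elif target is not None:
            if delay_slots == 0:
                pc = target             # purged pipeline: transfer immediately
            else:
                pending_target = target
                remaining = delay_slots
    return executed


program = [("i0", None), ("br", 4), ("i2", None), ("i3", None), ("i4", None)]
purged = execute(program, delay_slots=0)
delayed = execute(program, delay_slots=1)
print(purged)   # ['i0', 'br', 'i4']
print(delayed)  # ['i0', 'br', 'i2', 'i4']
```

With one delay slot, the instruction `i2` that follows the branch still executes, so the programmer or compiler must place a useful (or at least harmless) instruction there.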

Based on the foregoing, there is a need for an improved method of register encoding within a pipelined and interlocked RISC processor. Such an improved method would give the programmer/designer greater flexibility in encoding registers within the processor, and would overcome some of the disadvantages associated with the load/store architecture (for example, the need to use a temporary register to hold immediate values), thereby optimizing the instruction set and processor performance. Furthermore, the programmer would be able to infer the use of short immediate data (held within the instruction word) or long immediate data (in the following instruction word) in any source field of any processor instruction word.

Ideally, this improved method would also be compatible with other processor design considerations, including in particular interlocks and branch control mechanisms. In addition, such an improved pipelined processor design could be readily synthesized for a specific application using available synthesis tools, in a manner that is effective and practical for both the designer and the programmer.

Summary of the invention

This application claims priority to U.S. Provisional Patent Application No. 60/134,253, filed May 13, 1999, entitled "Method And Apparatus For Synthesizing And Implementing Integrated Circuit Designs", and to co-pending U.S. Patent Application No. 09/418,663, filed October 14, 1999 (since granted as U.S. Patent No. 6,862,563), entitled "Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design", which in turn claims priority to U.S. Provisional Patent Application No. 60/104,271 of the same title, filed October 14, 1998.

The present invention satisfies the aforementioned needs by providing an improved method and apparatus for encoding registers and executing instructions within a pipelined processor architecture.

In one aspect of the invention, a method of "loosely" encoding register numbers to indicate the use of immediate operands is disclosed. In one embodiment, a plurality of expanded (for example, six-bit) register fields are used within the instruction word of the processor, thereby providing enhanced flexibility in the instruction and operand formats that can be used. In addition, the method makes it possible to store an immediate value directly to memory without using an intermediate register. The use of short immediate data (held within the instruction word) or long immediate data (in the following instruction word) can also be inferred in any source field of any processor instruction. Furthermore, non-commutative operations can be handled more efficiently with this method.

In a second aspect of the invention, an improved method of synthesizing an integrated circuit design incorporating the aforementioned jump delay slot method is disclosed. In one exemplary embodiment, the method comprises: obtaining user input regarding the design configuration; creating customized hardware description language (HDL) functional blocks based on the user's input and existing libraries of functions; determining the design hierarchy based on the user's input and the libraries, and generating a hierarchy file, a new library file, and a makefile; running the makefile to create the structural HDL and scripts; running the generated scripts to create a makefile for the simulator and a synthesis script; and synthesizing the design based on the generated design and the synthesis script.

In a third aspect of the invention, an improved computer program for synthesizing processor designs and embodying the aforementioned method is disclosed. In one embodiment, the computer program comprises an object code representation stored on the magnetic storage device of a microcomputer and adapted to run on the central processing unit of that microcomputer. The computer program further comprises an interactive, menu-driven graphical user interface (GUI) for ease of use.

In a fourth aspect of the invention, gate logic is disclosed that implements the aforementioned "loose" register encoding and functionality and that is synthesized using the aforementioned method of synthesizing processor designs. In one embodiment, the gate logic for selecting the first source field within the register comprises a series of eight 4-bit multiplexers.

In a fifth aspect of the invention, an improved processor architecture utilizing the aforementioned "loose" register encoding method is disclosed. In one embodiment, the processor comprises a reduced instruction set computer (RISC) having a multi-stage pipeline that uses the "loose" register architecture, in which immediate values can be stored directly to memory without using a temporary register. In another embodiment, the processor comprises a processor chip, a DSP chip, a memory having a plurality of memory banks, and a memory interface that interfaces the memory banks of the memory in parallel with the DSP functions.

In a sixth aspect of the invention, an improved apparatus for running the aforementioned computer program, which is used to synthesize logic associated with pipelined processors, is disclosed. In one exemplary embodiment, the system comprises a stand-alone microcomputer system having a display, a central processing unit, a data storage device, and an input device.

Description of drawings

Figs. 1a-1c illustrate typical prior art register encoding schemes for a RISC processor.

Fig. 2 is a logical flow diagram illustrating the general method of locating data in a pipelined processor using "loose" register encoding according to the present invention.

Figs. 3a-3c illustrate the register encoding structure of a first embodiment of the invention.

Fig. 4 is a logical flow diagram illustrating the general method of synthesizing processor logic that incorporates "loose" register encoding according to the present invention.

Fig. 5 is a schematic diagram illustrating one embodiment of synthesized logic used to select the data source for the first field of the instruction word of Fig. 3.

Fig. 6 is a schematic diagram illustrating a first embodiment of synthesized logic (unconstrained) implementing the 4-bit multiplexers of the data source selection logic of Fig. 5.

Fig. 7 is a schematic diagram illustrating a second embodiment of synthesized logic (constrained) implementing the 4-bit multiplexers of the data source selection logic of Fig. 5.

Fig. 8 is a schematic diagram illustrating a first embodiment of synthesized logic (unconstrained) implementing the flag-setting functionality of the invention.

Fig. 9 is a schematic diagram illustrating a second embodiment of synthesized logic (constrained) implementing the flag-setting functionality of the invention.

Fig. 10 is a block diagram of a processor design incorporating "loose" register encoding according to the present invention.

Fig. 11 is a functional block diagram of a computing device incorporating the hardware description language of the present invention, used to synthesize the logic of Figs. 5-9.

Detailed description

Reference is now made to the drawings, in which like numerals refer to like parts throughout.

As used herein, the term "processor" includes any integrated circuit or other electronic device capable of performing an operation on at least one instruction word, including, without limitation, reduced instruction set (RISC) processors such as the ARC user-configurable core produced by the assignee of this patent, central processing units (CPUs), and digital signal processors (DSPs).

In addition, as those of ordinary skill in the art will recognize, the term "stage" as used herein refers to the successive stages within a pipelined processor; that is, stage 1 corresponds to the first pipeline stage, stage 2 to the second pipeline stage, and so on. Although the following discussion focuses on a three-stage pipeline (instruction fetch, decode, and execute), it will be appreciated that the method and apparatus disclosed herein are broadly applicable to processor architectures having one or more pipelines with more or fewer than three stages.

It should also be noted that although the following description is cast in terms of the VHSIC hardware description language (VHDL), other hardware description languages such as Verilog may be used to describe the various embodiments of the invention with equal success. Furthermore, although a Synopsys synthesis engine such as Design Compiler 1999.05 (DC99) is used as the example for synthesizing the various embodiments described herein, other synthesis engines may also be used, such as Buildgates, available from Cadence Design Systems, Inc., among others. IEEE Std. 1076.3-1997, IEEE Standard VHDL Synthesis Packages, describes an industry-accepted language for specifying a hardware description language-based design, together with the synthesis capabilities that may be expected to be available to one of ordinary skill in the art.

Finally, it will be recognized that although the following description presents specific embodiments of logic synthesized by the assignee using the aforementioned synthesis engine and VHSIC hardware description language, with those specific embodiments being constrained in different ways, these embodiments serve only as examples of the design process of the invention.

The method and apparatus for loose register encoding according to the present invention are now described. In general, the invention uses an expanded multi-bit register field to indicate the use of immediate operands. In brief, the invention uses register numbers within the processor to denote short immediate ("shimm") and long immediate ("limm") operands. The method is termed "loose" because it effectively expands, or loosens, the number of bits normally required to represent the information. For example, one embodiment of the instruction word of the processor core of the invention uses a 6-bit register field to denote both the register and the immediate operand usage (for example, shimm/limm). By contrast, a typical prior art instruction word uses only 1 or 2 bits to represent this information, or uses the opcode of the instruction to indicate the presence of immediate data. The method is therefore somewhat counterintuitive, since it uses more than the minimum bit capacity required to represent the information.

However, the "loose" register encoding architecture of the invention affords many benefits to RISC-based processors (such as the aforementioned ARC core), including: (i) an overall increase in programming flexibility; (ii) the ability to store an immediate value directly to memory without using a temporary register; (iii) the ability to use short or long immediate data in either the first source field ("source 1") or the second source field ("source 2"), which is useful for instructions that are non-commutative; (iv) the ability to indicate that the result of an instruction is to be discarded, by using an "immediate data" register in the destination field of the instruction, which allows the programmer to compare two values and set the flags on the result without changing any general-purpose register in the processor; and (v) the ability to use both short and long immediate data as source data in an instruction. This last capability proves useful in the design and operation of special instructions where extension instructions may be added to a processor. Because most RISC processors rely on a load/store mechanism to access and modify memory values (that is, only load and store operations can access the memory space), efficiency suffers when only a simple memory store is desired. By using the loose encoding scheme of the invention, program memory usage can be optimized while the simplicity inherent in the RISC architecture is retained.

Referring now to Fig. 2, one embodiment of the general method of locating data in a "loosely" encoded register according to the invention is described. The first step 202 of the method 200 comprises determining whether the register number of interest in the current instruction specifies a general-purpose register (for example, r0-r31 in the embodiment of Table 1 below). If the register number does specify a general-purpose register, the data is selected from the specified register per step 204, and the process 200 for that register number is complete. If a general-purpose register is not specified, the register number is examined per step 206 to determine whether it specifies an immediate data value. If an immediate data value is specified, the type of immediate data value, namely short immediate (shimm) or long immediate (limm), is determined in step 208. If no immediate data value is specified in step 206, the appropriate data value is obtained from the referenced source in step 210.

If short immediate data is specified in step 208, the data is extracted from the relevant portion of the current instruction word. If long immediate data is specified in step 208, the appropriate data is extracted from the following instruction word.

Table 1 below illustrates a first embodiment of the register and instruction structure of the invention that uses the foregoing method:

Table 1

Register	Immediate operand	Explanation
r0-r31		Register value
r32-r59		Extension registers (special purpose)
r60	loopcnt	Loop count register
r61	shimmf	Use the 9-bit signed short immediate from the instruction word and set the status flags on the result
r62	limm	Use the 32-bit long immediate from the next instruction word
r63	shimm	Use the 9-bit signed short immediate from the instruction word

As shown in Table 1, a total of 64 registers (r0-r63) are specified. The first 32 registers (r0-r31) are general-purpose registers that hold register values. The next 28 registers (r32-r59) are extension registers designated for special uses. The next register (r60) is the loop count register, which forms part of the zero-overhead loop mechanism of the ARC processor and maintains a count of the iterations remaining in a loop structure. The last three registers (r61-r63) are used to indicate immediate data operands (shimmf, limm, and shimm, respectively). Because the bit of the instruction word that normally sets the flags is needed to encode the short immediate data, there are two versions of shimm: one with flag setting (shimmf) and one without flag setting (shimm). Figs. 3a-3c illustrate the foregoing embodiment of the register encoding structure according to the invention.
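By way of illustration only, the lookup of Fig. 2 can be combined with the register assignments of Table 1 in a short routine. This is a minimal sketch, not the synthesized logic of the invention. The function names are hypothetical, and two details are assumptions that the text above does not fix: the 9-bit short immediate is taken from the low 9 bits of the instruction word, and it is interpreted as a two's-complement value.

```python
SHIMMF, LIMM, SHIMM = 61, 62, 63   # immediate-operand register numbers from Table 1


def sign_extend_9(bits):
    """Interpret the low 9 bits of `bits` as a signed two's-complement value."""
    bits &= 0x1FF
    return bits - 0x200 if bits & 0x100 else bits


def locate_operand(reg_num, registers, instruction_word, next_word):
    """Return the data selected by a 6-bit, loosely encoded register number."""
    if not 0 <= reg_num <= 63:
        raise ValueError("register number must fit in a 6-bit field")
    if reg_num <= 31:                       # steps 202/204: general-purpose register
        return registers[reg_num]
    if reg_num in (SHIMMF, SHIMM):          # steps 206/208: short immediate
        # ASSUMPTION: the short immediate occupies the low 9 bits of the word.
        return sign_extend_9(instruction_word)
    if reg_num == LIMM:                     # steps 206/208: long immediate
        return next_word                    # taken from the following instruction word
    return registers[reg_num]               # step 210: r32-r60, the referenced source


registers = list(range(100, 164))           # hypothetical contents: r0=100 ... r63=163
print(locate_operand(3, registers, 0, 0))               # 103
print(locate_operand(SHIMM, registers, 0x1FF, 0))       # -1
print(locate_operand(LIMM, registers, 0, 0x12345678))   # 305419896
```

Note that the same routine serves either source field, which is how the scheme permits an immediate value in source 1 or source 2 without a separate opcode.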

The method described above gives the programmer/designer full flexibility in specifying a variety of instruction formats, including the following eight example formats:

Table 2

Format number	Syntax
1.	op.<cc>.<f> a, b, c
2.	op.<cc>.<f> a, b, l
3.	op.<cc>.<f> a, l, c
4.	op.<cc>.<f> a, l, l
5.	op.<cc>.<f> a, b, c
6.	op.<cc>.<f> a, b, s
7.	op.<cc>.<f> a, s, c
8.	op.<cc>.<f> a, s, s

where:

op = instruction operation

<cc> = optional condition code for execution

<f> = optional setting of the status flags

a = destination register

b = source 1 register

c = source 2 register

s = shimm (9-bit signed short immediate)

l = limm (32-bit long immediate)

It will be appreciated that the eight instruction formats of Table 2 are given only as examples; other formats may be used depending on the particular application. For example, an instruction format with fewer or more registers than the 64 of the example above may be used. The invention may also be embodied in an instruction format having only two source operands, or one source and one destination operand. Furthermore, the instruction formats of the invention may be implemented with a syntax different from that illustrated above; for example, the order of the source and destination fields may be permuted or re-sequenced.

Table 3 gives a second embodiment of the instruction formats according to the invention, for use in conjunction with the applicant's "ARC" RISC chip:

Table 3

Format number	Syntax	Explanation
9.	op b, c	Two source fields; the destination is implied
10.	op b, s	One source field, one shimm
11.	op b, l	One source field, one limm
12.	op s, c	One shimm, one source field
13.	op l, c	One limm, one source field
14.	op s, l	shimm, limm
15.	op l, s	limm, shimm
16.	op s, s	shimm, shimm
17.	op l, l	limm, limm
18.	op a, b	One destination field, one source
19.	op a, s	One destination, shimm
20.	op a, l	One destination, limm

Note that in the second embodiment of Table 3, only two fields are specified (other than the instruction operation "op"). Neither a condition field nor a flag-setting field is specified, but it will be apparent that such condition and/or flag-setting fields can be used with these formats.

The following two formats of Table 2 deserve particular attention:

4. op.<cc>.<f> a, l, l

8. op.<cc>.<f> a, s, s

These two formats are particularly useful for providing a MOV (move data) immediate instruction by means of the AND operation. In the ARC processor, using the short immediate register encoding extracts the short immediate value from the instruction word. If the short immediate register encoding is used in both source fields, both source fields take the value of the short immediate field; two different short immediate values cannot be encoded. Likewise, with the long immediate register encoding, the data in the following instruction word can be used for one or both source fields, but two different long immediate values cannot be used. In the present invention, however, it is possible to have one short and one long immediate value, which has the advantage that an immediate value can be stored to an immediate memory location.

Thus the instruction AND a, l, l moves the contents of the following long immediate word into destination register "a". The operation performs a logical AND of the value with itself, which yields the original value.
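The identity that makes this work (x AND x = x) can be checked with a minimal behavioral sketch. The function name `and_limm_limm` and the 32-bit register width are assumptions made for illustration; they are not taken from the patent.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit data path, matching the 32-bit limm


def and_limm_limm(limm: int) -> int:
    """Model 'AND a, l, l': both source fields read the same long immediate."""
    source1 = limm & MASK32
    source2 = limm & MASK32   # same word: only one limm follows the instruction
    return source1 & source2  # x AND x == x, so this behaves as a move
```

Because both source fields necessarily resolve to the same word, the result written to the destination is always the immediate itself.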

In addition, the two formats above (4 and 8) can be used with shift instructions, so that a single instruction word generates further long immediate constants, as in the following example:

ASL a, s, s ; a = s << (s & 31)

(the shift instruction uses only the lowest 5 bits of the immediate value)

(the short immediate data is 9 bits long)

In the example above, the lowest 5 bits of the source short immediate value are used to shift the entire 9-bit short immediate value. A wider range of immediate values can thus be placed in a register with a single instruction word than with the MOV (AND) just described, which uses the 9-bit short immediate data without a shift.
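A short behavioral sketch of this shift follows. It assumes a 32-bit destination register and two's-complement sign extension of the 9-bit field; the helper names `sign_extend_9` and `asl_shimm_shimm` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit destination register


def sign_extend_9(field: int) -> int:
    """Interpret the low 9 bits of `field` as a signed value (-256..255)."""
    field &= 0x1FF
    return field - 0x200 if field & 0x100 else field


def asl_shimm_shimm(field: int) -> int:
    """Model 'ASL a, s, s': a = s << (s & 31), truncated to 32 bits."""
    s = sign_extend_9(field)
    shift = s & 31                      # only the lowest 5 bits set the shift
    return ((s & MASK32) << shift) & MASK32
```

With the field set to 16, for instance, the register receives 16 << 16 = 0x100000, a constant far outside the 9-bit short immediate range, from a single instruction word.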

As discussed previously, the "loose" architecture of the present invention can also be used to store immediate values directly to memory without the intermediate register required in prior-art RISC devices, as in the following examples:

ST s, [b, s] ; [b+s] = s (the shimms must match)

ST l, [b, s] ; [b+s] = l (where "l" specifies long immediate data)

ST s, [s, s] ; [s+s] = s (the shimms must match)
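The three store forms above can be modeled as follows. This is a simplified sketch under stated assumptions: memory is a Python dictionary, registers are a 64-entry list, and the function names `read_operand` and `store` are hypothetical. It shows why the shimms "must match": every field that uses the short immediate encoding reads the one short immediate carried by the instruction word.

```python
MASK32 = 0xFFFFFFFF
SHIMMF, LIMM, SHIMM = 61, 62, 63  # immediate encodings from Table 1


def read_operand(field, regs, shimm, limm):
    """Resolve a register field to a value, honouring the immediate encodings."""
    if field in (SHIMMF, SHIMM):
        return shimm & MASK32       # single short immediate per instruction
    if field == LIMM:
        return limm & MASK32        # long immediate from the following word
    return regs[field] & MASK32


def store(data_field, base_field, offset_field, regs, memory, shimm=0, limm=0):
    """Model 'ST data, [base, offset]': write the data operand to base + offset."""
    value = read_operand(data_field, regs, shimm, limm)
    base = read_operand(base_field, regs, shimm, limm)
    offset = read_operand(offset_field, regs, shimm, limm)
    address = (base + offset) & MASK32
    memory[address] = value         # no intermediate register is involved
    return address
```

Calling `store(LIMM, 2, SHIMM, regs, memory, shimm=8, limm=0xDEADBEEF)` corresponds to the second form: the long immediate is written to the address formed from r2 plus the short immediate.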

In addition, when register r63 (Table 1) is used as the destination, the register write-back of the result is discarded. This is useful where only the status flags are required as a result (for example, test/compare operations), and applies irrespective of any MOV instruction. The assembler syntax for this feature uses an immediate value of "0" as the destination of the instruction, as follows:

op.<cc>.<f> 0, b, c

op.<cc>.<f> 0, b, l

op.<cc>.<f> 0, l, c

op.<cc>.<f> 0, l, l

op.<cc>.<f> 0, b, c

op.<cc>.<f> 0, b, s

op.<cc>.<f> 0, s, c

op.<cc>.<f> 0, s, s
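The effect of a discarded write-back can be illustrated with a subtract used as a compare. This is a sketch only: the flag names `zero` and `negative` and the function `sub_with_flags` are hypothetical, and only the r63 destination case stated in the text is modeled.

```python
MASK32 = 0xFFFFFFFF
DISCARD_DEST = 63  # r63 as destination: the result is not written back


def sub_with_flags(dest, a, b, regs):
    """Compute a - b, update the flags, and write back unless dest is r63."""
    result = (a - b) & MASK32
    flags = {"zero": result == 0, "negative": bool(result & 0x80000000)}
    if dest != DISCARD_DEST:
        regs[dest] = result     # normal register write-back
    return flags                # the flags are produced in either case
```

Used with `dest=63`, the function behaves like a compare: the register list is left untouched and only the returned flags carry the outcome.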

In the present embodiment, a file containing multiplexers is included; these multiplexers select how the chosen data is placed on the source 1 and source 2 buses. These buses are used, among other things, as inputs to the arithmetic logic unit (ALU) in stage 3 of the pipeline, as in the following example:

-- Stage 2 result multiplexer
-- Source 1 field
with s1a select
    s1_direct <= qd_a when
            r0|r1|r2|r3|r4|r5|r6|r7|
            r8|r9|r10|r11|r12|r13|r14|r15|
            r16|r17|r18|r19|r20|r21|r22|r23|
            r24|r25|r26|r27|r28|r29|r30|r31,
        loopcnt when rlcnt,
        shimmex when rfshimm|rnshimm,
        pliw when rlimm,
        xldata when others;

Note that in the example above, the "s1a" field is used for the initial selection of the stage 2 result; a subsequent reduction operation is then applied.

-- Source 2 field
with s2a select
    s2_direct <= qd_b when
            r0|r1|r2|r3|r4|r5|r6|r7|
            r8|r9|r10|r11|r12|r13|r14|r15|
            r16|r17|r18|r19|r20|r21|r22|r23|
            r24|r25|r26|r27|r28|r29|r30|r31,
        loopcnt when rlcnt,
        shimmex when rfshimm|rnshimm,
        pliw when rlimm,
        xldata when others;
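The selection performed by the two listings above can be mirrored in a short behavioral model. The mapping of the VHDL constants to register numbers (rlcnt = 60, rfshimm = 61, rlimm = 62, rnshimm = 63) is inferred from Table 1 and is an assumption, as is the reading of the signal names given in the comments; the function name `select_source` is hypothetical.

```python
def select_source(reg_field, qd, loopcnt, shimmex, pliw, xldata):
    """Return the value driven onto a source bus for a given register field.

    qd      -- register file read data (general-purpose registers r0-r31)
    loopcnt -- loop count register value
    shimmex -- short immediate, taken here to be already sign-extended
    pliw    -- long immediate word fetched after the instruction
    xldata  -- data for all remaining encodings (extension registers)
    """
    if 0 <= reg_field <= 31:
        return qd
    if reg_field == 60:                 # rlcnt
        return loopcnt
    if reg_field in (61, 63):           # rfshimm | rnshimm
        return shimmex
    if reg_field == 62:                 # rlimm
        return pliw
    return xldata                       # "when others"
```

The same function serves both buses, since the source 1 and source 2 listings differ only in the select field and the register file read port.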

Because the short immediate (shimm) field overlaps the flag-setting bit in the conditional instruction format, additional logic is used to control flag setting. In one embodiment of this logic, either the instruction's ".f" bit is used, or the value implied by the short immediate data register number is used, or the signal is set to "false" if the instruction cannot set flags (for example, load/store or branch/jump). The special-case flag setters (Jcc.F and FLAG) are handled separately in a separate file. If a 3-operand extension instruction is used that employs the short immediate field for a purpose other than short immediate data (an encoding indicated by the xshimm signal), the flags are not set. This feature of the invention is further illustrated by the following example flag-setting calculation:

-- Stage 3 flag-setting calculation
ip3setflags <= '0' WHEN f_no_fset(ip3i) = '1'
        or (xshimm AND x_idecode3 AND xt_aluop) = '1' ELSE
    ip3shimmf WHEN ip3shimm = '1' ELSE
    ip3_fbit;
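The priority order of this conditional assignment can be restated as a plain function. The sketch below is a direct transliteration of the listing into Python booleans; the parameter `no_fset` stands in for the result of `f_no_fset(ip3i)`, and the function name `ip3_set_flags` is hypothetical.

```python
def ip3_set_flags(no_fset, xshimm, x_idecode3, xt_aluop,
                  ip3shimm, ip3shimmf, ip3_fbit):
    """Decide whether the stage 3 instruction sets the status flags."""
    # Highest priority: instructions that can never set flags, or extension
    # ALU instructions that reuse the short immediate field for another purpose.
    if no_fset or (xshimm and x_idecode3 and xt_aluop):
        return False
    # Next: a short immediate is in use, so the register number (shimmf
    # versus shimm) implies the flag setting instead of the ".f" bit.
    if ip3shimm:
        return bool(ip3shimmf)
    # Otherwise the ".f" bit of the instruction decides.
    return bool(ip3_fbit)
```

The ordering matters: the ".f" bit is consulted only when neither of the two overriding conditions applies, because in the short immediate formats that bit position carries immediate data.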

It should be noted that various implementations of the multiplexers described above can be used in constructing the invention, depending on the particular method used to code the VHDL. Given the functionality described above, the coding of such alternative multiplexer implementations is well known to those of ordinary skill in the programming arts and is therefore not described further here.

In addition, the method and apparatus of the present invention can advantageously be used together with (either individually or collectively) methods of pipeline control and interlock in a pipelined processor, including in particular those disclosed in the applicant's co-pending U.S. patent applications entitled "Method And Apparatus For Jump Control In A Pipelined Processor," "Method And Apparatus For Jump Delay Slot Control In A Pipelined Processor," and "Method And Apparatus For Processor Pipeline Segmentation And Re-assembly," filed concurrently herewith, each of which is incorporated herein by reference in its entirety.

Method of synthesis

Referring to Fig. 4, a method 400 of synthesizing logic incorporating the previously discussed jump delay slot mode is described. A general method of synthesizing integrated circuit logic having a customized (i.e., "soft") instruction is disclosed in the applicant's co-pending U.S. patent application Serial No. 09/418,663, entitled "Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design", filed October 14, 1999, which is incorporated herein by reference in its entirety.

Although the following description is cast in terms of an algorithm or computer program running on a microcomputer or other similar processing device, it will be appreciated that other hardware environments (including minicomputers, workstations, networked computers, "supercomputers", and mainframes) may be used to practice the method. In addition, one or more portions of the computer program may, if desired, be embodied in hardware or firmware rather than software; such alternative embodiments are well known in the computer arts.

First, in step 402, user input regarding the design configuration is obtained. Specifically, the user selects the desired modules or functions for the design, and instructions pertinent to the design are added or removed, or needed instructions are generated. For example, in signal processing applications it is often advantageous for the CPU to include a single "multiply and accumulate" (MAC) instruction. In the present invention, the instruction set of the synthesized design is modified so as to incorporate the jump delay slot modes described above (or other equivalent jump delay slot control architectures). In particular, in one embodiment of the invention, the jump delay slot mode of the jump instruction word described above with reference to Fig. 1 is represented by two data bits, which specify one of a plurality of predetermined values. The technology library location for each VHDL file is also defined by the user in step 402. The technology library files of the present invention store all of the information related to the cells required for synthesis, including for example logic function, input/output timing, and any associated constraints. In the present invention, each user can define his or her own library name and location, which further increases flexibility.

Next, in step 403, customized hardware description language (HDL) functional blocks are created based on the user input specified in step 402 and the existing library of functions.

In step 404, the design hierarchy is determined based on the user input and the aforementioned library files. A hierarchy file, a new library file, and a makefile are subsequently generated based on the design hierarchy. The term "makefile" as used herein refers to the conventional UNIX makefile function or to the similar function of other computer systems well known to those of ordinary skill in the computer arts. The makefile function causes other programs or algorithms resident in the computer system to be executed in a specified order. In addition, it specifies the names and locations of the data files and other information needed for the specified programs to operate successfully. It should be noted, however, that the invention disclosed herein may use file structures other than a "makefile" to produce the desired functionality.
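As a rough illustration of what makefile generation from a design hierarchy might look like, the sketch below renders a dictionary of rules as makefile text. Everything in it is hypothetical: the function name `render_makefile`, the rule layout, and the file names in the usage note are invented for illustration and do not come from the patent or from any real tool flow.

```python
from typing import Dict, List, Tuple

# Each target maps to (dependencies, shell commands that build it).
Rules = Dict[str, Tuple[List[str], List[str]]]


def render_makefile(rules: Rules) -> str:
    """Render rules as makefile text, one 'target: deps' stanza per entry."""
    lines: List[str] = []
    for target, (deps, commands) in rules.items():
        lines.append(f"{target}: {' '.join(deps)}".rstrip())
        # make requires each command line to begin with a tab character.
        lines.extend(f"\t{command}" for command in commands)
        lines.append("")
    return "\n".join(lines)
```

A call such as `render_makefile({"cpu_struct.vhdl": (["hierarchy.txt"], ["echo build structural HDL"])})` produces a one-rule makefile whose command runs only when the (hypothetical) hierarchy file is newer than the target, which is the ordering behavior the paragraph above describes.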

In one embodiment of the makefile generation process of the present invention, the user is asked interactively, via display prompts, to enter information relating to the desired design, such as the type of "build" (for example, overall device or system configuration), the width of the external memory system data bus, the different types of extensions, the cache type and size, and so on. Many other configurations and sources of input information may also be used consistent with the invention.

In step 406, the makefile generated in step 404 is run to create the structural HDL. This structural HDL ties the discrete functional blocks together to form a complete design.

Next, in step 408, the script generated in step 406 is run to create a makefile for the simulator. The script that generates a synthesis script is also run in step 408.

At this point in the program, a decision is made whether to synthesize or simulate the design (step 410). If simulation is chosen, the user runs the simulation in step 412 using the generated design and the simulation makefile (and a user program). Alternatively, if synthesis is chosen, the user runs the synthesis in step 414 using the synthesis script and the generated design. After the synthesis/simulation scripts have completed, the adequacy of the design is evaluated in step 416. For example, a synthesis engine may create a physical layout of the design that meets the performance criteria of the overall design process but does not meet the required die size. In that case, the designer changes the control files, libraries, or other elements that can affect die size. The resulting set of design information is then used to re-run the synthesis script.

If the generated design is acceptable, the design process is complete. If the design is not acceptable, the process steps beginning with step 402 are repeated until an acceptable design is obtained. In this fashion, the method 400 is iterative.
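The iterative structure of method 400 can be summarized in a few lines of Python. This is a schematic sketch only: the callables `get_user_input`, `build_design`, and `evaluate` are placeholders for steps 402, 403-414, and 416 respectively, and the `max_iterations` safety limit is an addition of this sketch (the method as described simply repeats until the design is acceptable).

```python
from typing import Any, Callable, Tuple


def run_design_loop(
    get_user_input: Callable[[int], Any],
    build_design: Callable[[Any], Any],
    evaluate: Callable[[Any], bool],
    max_iterations: int = 10,
) -> Tuple[Any, int]:
    """Repeat configure -> build -> evaluate until the design is acceptable."""
    for iteration in range(1, max_iterations + 1):
        config = get_user_input(iteration)   # step 402: design configuration
        design = build_design(config)        # steps 403-414: generate, simulate or synthesize
        if evaluate(design):                 # step 416: is the design adequate?
            return design, iteration
    raise RuntimeError("no acceptable design within the iteration limit")
```

In the die-size example from the text, `evaluate` would reject a layout that is too large, and the next call to `get_user_input` would return the designer's revised control files or library choices.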

Synthesized logic

Referring now to Figs. 5-9, exemplary logic implementing the "loose" register encoding functionality described previously herein, synthesized using the method of Fig. 4, is described.

Fig. 5 illustrates one embodiment of the top-level logic for loose register encoding source 1 selection. In the embodiment of Fig. 5, the top-level logic comprises eight identical 4-bit multiplexers that together make up the full 32 bits. [Note that, for clarity of presentation, the logic illustrated in Fig. 5 is divided into two levels.] The same logic can be used for source 2 selection.
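The idea of building a 32-bit selection out of eight identical 4-bit slices can be sketched as follows. This is one reading of the figure description rather than a rendering of Fig. 5 itself; the function name `mux32_from_nibbles` and the list-based interface are assumptions.

```python
from typing import Sequence


def mux32_from_nibbles(select: int, inputs: Sequence[int]) -> int:
    """Select one 32-bit input using eight identical 4-bit-wide multiplexers."""
    result = 0
    for slice_index in range(8):
        shift = 4 * slice_index
        # Each slice sees only its own 4 bits of every candidate input.
        nibbles = [(word >> shift) & 0xF for word in inputs]
        # All eight slices share the same select value.
        result |= nibbles[select] << shift
    return result
```

Because every slice receives the same select value, the reassembled word equals the chosen 32-bit input; the slicing changes only how the logic is partitioned, not what it computes.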

Fig. 6 illustrates a first embodiment of the 4-bit multiplexer described above for loose register encoding of source 1. In the multiplexer of Fig. 6, no operational or design constraints are placed on synthesis.

Fig. 7 illustrates the first embodiment of the 4-bit multiplexer described above for loose register encoding of source 1, except that the logic is constrained so as to provide the shortest path from the long immediate data input bus ('pliw') to the output bus ('s1_direct').

Fig. 8 illustrates a first embodiment (unconstrained) of the flag-setting logic according to the invention.

Fig. 9 illustrates a second embodiment of the flag-setting logic according to the invention, constrained so as to minimize area.

Referring now to Fig. 10, an example of a pipelined processor fabricated using a 1.0 um process and incorporating the logic of Figs. 5-9 is illustrated. As shown in Fig. 10, the processor 1000 is an ARC-like microprocessor CPU device having, among other things, a processor core 1002, on-chip memory 1004, and an external interface 1006. The device is fabricated using the customized VHDL design obtained with the method 400 of the present invention; this design is subsequently synthesized into a logic-level representation and then reduced to a physical device using compilation, layout, and fabrication techniques well known in the semiconductor arts.

It will be appreciated by those of ordinary skill in the art that the processor of Fig. 10 may contain any commonly available peripheral, such as serial communications devices, parallel ports, timers, counters, high-current drivers, analog-to-digital (A/D) converters, digital-to-analog (D/A) converters, interrupt processors, LCD drivers, memories, and other similar devices. In addition, the processor may include custom or application-specific circuitry. The present invention is not limited in the type, number, or complexity of the peripherals and other circuitry that may be combined using the method and apparatus of the invention. Rather, any limitations are imposed by the physical capacity of existing semiconductor processes, which improves over time. It is therefore anticipated that the complexity and degree of integration achievable with the present invention will increase further as semiconductor processes improve.

It should also be noted that many current IC designs use both a microprocessor core and a DSP core. The DSP, however, may be needed only for a limited number of DSP functions, or for the IC's fast DMA architecture. The invention disclosed herein can support many DSP instruction functions, and its fast local RAM system gives immediate access to data. Significant cost savings can be realized by using the methods disclosed herein for both the CPU and DSP functions of the IC.

Alternatively, the processor 1000 of Fig. 10 may be synthesized so as to incorporate a memory interface for interfacing between one or more IC functions (such as a DSP) and a memory array of the processor 1000, as described in the applicant's co-pending U.S. patent application entitled "Memory Interface and Method of Interfacing Between Integrated Circuits", filed March 10, 2000, which is incorporated herein by reference in its entirety.

In addition, it should be noted that the methodology (and associated computer program) described herein readily permits relatively simple re-synthesis to adapt a design to newer production technologies, such as 0.35, 0.18, or 0.1 micron processes, whereas prior-art systems using "hard" macros typically require a lengthy and expensive process to adapt to such technologies.

Referring now to Fig. 11, one embodiment of a computing device capable of synthesizing, among other things, the logic structures of Figs. 5-9 is described. The computing device 1100 comprises a motherboard 1101 having a central processing unit (CPU) 1102, random access memory (RAM) 1104, and a memory controller 1105. A storage device 1106 (such as a hard disk drive or CD-ROM), an input device 1107 (such as a keyboard or mouse), and a display device 1108 (such as a CRT, plasma, or TFT display) are also provided, together with the buses necessary to support operation of the host and peripheral components. The VHDL descriptions and synthesis engine described above are stored, in the form of an object code representation of a computer program, in the RAM 1104 and/or the storage device 1106 for use by the CPU 1102 during design synthesis, as is well known in the computing arts. During system operation, the user (not shown) synthesizes logic designs by entering design configuration specifications into the synthesis program via the program displays and the input device 1107. Synthesized designs generated by the program are stored in the storage device 1106 for later retrieval, displayed on the graphic display device 1108, or, if desired, output via a serial or parallel port 1112 to an external device such as a printer, data storage unit, or other peripheral component.

While the above detailed description has shown, described, and pointed out the novel features of the invention as applied to various embodiments, it will be understood that those skilled in the art may make various omissions, substitutions, and changes in the form and details of the device or process illustrated without departing from the scope of the invention. The foregoing describes the best mode of carrying out the invention. It is not meant to be limiting, but should be taken as illustrative of the general principles of the invention. The scope of the invention should be determined with reference to the claims.

Claims

1. A method of designing a user-customizable digital processor having an instruction set, comprising:

Receiving user input relating to the design configuration of said processor, at least a portion of said input relating to an extension instruction;

Selecting at least one module or functional description for use in generating the design of said user-customizable digital processor; and

Generating at least one customized hardware description language model of said processor based at least in part on said user input and said at least one selected module or functional description, said model having said extension instruction as part of said instruction set;

Wherein said extension instruction uses both short immediate data and long immediate data as source data; and

Wherein said extension instruction comprises an opcode and a plurality of data fields, said data fields representing at least first and second register numbers, said register numbers being capable of representing said short immediate data or said long immediate data.

2. The method of claim 1, further comprising generating at least one customized functional block based at least in part on said user input and said at least one selected module or functional description.

3. The method of claim 2, wherein the act of receiving user input comprises receiving a user selection of one or more desired processor configuration parameters.

4. The method of claim 2, wherein said at least one module or functional description comprises one or more technology library files.

5. The method of claim 4, wherein the location of each of said technology library files is also provided by said user.

6. The method of claim 4, wherein said technology library files are usable to store all information relating to the cells required for synthesis.

7. The method of claim 6, wherein said information is selected from: (i) logic function, (ii) input or output timing, and (iii) associated constraints.

8. The method of claim 4, further comprising generating at least one customized hardware description language functional block based at least in part on said user input and at least one of said technology library files.

9. The method of claim 4, further comprising generating at least one design hierarchy, said design hierarchy being based at least in part on said user input and said technology library files.

10. The method of claim 1, further comprising generating at least one design hierarchy, said design hierarchy being based at least in part on said user input and said module or functional description.

11. The method of claim 10, further comprising generating at least one hierarchy file and a new module or functional description file based at least in part on said design hierarchy.

12. The method of claim 10, further comprising generating at least one customized functional block based at least in part on said user input and said module or functional description.

13. The method of claim 1, wherein the act of receiving user input comprises receiving a user selection of a memory interface, the memory interface having:

A plurality of function ports, said function ports each being operatively coupled to a corresponding one of a plurality of functional units of said processor; and

A plurality of memory ports, said memory ports each being operatively coupled to a respective one of a plurality of memory banks within a memory associated with said processor;

Said memory interface facilitating access to each of said memory banks by each of said functional units of said processor;

Said memory interface being integrated into said processor as part of the act of generating said model.

14. The method of claim 2, further comprising generating structural code so as to tie said at least one customized functional block and a plurality of other functional blocks together within the design of said user-customizable digital processor.

15. An apparatus for designing a user-customizable digital processor having an instruction set, comprising:

A computer system adapted to run computer programs; and

At least one computer program, wherein said apparatus further comprises:

Means for receiving user input relating to the design configuration of said processor, at least a portion of said input relating to an extension instruction;

Means for selecting at least one module or functional description for use in generating the design of said user-customizable digital processor; and

Means for generating at least one customized hardware description language model of said processor based at least in part on said user input and said at least one selected module or functional description, said model having said extension instruction as part of said instruction set;

16. The apparatus of claim 15, wherein said apparatus further comprises means for generating at least one customized functional block based at least in part on said user input and said at least one selected module or functional description.

17. The apparatus of claim 16, wherein said user input comprises a selection, made via an input device, of one or more desired processor configuration parameters, said input device being operatively coupled to said apparatus.

18. The apparatus of claim 16, wherein said at least one module or functional description comprises one or more technology library files, said technology library files being stored in a storage device in data communication with said apparatus.

19. The apparatus of claim 18, wherein the location of each of said technology library files is also provided by said user.

20. The apparatus of claim 18, wherein said technology library files are usable to store all information relating to the cells required for synthesis.

21. The apparatus of claim 20, wherein said information is selected from: (i) logic function, (ii) input or output timing, and (iii) associated constraints.

22. The apparatus of claim 18, wherein said apparatus further comprises: means for generating at least one customized hardware description language functional block based at least in part on said user input and at least one of said technology library files.

23. The apparatus of claim 18, wherein said apparatus further comprises: means for generating at least one design hierarchy, said design hierarchy being based at least in part on said user input and said technology library files.

24. The apparatus of claim 15, wherein said apparatus further comprises: means for generating at least one design hierarchy, said design hierarchy being based at least in part on said user input and said module or functional description.

25. The apparatus of claim 24, wherein said apparatus further comprises: means for generating at least one hierarchy file and a new module or functional description file based at least in part on said design hierarchy.

26. The apparatus of claim 24, wherein said apparatus further comprises: means for generating at least one customized functional block based at least in part on said user input and said module or functional description.

27. The apparatus of claim 26, wherein said user input comprises a selection of one or more desired processor configuration parameters.

28. The apparatus of claim 16, wherein said apparatus further comprises: means for generating structural code so as to tie said at least one customized functional block and a plurality of other functional blocks together within the design of said user-customizable digital processor.

29. A method of designing a user-customizable digital processor having an instruction set and a plurality of registers, comprising:

Wherein said generating of said at least one model comprises: generating a processor design that enables said processor, when transferring data, to store immediate operands and data values directly to memory without using one of said registers as an intermediary; and

Wherein said extension instruction can use short immediate data or long immediate data as source data.

30. The method of claim 29, wherein the act of receiving user input comprises receiving a user selection of a memory interface, the memory interface having: