CN1355900A - Method and apparatus for processor pipeline segmentation and re-assembly

Info

Publication number: CN1355900A
Application number: CN00808458A
Authority: CN
Inventors: J. R. H. Hakewill; J. Sanders
Original assignee: ARC International U.S. Holdings Inc.
Current assignee: Synopsys Inc
Priority date: 1999-05-13
Filing date: 2000-05-12
Publication date: 2002-06-26
Anticipated expiration: 2020-05-12
Also published as: CN1217261C; WO2000070483A3; TW589544B; WO2000070483A2; AU4848700A; EP1190337A2

Abstract

An improved method and apparatus for implementing instructions in a pipelined central processing unit (CPU) or user-customizable microprocessor. In a first aspect of the invention, an improved method of controlling the operation of the pipeline in situations where one stage has been stalled or interrupted is disclosed. In one embodiment, a method of pipeline segmentation ("tearing") is disclosed where the later, non-stalled stages of the pipeline are permitted to continue despite the stall of the earlier stage. Similarly, a method which permits instructions present at earlier stages in the pipeline to be re-assembled ("catch-up") to later stalled stages is also described. A method of synthesizing a processor design incorporating the aforementioned segmentation and re-assembly methods, and a computer system capable of implementing this synthesis method, are also described.

Description

Method and apparatus for processor pipeline segmentation and re-assembly

This application claims priority to U.S. Provisional Patent Application Serial No. 60/134,253, filed May 13, 1999, entitled "Method And Apparatus For Synthesizing And Implementing Integrated Circuit Designs", and to co-pending U.S. Patent Application Serial No. 09/418,663, filed October 14, 1999, entitled "Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design", which itself claims priority to U.S. Provisional Patent Application Serial No. 60/102,271, filed October 14, 1998, of the same title.

Background of the Invention

1. Field of the Invention

The present invention relates to the field of integrated circuit (IC) design, and specifically to the use of a hardware description language (HDL) for implementing instructions in a pipelined central processing unit (CPU) or user-customizable microprocessor.

2. Description of Related Art

RISC (reduced instruction set computer) processors are well known in the computing arts. Their fundamental characteristic is an instruction set that is substantially reduced compared with that of non-RISC (commonly known as "CISC") processors. Typically, RISC machine instructions are not all micro-coded but may be executed directly without decoding, which yields a significant gain in operating speed. This "streamlined" instruction handling in turn allows a simpler processor design (as compared with non-RISC devices), and thereby smaller silicon area and lower fabrication cost.

RISC processors are also typically characterized by (i) a load/store memory architecture (that is, only the load and store instructions have access to memory, while other instructions operate via internal registers within the processor); (ii) unity of processor and compiler; and (iii) pipelining.

Pipelining is a technique for increasing processor performance by dividing the sequence of operations within the processor into discrete segments which are, where possible, executed in parallel. In a typical pipelined processor, the arithmetic units associated with processor arithmetic operations (such as ADD, MULTIPLY, DIVIDE, and so on) are usually "segmented", so that in any clock cycle a given segment of the unit performs a specific portion of the operation. Fig. 1 illustrates an exemplary processor architecture having such segmented arithmetic units. These units can therefore operate on different calculations in any given clock cycle. For example, in the first clock cycle two numbers A and B are fed to the multiplier unit 10 and partially processed by the first segment 12 of that unit. In the second clock cycle, the partial result from multiplying A and B is passed to the second segment 14, while the first segment 12 receives two new numbers (say C and D) and begins processing them. The net result is that, after an initial start-up period, the arithmetic unit 10 completes one multiplication operation every clock cycle.
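By way of illustration only, the overlap described above can be sketched as a small behavioural model. The function name `run_segmented_multiplier` and the way the multiply is split between the two segments (low four bits of the second operand in the first segment, the remaining bits in the second) are assumptions made for the sketch and are not taken from Fig. 1.

```python
def run_segmented_multiplier(pairs):
    """Model a two-segment multiplier for non-negative integers.

    Returns a list of (cycle, product) tuples, one per completed multiplication.
    """
    latch = None          # value held between the first and second segment
    pending = list(pairs)
    results = []
    cycle = 0
    while pending or latch is not None:
        cycle += 1
        # Second segment: finish the operation started in the previous cycle.
        if latch is not None:
            a, b, partial = latch
            results.append((cycle, partial + ((a * (b >> 4)) << 4)))
        # First segment: start a new operation in the same cycle.
        if pending:
            a, b = pending.pop(0)
            latch = (a, b, a * (b & 0xF))
        else:
            latch = None
    return results


print(run_segmented_multiplier([(3, 5), (7, 20), (2, 9)]))
```

In this model the first product appears in cycle 2 and a further product appears in every following cycle, which is the steady-state behaviour the paragraph describes.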

The depth of the pipeline may vary from one architecture to another. In the present context, the term "depth" refers to the number of discrete stages present in the pipeline. In general, a pipeline with more stages executes programs faster, but may be more difficult to program if the pipeline effects are visible to the programmer. Most pipelined processors have either three stages (instruction fetch, decode, and execute) or four stages (for example instruction fetch, decode, operand fetch, and execute, or alternatively instruction fetch, decode/operand fetch, execute, and writeback), although more or fewer stages may be used.

Despite the foregoing "segmentation" of operations within the processor, the instructions within the pipeline of prior art processors are generally contiguous. Specifically, an instruction in one stage generally follows directly behind the instruction in the later stage, with a minimum of blank slots, NOP codes, or the like. Furthermore, when an instruction in a later stage is stalled (for example, when an instruction in the execute stage is awaiting information from a fetch operation), the earlier and later stages of the pipeline are stalled as well. In this manner the pipeline tends to operate essentially in "lock-step" fashion.

When developing the instruction set of a pipelined processor, several "hazards" must be considered. For example, so-called "structural" or "resource contention" hazards arise from overlapping instructions competing for the same resource (such as a bus, a register, or another functional unit); these are typically resolved using one or more pipeline stalls. So-called "data" pipeline hazards occur in the case of read/write conflicts which may change the order of memory or register accesses. "Control" hazards are generally produced by branches or similar changes in program flow.

Interlocks are generally necessary in pipelined processors to address many of these hazards. For example, consider the case where a following instruction (n+1) in an earlier pipeline stage needs the result of the instruction n in a later stage. A simple solution to this problem is to delay the operand calculation in the decode stage by one or more clock cycles. One result of such a delay, however, is that the execution time of a given instruction on the processor is determined in part by the instructions surrounding it in the pipeline. This complicates optimization of the code for the processor, since it is often difficult for the programmer to identify interlock situations within the code.

A "scoreboard" may be used in the processor to implement interlocks. In this approach, a bit is attached to each processor register to act as an indicator of the register content; specifically, it indicates whether (i) the contents of the register have been updated and are therefore ready for use, or (ii) the contents are undergoing modification, such as being written to by another process. In some architectures this scoreboard is also used to generate interlocks, which prevent the execution of instructions that depend on the contents of a scoreboarded register until the scoreboard indicates that the register is ready. This approach is referred to as a "hardware" interlock, since the interlock is invoked purely through examination of the scoreboard by hardware within the processor. Such interlocks generate "stalls", which preclude the data-dependent instruction from executing (thereby stalling the pipeline) until the register is ready.
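The scoreboard interlock described above can be sketched, purely as an illustration, with one "pending" bit per register. The class and method names (`Scoreboard`, `mark_pending`, `mark_ready`, `must_stall`) are hypothetical and chosen for this sketch; they do not correspond to any signal or block named in the disclosure.

```python
class Scoreboard:
    """One 'pending' bit per register; a set bit means the value is not yet usable."""

    def __init__(self, num_registers):
        self.pending = [False] * num_registers

    def mark_pending(self, reg):
        # For example, a load that targets this register has been issued.
        self.pending[reg] = True

    def mark_ready(self, reg):
        # The new value has arrived and may be read.
        self.pending[reg] = False

    def must_stall(self, source_regs):
        # Hardware interlock: stall while any source register is still pending.
        return any(self.pending[r] for r in source_regs)


sb = Scoreboard(8)
sb.mark_pending(3)
print(sb.must_stall([1, 3]))   # dependent instruction is held
sb.mark_ready(3)
print(sb.must_stall([1, 3]))   # dependent instruction may proceed
```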

Alternatively, NOPs (no-operation opcodes) may be inserted in the code so as to delay the appropriate pipeline stage when desired. This latter approach is referred to as a "software" interlock, and has the disadvantage of increasing the code size and complexity of programs that employ instructions requiring interlocks. Designs that rely heavily on software interlocks also tend not to be fully optimized in terms of their code structure.

Another important consideration in processor design is program branching, or "jumps". All processors support some type of branching instruction. Simply stated, branching refers to the condition where program flow is interrupted or altered. Other operations, such as loop setup and subroutine call instructions, also interrupt or alter program flow in a similar fashion. The term "jump delay slot" is often used to refer to the slot within a pipeline that follows a branch or jump instruction being decoded. The instruction after the branch (or load) is executed while awaiting completion of the branch/load instruction. Branching may be conditional (that is, based on the truth or value of one or more parameters) or unconditional. It may also be absolute (for example, based on an absolute memory address) or relative (for example, based on relative addresses and independent of any particular memory address).

Branching can have a profound effect on pipelined systems. By the time a branch instruction has been inserted and decoded by the instruction decode stage of the processor (indicating that the processor must begin fetching from a different address), the next instruction word in the instruction sequence has already been fetched and inserted into the pipeline. One solution to this problem is to purge the fetched instruction word and halt, or stall, further fetch operations until the branch instruction has been executed, as illustrated in Fig. 2. This approach, however, necessarily results in the branch instruction being executed over several instruction cycles, a number typically equal to the depth of the pipeline employed in the processor design. This result is detrimental to processor speed and efficiency, since the processor cannot perform other operations during that period.

Alternatively, a delayed branch approach may be employed. In this approach, the pipeline is not purged when a branch instruction reaches the decode stage; rather, the subsequent instructions present in the earlier stages of the pipeline are executed normally before the branch is executed. Hence, when the branch instruction is decoded, the branch appears to be delayed by the number of instruction cycles needed to execute all subsequent instructions in the pipeline. The delayed branch approach improves pipeline efficiency as compared with the multi-cycle branch described above, but it also increases the complexity of the underlying code (and makes it harder for the programmer to understand).

Based on the foregoing, processor designers and programmers must carefully weigh the tradeoffs associated with using hardware or software interlocks as opposed to a non-interlocked architecture. Furthermore, the interaction of the branching instructions in the instruction set (and of delayed or multi-cycle branching) with the selected interlock scheme must be considered.

What is needed is an improved approach to pipelining and interlocks which both optimizes processor pipeline performance and provides the programmer with additional coding flexibility. Furthermore, as more pipeline stages (and even multiple multi-stage pipelines) are added to processor designs, the benefits of improved pipeline performance and code optimization within the processor are multiplied. In addition, the ability to readily synthesize such improved pipelined processor designs in an application-specific manner, using existing synthesis tools, is of significant benefit to the designer and programmer.

Summary of the invention

The present invention satisfies the aforementioned needs by providing an improved method and apparatus for executing instructions within a pipelined processor architecture.

In a first aspect of the invention, an improved method of controlling the operation of one or more pipelines within a processor is disclosed. In one embodiment, a method of pipeline segmentation ("tearing") is disclosed wherein (i) instructions in stages prior to the stalled stage are also stalled, and (ii) instructions in stages subsequent to the stalled instruction are permitted to complete. A discontinuity, or "tear", is thereby purposely created in the pipeline. A blank slot (or NOP) is inserted into the subsequent stage of the pipeline to prevent the instruction currently being executed in the torn stage from being executed repeatedly. Similarly, a method is disclosed which permits instructions that would otherwise be stalled at earlier stages of the pipeline to be re-assembled with ("catch up" to) later stalled stages, thereby effectively repairing any tears or existing discontinuities in the pipeline.

In a second aspect of the invention, an improved method of synthesizing the design of an integrated circuit incorporating the aforementioned jump delay slot method is disclosed. In one exemplary embodiment, the method comprises obtaining user input regarding the design configuration; creating customized HDL functional blocks based on the user input and existing libraries of functions; determining the design hierarchy based on the user input and the libraries, and generating a hierarchy file, new library files, and a makefile; running the makefile to create the structural HDL and scripts; running the generated scripts to create a makefile for the simulator and a synthesis script; and synthesizing the design based on the generated design and synthesis script.

In a third aspect of the invention, an improved computer program useful for synthesizing processor designs and embodying the aforementioned methods is disclosed. In one exemplary embodiment, the computer program comprises an object code representation stored on the magnetic storage device of a microcomputer and adapted to run on the central processing unit thereof. The computer program further comprises an interactive, menu-driven graphical user interface (GUI), thereby facilitating ease of use.

In a fourth aspect of the invention, an improved apparatus for running the aforementioned computer program, which is used to synthesize logic associated with pipelined processors, is disclosed. In one exemplary embodiment, the system comprises a stand-alone microcomputer system having a display, a central processing unit, data storage device(s), and an input device.

In a fifth aspect of the invention, an improved processor architecture utilizing the foregoing pipeline tearing and catch-up methods is disclosed. In one exemplary embodiment, the processor comprises a reduced instruction set computer (RISC) having a three stage pipeline comprising instruction fetch, decode, and execute stages, which are controlled in part by the aforementioned pipeline tearing/catch-up modes. Synthesized gate logic, both constrained and unconstrained, is also disclosed.

Brief Description of the Drawings

Fig. 1 is a block diagram of a typical prior art processor architecture employing "segmented" arithmetic units.

Fig. 2 illustrates the operation of a prior art four stage pipelined processor performing a multi-cycle branch operation.

Fig. 3 is a pipeline flow diagram illustrating the concept of "tearing" in a multi-stage pipeline according to the present invention.

Fig. 4 is a logical flow diagram illustrating the generalized method of controlling a pipeline using "tearing" according to the present invention.

Fig. 5 is a pipeline flow diagram illustrating the concept of "catch-up" in a multi-stage pipeline according to the present invention.

Fig. 6 is a logical flow diagram illustrating the generalized method of controlling a pipeline using "catch-up" according to the present invention.

Fig. 7 is a logical flow diagram illustrating the generalized method of synthesizing processor logic that incorporates pipeline tearing/catch-up according to the present invention.

Figs. 8a-8b are schematic diagrams illustrating one exemplary embodiment of gate logic implementing the pipeline "tearing" function of the invention (unconstrained and constrained, respectively), synthesized using the method of Fig. 7.

Figs. 8c-8d are schematic diagrams illustrating one exemplary embodiment of gate logic implementing the pipeline "catch-up" function of the invention (unconstrained and constrained, respectively), synthesized using the method of Fig. 7.

Fig. 9 is a block diagram of a processor design incorporating the pipeline tearing/catch-up modes according to the present invention.

Fig. 10 is a functional block diagram of a computing device incorporating a computer program that embodies the method of Fig. 7 to synthesize a pipelined processor design.

Detailed Description of the Invention

Reference is now made to the drawings, wherein like numerals refer to like parts throughout.

As used herein, the term "processor" is meant to include any integrated circuit or other electronic device capable of performing an operation on at least one instruction word, including, without limitation, reduced instruction set computer (RISC) processors such as the ARC user-configurable core produced by the Assignee hereof, central processing units (CPUs), and digital signal processors (DSPs).

Additionally, those of ordinary skill in the art will recognize that the term "stage" as used herein refers to the successive stages within a pipelined processor; that is, stage 1 refers to the first pipeline stage, stage 2 to the second pipeline stage, and so forth. While the following discussion is cast in terms of a three stage pipeline (instruction fetch, decode, and execute stages), it will be appreciated that the method and apparatus disclosed herein are broadly applicable to processor architectures having one or more pipelines with more or fewer than three stages.

It will also be appreciated that while the following discussion is cast in terms of the VHSIC hardware description language (VHDL), other hardware description languages such as Verilog may be used to describe the various embodiments of the invention with equal success. Furthermore, while an exemplary Synopsys synthesis engine such as Design Compiler 1999.05 (DC99) is used to synthesize the various embodiments set forth herein, other synthesis engines, such as Buildgates available from Cadence Design Systems, Inc., may also be used. IEEE Std. 1076.3-1997, IEEE Standard VHDL Synthesis Packages, specifies an industry-accepted language for specifying a hardware description language based design and the synthesis capabilities that one of ordinary skill in the art may expect to have available.

Lastly, it will be recognized that while the following description illustrates specific embodiments of logic synthesized by the Assignee hereof using the aforementioned synthesis engine and VHSIC hardware description language, and while such embodiments are constrained in certain respects, these embodiments are merely exemplary and illustrative of the design process of the invention.

Pipeline segmentation (" tearing ")

The architecture of the present invention includes a generally free-flowing pipeline. If a stage in the pipeline is stalled, the preceding stages are also stalled if they contain instructions. Despite the stalling of these preceding stages, however, there are several advantages to allowing the later ("downstream") stages of the pipeline to continue, provided no interlocks are otherwise applied. These advantages include, inter alia: (i) better processing performance than "stalling" the entire pipeline, since some instructions are thereby allowed to continue through the pipeline; (ii) the ability to continue processing flag-setting instructions located in later stages of the pipeline, thereby ensuring that the flags are set before a jump or branch instruction, whose execution may be affected by the flag state, is executed; and (iii) allowing a scoreboarded load to issue its request to memory from a later stage of the pipeline while an instruction that depends on that load is held in an earlier stage. The load must be permitted to issue; otherwise a deadlock condition would result.

Note that, with regard to the continued processing of flag-setting instructions, Applicant's co-pending U.S. patent application entitled "Method And Apparatus For Jump Control In A Pipelined Processor" discloses a method and apparatus for interlocking flag-setting instructions with subsequent jump/branch instructions that may be affected by the flags set by the flag-setting instruction.

As an example of the foregoing, consider a processor having a three stage pipeline (fetch, decode, execute) in which an instruction is stalled in stage 2, while the instruction in stage 3 is permitted to "tear away" from the earlier stage and continue on down the remaining stages of the pipeline. Fig. 3 illustrates this concept (assuming no interlocks are applied).
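The three stage example above can be sketched, for illustration only, as a single clock step over a list of stage contents. The function name `step_with_tearing`, the use of `None` for an empty slot, and the simplification that stage 3 is never itself stalled are assumptions of this sketch rather than features of the disclosed logic.

```python
def step_with_tearing(stages, stage2_stalled, next_fetch):
    """Advance a (fetch, decode, execute) pipeline by one clock.

    stages: [s1, s2, s3], each an instruction name or None for an empty slot.
    Returns (new_stages, retired), where retired is what left stage 3.
    """
    s1, s2, s3 = stages
    retired = s3  # stage 3 is never held in this sketch, so it always completes
    if stage2_stalled:
        # Tear: stages 1 and 2 hold, stage 3 drains and is left as an empty slot.
        return [s1, s2, None], retired
    # Free-flowing case: every instruction moves down one stage.
    return [next_fetch, s1, s2], retired


state, done = step_with_tearing(["i3", "i2", "i1"], True, "i4")
print(state, done)   # i1 completes; i2 and i3 are held; stage 3 is now empty
state, done = step_with_tearing(state, False, "i4")
print(state, done)   # the stall has cleared and the pipeline flows again
```

Leaving stage 3 empty after the tear is what keeps `i1` from being executed a second time on the following cycle.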

Referring now to Fig. 4, the method of controlling a multi-stage pipeline using the pipeline tearing concept of the invention is described. The first step 402 of the method 400 comprises generating an instruction set comprising a plurality of instruction words to be run on the processor. This instruction set is typically stored in an on-chip program storage device (such as a program RAM or ROM) of the type well known in the art, although other types of device, including off-chip memory, may also be used. The generation of the instruction set itself is likewise well known in the art, except to the extent that it is modified to include the pipeline tearing function; these modifications are described in greater detail below.

Next, in step 404, the instruction set (program) is fetched sequentially from the storage device in the designated order by, inter alia, the program counter (PC), and is run on the processor, with the fetched instructions being processed in sequence in the various stages of the pipeline. Note that in the context of a RISC processor only load/store instructions may access memory space; hence a plurality of registers may be used in such a processor to physically receive and hold instruction information fetched from program memory. The use of such a load/store architecture and register structure within the processor is well known in the art, and accordingly is not described further.

In step 406, a stall condition in one stage of the pipeline is detected by logic blocks, which combine signals to determine whether a conflict has occurred, typically for access to a data value or other resource. One example is detection of the condition in which a register being read by an instruction is marked as "scoreboarded", meaning that the processor must wait until the register has been updated with a new value. Another example is a state machine generating stall cycles while a multi-cycle operation (such as a shift-and-add multiply) is carried out.
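As an illustration of the second example, a shift-and-add multiply that handles one multiplier bit per cycle holds its stage for several cycles. The sketch below is a behavioural model only; the function name, the `width` parameter, and the convention of counting every cycle but the last as a stall cycle are assumptions made here and are not taken from the disclosure.

```python
def shift_add_multiply(a, b, width=8):
    """Multiply two unsigned values, consuming one bit of b per cycle.

    b must fit in `width` bits. Returns (product, stall_cycles): in this
    model the stage is stalled on every cycle except the last one, when
    the result becomes available.
    """
    product = 0
    cycles = 0
    for bit in range(width):
        if (b >> bit) & 1:
            product += a << bit  # add the shifted multiplicand
        cycles += 1
    return product, cycles - 1


print(shift_add_multiply(13, 11))          # 8-bit operands
print(shift_add_multiply(5, 3, width=4))   # 4-bit operands
```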

In step 408, stage N+1 of the pipeline is checked for the presence of a valid instruction (where N is the number of the stage found to be stalled in step 406). In this context, a "valid instruction" is one that has not been marked "invalid" for any reason (step 410) and that has successfully completed processing in the previous stage N (step 412). For example, in an embodiment associated with Applicant's ARC Core, the "p3iv" signal (that is, "stage 3 instruction valid") is used to indicate that stage 3 of the pipeline contains a valid instruction. The instruction in stage 3 may be invalid for several reasons, including:

1. The instruction was marked invalid when it moved into stage 2 (p2iv = '0'), and therefore remained invalid when it moved into stage 3;

2. The instruction in stage 3 was marked invalid by the pipeline tearing logic in a previous cycle, and has not since been replaced by an instruction moving from stage 2 into stage 3.

Note that the "stop" condition leading from step 410 results from the condition "invalid = yes", since a tear can occur only when valid instructions are present in both stage 2 and stage 3.

Note that where the instruction present in stage 2 is determined in step 412 to be unable to complete processing (item 2 above) while the instruction in stage 3 is able to complete processing, the instruction in stage 3 must be allowed to leave the pipeline (or move on to the next stage), and stage 3 must be marked invalid to fill the resulting gap, per step 414. Alternatively, a NOP or other blank instruction may be inserted into stage 3 and the stage marked valid. If neither the blank is inserted nor the stage marked invalid, the instruction that was processed in stage 3 while the instruction in stage 2 could not complete would be executed again on the next instruction cycle, which is undesirable.

Note further that, for the interlock of the "v6" embodiment of Applicant's ARC Core (described in detail in Applicant's co-pending U.S. patent application entitled "Method And Apparatus For Jump Control In A Pipelined Processor"), stage 2 of the pipeline will stall if it contains a jump instruction and stage 3 contains a flag-setting instruction. The pipeline tearing function of the present invention is therefore required for the v6 jump interlock.

Lastly, in step 418, the valid instruction present in stage 3 (and in the subsequent stages of a pipeline having five or more stages) is executed on the next clock cycle, while the stalled instruction present in stage 2 is held in that stage. Note that processing of the stalled instruction in stage 2 may occur on a subsequent clock cycle, depending on the state of the stall/interlock signal that caused the stall. Once that stall/interlock signal is deasserted, processing of the stalled instruction in that stage begins on the leading edge of the next instruction cycle.

The following exemplary code, extracted from Appendix I hereto, is used in conjunction with Applicant's ARC Core (three stage pipeline variant) to implement the "tearing" function described above:

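    -- Next-state value of the stage 3 "instruction valid" bit:
    -- keep ip3iv when ien3 = '0'; force '0' when ien2 = '0' and ien3 = '1';
    -- otherwise take ip2iv.
    n_p3iv <= ip3iv WHEN ien3 = '0' ELSE
              '0'   WHEN ien2 = '0' AND ien3 = '1' ELSE
              ip2iv;

    -- Registered stage 3 valid bit, asynchronously cleared by clr.
    p3ivreg : PROCESS (ck, clr)
    BEGIN
      IF clr = '1' THEN
        ip3iv <= '0';
      ELSIF (ck'EVENT AND ck = '1') THEN
        ip3iv <= n_p3iv;
      END IF;
    END PROCESS;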

It will be recognized, however, that coding schemes other than that set forth herein may also be used to implement the pipeline tearing function of the invention, whether for the same or for other processors.
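For readers less familiar with VHDL, the selection made for `n_p3iv` in the Appendix I excerpt above can be restated as a small function. This is an illustrative restatement only; the function name `next_p3iv` is hypothetical, and reading `ien2` and `ien3` as per-stage enable signals (0 meaning the stage is held) is an assumption, since the excerpt does not define them.

```python
def next_p3iv(ip3iv, ip2iv, ien2, ien3):
    """Next-state value of the stage 3 'instruction valid' bit (0 or 1)."""
    if ien3 == 0:
        return ip3iv   # stage 3 keeps its current valid bit
    if ien2 == 0:
        return 0       # the tear case: stage 3 is marked invalid
    return ip2iv       # otherwise stage 2's valid bit moves into stage 3


# Enumerate the three branches of the conditional assignment.
print(next_p3iv(ip3iv=1, ip2iv=0, ien2=1, ien3=0))
print(next_p3iv(ip3iv=1, ip2iv=1, ien2=0, ien3=1))
print(next_p3iv(ip3iv=0, ip2iv=1, ien2=1, ien3=1))
```

The middle branch corresponds to marking stage 3 invalid per step 414, so that the instruction which has just left stage 3 is not executed again.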

Pipeline re-assembly ("catch-up") on stall

In addition to the pipeline tearing concept described above, the present invention also provides a mechanism for the converse situation; namely, allowing the earlier stages of the pipeline to continue processing, or "catch up" to the later stages, when blank slots or gaps are present between stages or the pipeline has otherwise been "torn". This function is also referred to as "pipeline transition enable".

As an example of the foregoing, consider the previously described three stage pipeline in which an instruction is stalled in stage 3 and stage 2 is empty or contains a killed instruction or long immediate word (hereinafter referred to as an "unused slot"). Using the catch-up function of the invention, stage 1 is permitted to catch up to stage 2 by allowing stage 1 to continue processing its instruction until complete, whereupon, on the clock edge, that instruction advances into stage 2 and a new instruction enters stage 1. Using this approach, any blank slot or gap between the stalled stage 3 and stage 1 is eliminated. Fig. 5 illustrates this concept.
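The catch-up example above can be sketched in the same illustrative style as a single clock step. The function name `step_with_catch_up` and the use of `None` for an unused slot are assumptions of the sketch, and the sketch considers a stall in stage 3 only.

```python
def step_with_catch_up(stages, stage3_stalled, next_fetch):
    """Advance a (fetch, decode, execute) pipeline by one clock.

    stages: [s1, s2, s3], each an instruction name or None for an unused slot.
    Returns the new stage contents.
    """
    s1, s2, s3 = stages
    if not stage3_stalled:
        # Free-flowing case: every instruction moves down one stage.
        return [next_fetch, s1, s2]
    if s2 is None:
        # Catch-up: stage 3 holds, stage 1 fills the unused slot in stage 2.
        return [next_fetch, s1, s3]
    # No gap to close: the stages behind the stall hold as well.
    return [s1, s2, s3]


state = step_with_catch_up(["i3", None, "i1"], True, "i4")
print(state)   # the gap in stage 2 has been closed
print(step_with_catch_up(state, True, "i5"))   # nothing further can move
```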

Referring to Fig. 6, the method of controlling a multi-stage processor pipeline using the "catch-up" technique of the invention is described. In the first step 602 of the method 600, the validity of the instruction in a first stage (stage 2 in the illustrated example) is determined. In the context of pipeline catch-up, a valid instruction is defined simply as one that was not marked invalid when it entered its current stage (stage 2 in the illustrated example). If the instruction is invalid per step 602, the pipeline transition enable signal is set "true" per step 602, as discussed in greater detail below. This pipeline transition enable signal controls the transition of the instruction word from stage 1 into stage 2. If the instruction in stage 3 is unable to complete processing in this event, pipeline "catch-up" will occur: the invalid slot in stage 2 is replaced by the instruction advancing from stage 1, and the instruction in stage 3 remains in stage 3.

If the instruction in stage 2 is valid per step 602, the ability of that valid instruction to complete processing in stage 2 is next determined in step 604. If the valid instruction cannot complete processing and move out of stage 2 on the next cycle, the transition enable signal is set "false" per step 606, thereby disabling the pipeline transition. This prevents the valid, pending instruction from being replaced by the instruction advancing from the previous stage (stage 1). Next, if the valid instruction in stage 2 is able to complete processing, it is determined in step 608 whether an interrupt pseudo-instruction is present in stage 2 waiting for a pending instruction fetch to complete. If so, the transition enable signal is again set "false", which again prevents the valid instruction in stage 2 from being replaced, since the valid (but incomplete) instruction cannot advance to stage 3 on the next cycle. If the valid instruction in stage 2 can complete processing on the next cycle and is not waiting for a pending instruction fetch, the transition enable signal is set "true" per step 610, thereby allowing the stage 1 instruction to advance into stage 2 at the same time that the instruction in stage 2 moves into stage 3.

Thus, according to the foregoing logic, the pipeline transition enable signal is always set "true" while the processor is running, except when: (i) a valid instruction in stage 2 cannot complete for some reason; or (ii) an interrupt in stage 2 is waiting for a pending instruction fetch to complete. Note that if an invalid instruction in stage 2 is held (specifically, because of a stall at stage 3), the transition enable signal is set "true" and the instruction in stage 1 is allowed to move into stage 2. The invalid stage 2 instruction is thereby replaced by the valid stage 1 instruction.

The "catch-up" or pipeline transition enable signal (en1) of the present invention may, in one embodiment, be generated using the following exemplary code (excerpted from Appendix II):

    ien1 <= '0' WHEN en = '0'
                OR (p2int = '1' AND ien2 = '0')
                OR (ip2iv = '1' AND ien2 = '0') ELSE
            '1';
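To make the conditions in the excerpt concrete, the expression can be evaluated exhaustively. The following Python sketch is a behavioral model only, not the synthesized logic. The function name `ien1` and the use of Python Booleans in place of `std_ulogic` values are assumptions made for illustration, and the second OR term follows the `ip2iv` form given in Appendix II.

```python
from itertools import product


def ien1(en: bool, p2int: bool, ip2iv: bool, ien2: bool) -> bool:
    """Behavioral model of the stage 1 -> stage 2 enable expression."""
    # Stage 1 is held when the global enable is low, or when stage 2 is not
    # enabled and holds either an interrupt or a valid instruction.
    hold = (not en) or (p2int and not ien2) or (ip2iv and not ien2)
    return not hold


# List every input combination for which stage 1 is allowed to advance.
advancing = [bits for bits in product([False, True], repeat=4) if ien1(*bits)]
for en, p2int, ip2iv, ien2 in advancing:
    print(f"en={en:d} p2int={p2int:d} ip2iv={ip2iv:d} ien2={ien2:d} -> ien1=1")
```

Among the combinations listed, the one with `ien2` low and both `p2int` and `ip2iv` low is the catch-up case: stage 2 is not enabled, but because its slot is invalid, stage 1 may still advance into it.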

It is also noted that the pipeline tearing and catch-up methods of the present invention may be used (either individually or together) in conjunction with other methods of pipeline control and interlock. These include, inter alia, the methods disclosed in Applicant's co-pending U.S. patent application entitled "Method and Apparatus for Jump Control in a Pipelined Processor", and in Applicant's co-pending U.S. patent application entitled "Method and Apparatus for Jump Delay Slot Control in a Pipelined Processor", both filed contemporaneously herewith and both incorporated herein by reference in their entirety. Furthermore, various register encoding schemes, such as the "loose" register encoding described in Applicant's co-pending U.S. patent application entitled "Method and Apparatus for Loose Register Encoding Within a Pipelined Processor", filed contemporaneously herewith and incorporated herein by reference in its entirety, may be used in conjunction with the pipeline tearing and/or catch-up inventions described herein.

Method of Synthesis

Referring now to Fig. 7, the method 700 of synthesizing logic incorporating the pipeline tearing and/or catch-up functionality previously discussed is described. The generalized method of synthesizing integrated circuit logic having a user-customized (i.e., "soft") instruction set is disclosed in Applicant's co-pending U.S. Patent Application Serial No. 09/418,663, entitled "Method And Apparatus For Managing The Configuration And Functionality Of A Semiconductor Design", filed October 14, 1999, which is incorporated herein by reference in its entirety.

While the following description is presented in terms of an algorithm or computer program running on a computer or other similar processing device, it will be recognized that other hardware environments (including microcomputers, workstations, networked computers, "supercomputers", and mainframes) may be used to practice the method. Additionally, one or more portions of the computer program may be embodied in hardware or firmware as opposed to software, if desired, such alternate embodiments being well within the skill of the computer artisan.

Initially, in the first step 702, user input is obtained regarding the design configuration. Specifically, the desired modules or functions for the design are selected by the user, and instructions relating to the design are added, removed, or generated as necessary. For example, in signal processing applications, it is often advantageous for the CPU to include a single "multiply and accumulate" (MAC) instruction. In the present invention, the instruction set of the synthesized design is modified so as to incorporate the foregoing pipeline tearing and/or catch-up modes (or another comparable pipeline control architecture). The technology library location for each VHDL file is also specified by the user in step 702. The technology library files in the present invention store all of the information related to the cells necessary for the synthesis process, including, for example, logical function, input/output timing, and any associated constraints. In the present invention, each user can specify his or her own library name and location, thereby adding further flexibility.

Next, in step 703, customized HDL functional blocks are created based on the user input and the existing library of functions specified in step 702.

In step 704, the design hierarchy is determined based on the user input and the aforementioned library files. A hierarchy file, new library file, and makefile are subsequently generated based on the design hierarchy. The term "makefile" as used herein refers to the commonly used UNIX makefile function, or to a similar function of a computer system well known to those of skill in the computer programming arts. The makefile function causes other programs or algorithms resident in the computer system to be executed in the specified order. In addition, it further specifies the names and locations of the data files and other information necessary for successful operation of the specified programs. It is noted, however, that the invention disclosed herein may utilize file structures other than the "makefile" type to produce the desired functionality.

In one embodiment of the makefile generation process of the present invention, the user is interactively asked via display prompts to input information relating to the desired design, such as the type of "build" (e.g., overall device or system configuration), the external memory system data bus, different types of extensions, cache type/size, and so forth. Many other configurations and sources of input information may be used, however, consistent with the invention.

In step 706, the user runs the makefile generated in step 704 to create the structural HDL. This structural HDL ties the discrete functional blocks in the design together so as to make a complete design.

Next, in step 708, the script generated in step 706 is run to create a makefile for the simulator. The user also runs this script in step 708 to generate a synthesis script.

At this point in the program, a decision is made whether to synthesize or to simulate the design (step 710). If simulation is chosen, the user runs the simulation in step 712 using the generated design and the simulation makefile. Alternatively, if synthesis is chosen, the user runs the synthesis using the synthesis script and the generated design. After completion of the synthesis/simulation scripts, the adequacy of the design is evaluated in step 716. For example, a synthesis engine may create a specific physical layout of the design that meets the performance criteria of the overall design process yet does not meet the die size requirements. In that case, the designer changes the control files, libraries, or other elements that can affect the die size. The resulting set of design information is then used to re-run the synthesis script.

If the generated design is acceptable, the design process is complete. If the generated design is not acceptable, the process steps beginning with step 702 are performed again until an acceptable design is achieved. In this fashion, the method 700 is iterative.
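The iterative character of the method 700 can be sketched as a simple loop. The Python below is a minimal illustration under stated assumptions, not the program of the disclosed embodiment: the function `run_design_loop`, its callable parameters, the `max_iterations` limit, and every number in the toy stand-ins (cache sizes, areas, and the area limit) are hypothetical and chosen only to echo the die size example given above.

```python
from typing import Callable, Dict


def run_design_loop(get_user_input: Callable[[int], Dict],
                    generate_design: Callable[[Dict], Dict],
                    evaluate: Callable[[Dict], Dict],
                    is_acceptable: Callable[[Dict], bool],
                    max_iterations: int = 10) -> Dict:
    """Repeat the flow from step 702 until the design is acceptable."""
    for iteration in range(1, max_iterations + 1):
        config = get_user_input(iteration)   # step 702: user configuration
        design = generate_design(config)     # steps 703-708: HDL, makefiles, scripts
        report = evaluate(design)            # step 710 onward: simulate or synthesize
        if is_acceptable(report):            # step 716: adequacy check
            return report
    raise RuntimeError("no acceptable design within the iteration limit")


# Toy stand-ins (all values hypothetical): each pass halves the cache size,
# and die area is modeled as a fixed core area plus an area per kilobyte.
def toy_input(iteration: int) -> Dict:
    return {"cache_kb": 16 >> (iteration - 1)}


def toy_generate(config: Dict) -> Dict:
    return dict(config)


def toy_evaluate(design: Dict) -> Dict:
    return {**design, "die_area_mm2": 5.0 + 0.5 * design["cache_kb"]}


def toy_acceptable(report: Dict) -> bool:
    return report["die_area_mm2"] <= 8.0


final = run_design_loop(toy_input, toy_generate, toy_evaluate, toy_acceptable)
print(final)
```

In this toy run the first two passes exceed the assumed area limit, so the loop returns to the configuration step twice before an acceptable result is reached.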

Referring now to Figs. 8a-8b, one embodiment of exemplary gate logic (including the "p3iv" signal referenced in the VHDL of Appendix I) synthesized using the Synopsys Design Compiler and the aforementioned method of Fig. 7 is illustrated. Note that during the synthesis process used to generate the logic of Fig. 8a, an LSI 10k 1.0 μm process was specified, and no constraints were placed on the design. For Fig. 8b, the same process was used; however, the design was constrained on the path from ien3 to the clock. Appendix III contains the coding used to generate the exemplary gate logic of Figs. 8a-8b.

Referring now to Figs. 8c-8d, one embodiment of exemplary gate logic (including the "ien1" signal referenced in the VHDL of Appendix II) synthesized using the method of Fig. 7 is illustrated. Note that during the synthesis process used to generate the logic of Fig. 8c, an LSI 10k 1.0 μm process was specified, and no constraints were placed on the design. For Fig. 8d, the same process was used; however, the design was constrained to prevent the use of AND-OR gates. Appendix IV contains the coding used to generate the exemplary gate logic of Figs. 8c-8d.

Fig. 9 illustrates an exemplary pipelined processor fabricated using a 1.0 μm process and incorporating the pipeline tearing and catch-up modes previously described herein. As shown in Fig. 9, the processor 900 is an ARC microprocessor-like CPU device having, inter alia, a processor core 902, on-chip memory 904, and an external interface 906. The device is fabricated using the customized VHDL design obtained with the method 700 of the present invention, which is subsequently synthesized into a logic-level representation and then reduced to a physical device using compilation, layout, and fabrication techniques well known in the semiconductor arts.

It will be appreciated by one skilled in the art that the processor of Fig. 9 may contain any commonly available peripheral, such as serial communications devices, parallel ports, timers, counters, high current drivers, analog-to-digital (A/D) converters, digital-to-analog (D/A) converters, interrupt processors, LCD drivers, memories, and other similar devices. Further, the processor may also include custom or application-specific circuitry. The present invention is not limited to the type, number, or complexity of the peripherals and other circuitry that may be combined using the method and apparatus. Rather, any limitations are imposed by the physical capacity of existing semiconductor processes, which improve over time. It is therefore anticipated that the complexity and degree of integration possible using the present invention will further increase as semiconductor processes improve.

It is also noted that many IC designs currently use a microprocessor core or a DSP core. The DSP, however, may be required for only a limited number of DSP functions (such as finite impulse response analysis or voice encoding), or for the IC's fast DMA architecture. The invention disclosed herein can support many DSP instruction functions, and its fast local RAM system gives immediate access to data. Appreciable cost savings may be realized by applying the methods disclosed herein to both the CPU and the DSP functions of the IC.

Additionally, it is noted that the methodology (and associated computer program) previously described herein can readily be adapted to newer manufacturing technologies, such as 0.18 or 0.1 micron processes, with a comparatively simple re-synthesis, rather than the lengthy and expensive process typically required to adapt to such technologies when using prior art "hard" micro systems.

Referring now to Fig. 10, one embodiment of a computing device capable of synthesizing logic that implements the tearing/catch-up signals disclosed herein is described. The computing device 1000 comprises a motherboard 1001 having a central processing unit (CPU) 1002, random access memory (RAM) 1004, and a memory controller 1005. A storage device 1006 (such as a hard disk drive or CD-ROM), an input device 1007 (such as a keyboard or mouse), and a display device 1008 (such as a CRT, plasma, or TFT display), as well as the buses necessary to support the operation of the host and peripheral components, are also provided. The aforementioned VHDL descriptions and synthesis engine are stored in the form of an object code representation of a computer program in the RAM 1004 and/or the storage device 1006 for use by the CPU 1002 during design synthesis, the latter being well known in the computing arts. During system operation, the user (not shown) synthesizes logic designs by entering design configuration specifications into the synthesis program via the program displays and the input device 1007. Synthesized designs generated by the program are stored in the storage device 1006 for later retrieval, displayed on the graphic display device 1008, or, if desired, output via a serial or parallel interface 1012 to an external device such as a printer, data storage unit, or other peripheral component.

While the above detailed description has shown, described, and pointed out novel features of the invention as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the invention. The description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles of the invention. The scope of the invention should be determined with reference to the claims.

Appendix I: VHDL Used to Generate Synthesized Logic for Pipeline Tearing

library ieee;
use ieee.std_logic_1164.all;

entity v007a is
port (ck    : in  std_ulogic;
      clr   : in  std_ulogic;
      ien2  : in  std_ulogic;
      ien3  : in  std_ulogic;
      ip2iv : in  std_ulogic;
      p3iv  : out std_ulogic);
end v007a;

architecture synthesis of v007a is
   signal n_p3iv : std_ulogic;
   signal ip3iv  : std_ulogic;
begin

   -- Next state: hold when stage 3 is stalled, mark invalid when stage 3
   -- advances while stage 2 is held, otherwise take the stage 2 value.
   n_p3iv <= ip3iv WHEN ien3 = '0' ELSE
             '0'   WHEN ien2 = '0' AND ien3 = '1' ELSE
             ip2iv;

   -- Register with asynchronous clear.
p3ivreg : PROCESS (ck, clr)
BEGIN
   IF clr = '1' THEN
      ip3iv <= '0';
   ELSIF (ck'EVENT AND ck = '1') THEN
      ip3iv <= n_p3iv;
   END IF;
END PROCESS;

   p3iv <= ip3iv;

end synthesis;
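The next-state selection in the listing above can be followed cycle by cycle with a small behavioral model. The Python sketch below is illustrative only and is not part of the appendix: the function names `next_p3iv` and `run` are assumptions, Python Booleans stand in for `std_ulogic` values, and the asynchronous clear is represented only by the initial value of the flag.

```python
def next_p3iv(ip3iv: bool, ip2iv: bool, ien2: bool, ien3: bool) -> bool:
    """Next value of the stage 3 'instruction valid' flag."""
    if not ien3:
        # Stage 3 is stalled: keep whatever validity it already holds.
        return ip3iv
    if not ien2:
        # Tearing: stage 3 advances while stage 2 is held, so the slot
        # left behind in stage 3 is marked invalid.
        return False
    # Normal flow: stage 3 inherits the validity of the stage 2 instruction.
    return ip2iv


def run(cycles, ip3iv=False):
    """Clock the flag through a list of (ip2iv, ien2, ien3) tuples."""
    trace = []
    for ip2iv, ien2, ien3 in cycles:
        ip3iv = next_p3iv(ip3iv, ip2iv, ien2, ien3)
        trace.append(ip3iv)
    return trace


# Cycle 1: normal advance of a valid instruction into stage 3.
# Cycle 2: stage 2 stalls while stage 3 advances (tear), leaving an invalid slot.
# Cycle 3: stage 3 itself stalls, so the invalid slot is held.
print(run([(True, True, True), (True, False, True), (True, True, False)]))
```

Marking the vacated slot invalid in the second cycle is what keeps the instruction that has already left stage 3 from being processed a second time.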

Appendix II: VHDL Used to Generate Synthesized Logic for Pipeline Catch-up

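library ieee;
use ieee.std_logic_1164.all;

entity v007b is
port (en    : in  std_ulogic;
      p2int : in  std_ulogic;
      ien2  : in  std_ulogic;
      ip2iv : in  std_ulogic;
      ien1  : out std_ulogic);
end v007b;

architecture synthesis of v007b is
begin

   -- Hold stage 1 when the pipeline is disabled, or when stage 2 is not
   -- enabled and contains an interrupt or a valid instruction.
   ien1 <= '0' WHEN en = '0'
               OR (p2int = '1' AND ien2 = '0')
               OR (ip2iv = '1' AND ien2 = '0') ELSE
           '1';

end synthesis;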

Appendix III: Synthesis Script Used to Generate Sample Schematics for the Tearing Logic

/* Analyze VHDL */
analyze -library user -format vhdl vhdl/v007a.vhdl

/* Unconstrained logic */
elaborate -library user v007a
compile
write -format db -hierarchy -output db/v007a_uc.db
create_schematic -schematic_view
plot -output v007a_uc.ps
remove_design -all

/* Constrained logic */
elaborate -library user v007a
create_clock -name "ck" -period 10 -waveform {0 5} ck
set_input_delay -clock ck 8 ien3
compile
write -format db -hierarchy -output db/v007a_c.db
create_schematic -schematic_view
plot -output v007a_c.ps

Appendix IV: Synthesis Script Used to Generate Sample Schematics for the Catch-up Logic

/* Analyze VHDL */
analyze -library user -format vhdl vhdl/v007b.vhdl

/* Unconstrained logic */
elaborate -library user v007b
compile
write -format db -hierarchy -output db/v007b_uc.db
create_schematic -schematic_view
plot -output v007b_uc.ps
remove_design -all

/* Constrained logic */
elaborate -library user v007b
set_max_area 0
set_dont_use find(cell, lsi_10k/AO*)
compile -map_effort high
write -format db -hierarchy -output db/v007b_c.db
create_schematic -schematic_view
plot -output v007b_c.ps

Claims

1. A method of operating a processor having a pipeline, comprising:

providing a first pipeline stage capable of processing a first instruction;

providing a second pipeline stage, said second pipeline stage being downstream of said first pipeline stage and further adapted to process a second instruction;

stalling said first instruction in said first pipeline stage; and

processing said second instruction in said second pipeline stage after said first pipeline stage has been stalled.

2. The method of claim 1, wherein said pipeline comprises a three-stage pipeline, and the acts of providing said first and second pipeline stages comprise providing an instruction decode stage and an execution stage, respectively.

3. The method of claim 1, wherein the act of stalling comprises:

detecting an interlock condition; and

generating an interlock signal, said signal being adapted to stall said first pipeline stage.

4. The method of claim 3, further comprising determining the validity of said instruction before processing said second instruction in said second pipeline stage.

5. A method of operating a processor having a pipeline, said pipeline comprising at least first, second, and third stages, the method comprising:

providing an instruction in each of said stages of said pipeline;

stalling an instruction in said first stage;

processing an instruction in said second stage after said first stage has been stalled;

moving the instruction processed in said second stage to said third stage; and

inserting a blank slot into said second stage of said pipeline so as to prevent the instruction processed in said second stage from being executed multiple times.

6. The method of claim 5, wherein said first stage comprises an instruction fetch stage, said second stage comprises an instruction decode stage, and said third stage comprises an execution stage.

7. The method of claim 6, wherein the act of stalling said first stage comprises:

detecting an interlock condition between said first stage and at least one other stage in said pipeline; and

stalling said first stage in response to said interlock condition.

8. The method of claim 5, wherein said pipeline further comprises a fourth stage following said third stage.

9. The method of claim 8, further comprising:

processing an instruction in said third stage after said first stage has been stalled; and

moving the instruction processed in said third stage to said fourth stage when the instruction processed in said second stage is moved to said third stage.

10. The method of claim 7, further comprising:

providing a flag setting instruction in said second stage and a jump instruction in said first stage;

detecting at least one condition of one or more flags set by said at least one flag setting instruction, which condition may affect the subsequent execution of said at least one jump instruction; and

stalling the execution of said at least one jump instruction in said first stage of said pipeline at least until all of the flags to be set by said at least one flag setting instruction have been set.

11. the method for an overall treatment device design comprises:

Generate first file, comprise a plurality of instruction words specific to described design;

Information is imported described first file comprising an instruction set, thereby make the execution of at least one instruction word in described phase one of described processor, proceed after being able in described a plurality of instruction words another flow line stage having got clogged before slightly;

Stipulate the position of at least one library file;

Use described first file, described library file and user's input information and generate a manuscript;

Move described manuscript to generate the descriptive language model of a customization; And

Based on described descriptive language model and comprehensive described design.

12. as the method for claim 11, wherein comprehensive action comprises based on the descriptive language model of described customization and moves comprehensive manuscript.

13. The method of claim 12, further comprising the acts of generating a second file for use in simulation, and simulating said design using said second file.

14. The method of claim 13, further comprising the act of evaluating the acceptability of the design based on said simulation.

15. The method of claim 14, further comprising the acts of revising the design to produce a revised design, and re-synthesizing said revised design.

16. as the method for claim 11, wherein Shu Ru action has comprised and selects a plurality of input parameters corresponding to described design, described parameter comprises:

(i) cache arrangement; And

(ii) memory interface configuration.

17. A machine readable data storage device comprising:

a data storage medium adapted to store a plurality of data bits; and

a computer program rendered as a plurality of data bits and stored on said data storage medium, said program being adapted to run on the processor of a computer system and to synthesize integrated circuit logic for use in a processor having a pipeline, said processor logic being further adapted to:

detect a stalled instruction in a first stage of said pipeline;

detect a valid instruction in a second stage of said pipeline; and

continue the execution of said valid instruction in said second stage while maintaining said second stage stalled.

18. A processor comprising:

at least one pipeline having at least a first stage and a second stage;

means for detecting a stalled instruction in said first stage;

means for detecting a valid instruction in said second stage; and

means for executing said valid instruction in said second stage while maintaining the second stage stalled.

19. A digital processor comprising:

a processor core having a multistage instruction pipeline with at least first, second, and third stages, said core being adapted to decode and execute an instruction set comprising a plurality of instruction words;

a data interface between said processor core and an information storage device; and

an instruction set comprising a plurality of instruction words, said processor and said instruction set being further adapted to:

(i) detect a first instruction stalled in said second stage of said pipeline;

(ii) detect when a valid instruction is present in said third stage of said pipeline; and

(iii) execute the valid instruction in said third stage after said second stage has been stalled.

20. The processor of claim 19, said processor and said instruction set being further adapted to:

(iv) detect a stalled instruction present in said third stage of said pipeline;

(v) detect an unused slot between said third stage and an instruction present in said first stage of said pipeline; and

(vi) process said instruction present in said first stage and advance said instruction to said second stage so as to eliminate said unused slot.

21. A digital processor comprising:

a processor core having a multistage instruction pipeline with at least first, second, and third stages, said core being adapted to decode and execute an instruction set comprising a plurality of instruction words;

(i) detect a stalled instruction present in said third stage of said pipeline;

(ii) detect an unused slot between said third stage and an instruction present in said first stage of said pipeline;

(iii) process said instruction present in said first stage while maintaining said third stage stalled, thereby eliminating said unused slot.

22. The processor of claim 21, wherein said unused slot comprises a slot selected from the group consisting of:

(i) a blank slot;

(ii) a slot containing a canceled instruction; and

(iii) a slot containing a long immediate word.

23. A digital processor having an associated data storage device and at least one pipeline comprising at least first, second, and third stages, wherein the execution of instructions within said at least one pipeline is controlled by a method comprising the steps of:

providing an instruction set comprising a plurality of instruction words;

storing at least a portion of said instruction set in said storage device;

running at least a portion of said instruction set on said processor;

detecting a first instruction stalled in said second stage of said pipeline;

detecting when a valid instruction is present in said third stage of said pipeline; and

executing the valid instruction in said third stage while maintaining said first instruction stalled in said second stage.

24. A method of operating a processor having a pipeline, said pipeline comprising at least first, second, and third stages, the method comprising:

providing an instruction in each of said stages of said pipeline;

stalling an instruction in said second stage;

processing an instruction in said third stage after said second stage has been stalled;

moving said processed instruction out of said third stage; and

inserting a blank slot into said third stage of said pipeline so as to prevent the processed instruction from being processed multiple times in said third stage.