US20010014940A1 - Dynamic allocation of resources in multiple microprocessor pipelines - Google Patents
- Publication number: US20010014940A1
- Authority: US (United States)
- Legal status: Abandoned (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F9/3824—Operand accessing
- G06F9/3867—Concurrent instruction execution using instruction pipelines
- G06F9/3885—Concurrent instruction execution using a plurality of independent parallel functional units
(all within G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead; G06F9/00—Arrangements for program control; G06F—Electric digital data processing; G06—Computing, calculating or counting; G—Physics)
Abstract
Three parallel instruction processing pipelines of a microprocessor share two data memory ports for obtaining operands and writing back results. Since a significant proportion of the instructions of a typical computer program do not require reading operands from the memory, the probability is high that at least one of any three program instructions to be executed at the same time need not fetch an operand from memory. The two memory ports are thus connected at any given time with the two of the three pipelines that are processing instructions requiring memory access, while the pipeline without access to the memory processes an instruction that does not need it. Operated in this manner, the added third pipeline need not have all the same resources as the other two, so its stages are given reduced capability in order to save space and reduce power consumption. The stages of the three pipelines are also dynamically interchanged in response to the specific combination of three instructions being processed at the same time, in order to increase the overall rate of instruction processing.
Description
- This is a continuation-in-part of copending patent application Ser. No. 09/062,804, filed Apr. 20, 1998, which application is expressly incorporated herein in its entirety by this reference.
- This invention relates generally to the architecture of microprocessors, and, more specifically, to the structure and use of parallel instruction processing pipelines.
- A multi-staged pipeline is commonly used in a single integrated circuit chip microprocessor. A different step of the processing of an instruction is accomplished at each stage of the pipeline. For example, one important stage generates from the instruction and other data to which the instruction points, such as data stored in registers on the same chip, an address of the location in memory where an operand is stored that needs to be retrieved for processing. A next stage of the pipeline typically reads the memory at that address in order to fetch the operand and make it available for use within the pipeline. A subsequent stage typically executes the instruction with the operand and any other data pointed to by the instruction. The execution stage includes an arithmetic logic unit (ALU) that uses the operand and other data to perform either a calculation, such as addition, subtraction, multiplication, or division, or a logical combination according to what is specified by the instruction. The result is then, in a further stage, written back into either the memory or into one of the registers. As one instruction is moved along the pipeline, another is right behind it so that, in effect, a number of instructions equal to the number of stages in the pipeline are being simultaneously processed.
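- The flow just described can be traced in software. The following Python fragment is purely illustrative — the memory contents, register names and two-opcode ALU are invented for this example and are not taken from the patent — but it walks one instruction through the address generation, operand fetch, execution and write back steps described above.

```python
# Illustrative single-instruction walk through the four classic stages.
# All names and values here are hypothetical, not from the patent.

memory = {0x100: 7}           # data memory: address -> operand
regs = {"r1": 3, "r2": 0}     # on-chip register file

def address_generation(base_reg, displacement):
    # AG stage: form the operand's memory address from a register + immediate.
    return regs[base_reg] + displacement

def operand_fetch(address):
    # OF stage: read the operand from data memory at the generated address.
    return memory[address]

def execute(op, a, b):
    # EX stage: an ALU combines the operand with other data per the opcode.
    return {"add": a + b, "sub": a - b}[op]

def write_back(dest_reg, result):
    # WB stage: store the result into a register (or back into memory).
    regs[dest_reg] = result

# e.g. "add r2, [r1 + 0xFD]": r2 = r2 + memory[r1 + 0xFD]
addr = address_generation("r1", 0xFD)   # 3 + 0xFD = 0x100
operand = operand_fetch(addr)           # 7
write_back("r2", execute("add", regs["r2"], operand))
print(regs["r2"])                       # 7
```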
- Two parallel multi-stage pipelines are also commonly used. Two instructions may potentially be processed in parallel as they move along the two pipelines. When some interdependency exists between two successive instructions, however, they often cannot be started along the pipeline at the same time. One such interdependency is where the second instruction requires for its execution the result of the execution of the first instruction. Each of the two pipelines has independent access to a data memory through one of two ports for reading operands from it and writing results of the instruction execution back into it. The memory accessed by the pipelines is generally on the integrated circuit chip as cache memory, which, in turn, accesses other semiconductor memory, a magnetic disk drive or other mass storage that is outside of the single microprocessor integrated circuit chip.
- It continues to be a goal of processor design to increase the rate at which program instructions are processed. Therefore, it is the primary object of the present invention to provide an architecture for a pipelined microprocessor that makes possible an increased instruction processing throughput.
- It is another object of the present invention to provide such a pipelined microprocessor that minimizes the additional amount of power consumed and integrated circuit space required to obtain a given increase in the rate of processing program instructions.
- These and additional objects are accomplished by the various aspects of the present invention, wherein, briefly and generally, according to one such aspect, three or more parallel pipelines are provided without having to use more than two data memory ports to retrieve operands or store the results of the instruction processing. The use of a memory with more than two ports, or of two or more separate data memories, is undesirable because of the complexity, power consumption and integrated circuit space that such memories demand. It has been recognized, as part of the present invention, that since a significant proportion of the individual instructions of most programs do not need access to data memory in order to be executed, an extra pipeline without such access still results in a significant increase in processing speed without a disproportionate increase in the amount of circuitry or power consumption. In a specific implementation of this aspect of the invention, three instructions are processed in parallel in three pipelines at one time so long as one of those instructions does not need access to the data memory. The two ports of the data memory are made available to the two pipelines processing instructions that need access to the data memory, while the third pipeline processes an instruction that does not require such access.
- A three pipeline architecture is preferred. If all three instructions queued for entry into the three pipelines at one time need access to the data memory, then one of the instructions is held. In this case, the third pipeline is not fully utilized for at least one cycle, but this does not occur excessively because of the high proportion of instructions in most operating systems and programs that do not need access to the data memory. A fourth pipeline may further be added for use with a two port data memory if the proportion of instructions not needing data memory access is high enough to justify the added integrated circuit space and power consumed by the additional pipeline circuitry.
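- The arithmetic behind this design choice is simple: if a fraction p of instructions needs a data memory port then, under a rough independence assumption, all three instructions of a set need one only with probability p³ (p = 0.5 gives about 12.5%), so the held-instruction case is uncommon. Below is a minimal software sketch of the two-port arbitration just described; the instruction encoding and port names are invented for illustration and are not the patent's circuitry.

```python
# Assign at most two memory ports among three co-issued instructions;
# a third memory-access instruction is held for the next cycle.

def allocate_ports(instructions, num_ports=2):
    """instructions: list of (name, needs_memory) pairs in program order."""
    issued, held, ports_used = [], [], 0
    for name, needs_memory in instructions:
        if needs_memory:
            if ports_used < num_ports:
                ports_used += 1
                issued.append((name, "port " + "AB"[ports_used - 1]))
            else:
                held.append(name)            # both ports taken: hold a cycle
        else:
            issued.append((name, "no memory access"))
    return issued, held

print(allocate_ports([("load1", True), ("add", False), ("load2", True)]))
# ([('load1', 'port A'), ('add', 'no memory access'), ('load2', 'port B')], [])
print(allocate_ports([("load1", True), ("load2", True), ("load3", True)]))
# ([('load1', 'port A'), ('load2', 'port B')], ['load3'])
```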
- According to another aspect of the present invention, the third pipeline is made simpler than the other two, since there is also a high proportion of instructions that do not need the complex, high performance pipeline stages normally supplied for processing the most demanding instructions. A preferred form of the present invention includes two pipelines with stages having the normal full capability, while at least some of the stages of the third pipeline are significantly simplified. In a specific implementation of this aspect of the present invention, the address generation stage of the third pipeline is made simpler than the address generation stages of the other two pipelines. The third address generation stage may, for example, be especially adapted to calculate only instruction addresses in response to jump instructions. The ALU of the execution stage of the third pipeline is also, in a specific implementation, made to be much simpler than the ALUs of the other two pipelines. The third ALU, for example, may be dedicated to executing move instructions. The simpler third pipeline stages minimize the extra integrated circuit space and power required by the third pipeline. Yet, a significant increase in the throughput of processing instructions is achieved.
- According to a further aspect of the present invention, individual ones of the multiple stages of each of the pipelines are interconnectable with each other between the pipelines, in order to take advantage of a multiple pipelined architecture in which the capability and functions performed by a given stage of one pipeline differ from those of the same stage of another pipeline. This allows the pipelines to be dynamically configured according to the needs of each instruction. Stages capable of processing a given instruction are connected together without, in most cases, having to use stages with excessive capability. One instruction, for example, may require a full capability address generator but only the simplest ALU, so the instruction is routed through these two stages. For another instruction, as another example, no address generator may be necessary but a full capability ALU may be required.
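- A hypothetical routing table makes this concrete. The instruction classes and capability labels below are invented for illustration; the point is only that each instruction is matched with the least capable stage in each position that suffices for it.

```python
# Invented instruction classes mapped to the cheapest adequate stages.
STAGE_NEEDS = {
    "mem-arith": {"ag": "full",   "alu": "full"},   # operand from memory
    "reg-arith": {"ag": None,     "alu": "full"},   # operands in registers
    "reg-move":  {"ag": None,     "alu": "move"},   # move unit suffices
    "jump":      {"ag": "simple", "alu": None},     # reduced AG adder only
}

def route(kind):
    needs = STAGE_NEEDS[kind]
    path = []
    if needs["ag"]:
        path.append(needs["ag"] + " address generator")
    if needs["alu"]:
        path.append(needs["alu"] + " execution unit")
    return path

for kind in STAGE_NEEDS:
    print(f"{kind:10s} -> {route(kind)}")
```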
- The ideal operation which is sought to be achieved is to have three pipelines operating on three instructions all the time with no more circuitry (and thus no more space or power consumption) than is absolutely necessary to process each instruction. Each of the various aspects of the present invention contributes to moving closer to that ideal, the most improvement being obtained when all of these aspects of the present invention are implemented together.
- Additional objects, advantages, and features of the present invention will become apparent from the following description of its preferred embodiments, which description should be taken in conjunction with the accompanying drawings.
- FIG. 1 is a block diagram of a prior art two pipeline microprocessor architecture;
- FIG. 2 illustrates, in a simplified form, a three pipeline microprocessor architecture utilizing the various aspects of the present invention;
- FIG. 3 illustrates the major stages of a detailed example of a three pipeline microprocessor utilizing the various aspects of the present invention;
- FIG. 4 is a block diagram showing additional details of the ID and IS stages of the microprocessor of FIG. 3;
- FIGS. 5A and 5B illustrate the structure of the queue register of the ID stage shown in FIG. 4, and the form of data stored in it, respectively;
- FIG. 6 is a block diagram illustrating the AG and OF stages of the microprocessor of FIG. 3;
- FIG. 7 is a block diagram of the EX and WB stages of the pipeline of FIG. 3;
- FIG. 8 is a flowchart illustrating a preferred operation of the multiple pipeline microprocessor shown in FIGS. 3-7;
- FIG. 9 is a flowchart showing the operation of the block 411 of the flowchart of FIG. 8; and
- FIG. 10 is a flowchart showing the operation of the block 413 of the flowchart of FIG. 8.
- As background, a prior art architecture of a single chip microprocessor with two pipelines, each having multiple stages, is described with respect to FIG. 1. What is shown in FIG. 1 is provided on a single integrated circuit chip. That includes some on-board memory, usually cache memory, such as an instruction cache 11 and a data cache 13. The instruction cache 11 stores instructions that are frequently being executed, and the data cache 13 stores data that is frequently being accessed to execute the instructions. The instruction and data cache memories 11 and 13, in turn, communicate with memory outside of the microprocessor chip.
- Addresses of instructions and memory are generated in a circuit 15 by an instruction fetch block 17. A main component of the instruction fetch block 17 is a program counter that increments from a starting address within the cache memory 11 through successive addresses in order to serially read out in a circuit 19 successive instructions stored at those addresses. The instruction fetch block 17 is also responsive to an address in a circuit 21 to jump out of order to a specified beginning address, from which the program counter then counts until another jump address is received.
- The instructions read one at a time out of the cache memory 11 are stored in a buffer 23 that decodes them sufficiently so that one instruction is passed through circuits 25 and another instruction is passed through circuits 27 at the same time. The circuits 25 and 27 are the beginnings of the parallel pipeline stages, with the instruction buffer 23 providing an initial stage to each of these pipelines. Latches 29 and 31 store the instructions presented in the circuits 25 and 27 at the beginning of the two pipelines.
- Each of these instructions is also connected with a control unit 33 having outputs that are connected (not shown for simplicity) to most of the other blocks of the pipeline in order to control their operation. The control unit 33 decodes each of the instructions presented in the circuits 25 and 27 in order to specify how each of the stages of the two pipelines is to operate to execute that instruction. For example, a signal from the control unit 33 normally latches the instructions in the circuits 25 and 27 in the respective latches 29 and 31 at the same time, except when the instruction in the circuit 27 depends upon the result of executing the instruction in the circuit 25. In that case, the instruction in the circuit 27 is not stored in the latch 31 at the same time as the instruction is stored in the latch 29. Rather, the instruction in the circuit 27 is entered into a pipeline in a subsequent cycle, so the result of the execution of the first instruction is available to it when required.
- Each of the pipelines includes an address generation stage, their primary components being adders 35 and 37. Each adder calculates an address within the data cache memory 13 where an operand is to be found that is necessary to execute the instruction. The address is calculated by each adder from information provided in the instruction itself or data read from one of several registers 39 that are also provided as part of the microprocessor integrated circuit. According to one architectural standard, eight such registers r1 through r8 are included, while more registers are used in other architectural standards. An instruction often requires data to be read from at least one of the registers in the course of calculating the address.
- The calculated memory addresses of the two instructions being processed in parallel are then stored in latches. These addresses are applied to the data cache memory 13 through interfaces 45 and 47 to retrieve operands from the addressed locations in circuits 49 and 51. These operands are then temporarily stored in latches 53 and 55 at the beginning of the next stage of the pipelines.
- This next stage is the execution stage that includes two ALUs 57 and 59. The operands read from the data cache memory 13, other data stored in the registers 39, and data provided in the instruction itself are all used by the ALUs 57 and 59 to execute each instruction, the results being stored in latches at the beginning of the final stage of the pipelines.
- That final stage includes blocks 65 and 67 that write the results of the instruction execution back into either the cache memory 13 or one of the registers 39. The pipeline utilizing the block 65 writes to the cache memory 13 through its port A, and the second pipeline, through the block 67, writes to the cache memory 13 through its port B.
- It will be recognized that the prior art two pipeline architecture, as illustrated in FIG. 1, includes the maximum capability in each stage that may be required to process each instruction. As a result, many instructions do not use that capability. For example, any instruction that does not need to fetch an operand from the data cache 13 will skip over the address generation and operand fetch stages of the adders 35 and 37, and many instructions do not require the full capability of the ALUs 57 and 59 in order to be executed.
- As part of the present invention, these characteristics of the operation of a two pipelined microprocessor have been recognized to allow the addition of a third pipeline without having to provide access to the data cache memory 13 by that third pipeline. The addition of another port to the data cache 13 requires a different memory that, when implemented, takes much more space and power than is practical. Thus, according to the present invention, a third pipeline without data memory access is utilized to process, in parallel with the two main pipelines, those instructions that do not need such access. And since not all instructions need the full power of a typical high-performance address generation stage adder or execution stage ALU, the third pipeline also implements these stages with a less complex, lower performance adder and ALU that are sufficient for a large proportion of the instructions being processed. These instructions are then implemented in much less space and with the use of much less power than with the full performance stages provided in the other two pipelines.
- In addition, the present invention provides for switching stages between pipelines so that a given instruction has just the resources that it needs for its processing, without consuming additional unnecessary resources.
- An implementation of these various aspects of the present invention is conceptually illustrated in the three pipeline microprocessor of FIG. 2, wherein blocks performing functions substantially as in the prior art system of FIG. 1 are given the same reference numbers. A first stage of the pipelines, common to all three, is an instruction decoding (ID) stage including an instruction queue 71. In this stage, the serial stream of instructions being read out of the instruction cache 11 is separated into its individual instructions, which are usually of variable length. Processing and prediction of target addresses of branch instructions as part of the instruction fetch 17 are given in copending patent application entitled "Improved Branch Prediction Mechanism," of Sean P. Cummings et al., filed Sep. 4, 1998, which application is incorporated herein in its entirety by this reference.
- A next stage, also common to each of the three pipelines, is an instruction issue (IS) stage including a circuit block 73 that receives the instructions from the queue 71 and outputs three at a time into respective latches 81, 83 and 85. The three instructions are also applied to a control unit 87 that decodes them and provides control signals to other stages and blocks of the microprocessor in order to configure them appropriately to provide the proper resources and operation to process each set of instructions.
- The address generation stage of each of the three pipelines includes respective adders 89, 91 and 93. The adders 89 and 91 are full performance adders that are capable of generating an address for any of the known set of instructions, while the adder 93 is made to have less capability but remains capable of performing the adder function with some subset of the full set of instructions that are frequently encountered. This allows the third adder 93 to be efficiently utilized with the other two. In a specific implementation, the third adder 93 is especially designed to respond to jump instructions for calculating an address to which the instruction fetch unit 17 should jump. The jump address calculated by the third adder 93, after being delayed for two operational cycles by being moved through latches, is applied in circuits 99 as an address to the instruction fetch block 17.
- In the implementations of the various aspects of the present invention being described with respect to the drawings, instructions are issued by the block 73 so that three successive instructions are stored in order by the latches 81, 83 and 85. The adder 89 is provided with an input switch 101 that allows it to be connected to receive an instruction from either of the latches 81 or 83. The adder 91 is similarly connectable through an input switch 103 to the instructions in either of the latches 83 or 85, and the third adder 93 through an input switch 105 to the instructions in any of the three latches 81, 83 or 85. Thus, it can be seen that two of the three instructions stored in the latches 81, 83 and 85 are connectable to the full adders 89 and 91 while the remaining instruction, if it can be processed by the third adder 93, is connectable to the adder 93 from any of the latches.
- The outputs of the full adders 89 and 91 are addresses that are stored in latches and applied through respective interface circuits 111 and 113 to the two ports of the data cache memory 13. The resulting operands read from the memory 13 are stored in respective latches at the input of the execution stage, the instructions themselves being carried forward in latches 119 and 121.
- The execution units of the two primary pipelines include full capability ALUs 123 and 125, the third pipeline instead including a logic unit 127 having lesser capability, in this example being dedicated to moving data from one location to another. Each of the ALUs 123 and 125, and the move unit 127, has an accompanying input switch 129, 131 and 133, respectively. Each of the switches is set by signals from the control unit 87 that result from decoding the instructions being executed.
- The input of the move unit 127 is connectable through its switch 133 to either of the two operands read from the memory 13, or to the contents of the latches 121. The switch 131 connects the input of the full capability ALU 125 to any one of four of those same inputs, connection to the instruction which has come through the latch 81 being omitted. Similarly, the ALU 123 is connectable through its input switch 129 to four of the same five inputs, the instruction coming through the latch 85 being omitted.
- Outputs of the ALUs 123 and 125, and of the move unit 127, are connected with respective multiplexers operated by the control unit 87 consistent with the instructions that have been executed. These outputs are also submitted to respective latches 141, 143 and 145 for writing back into the data cache memory 13 through a write back circuit 147 for port A of the memory and 149 for its port B. Switches 151 and 153 are operated to connect data from two of the three latches 141, 143 and 145 to the write back circuits of the data cache 13. It can be seen that only two of the three pipelines may access the data memory 13 at one time. But since a large proportion of instructions of a usual program do not require data memory access, this limitation does not prevent execution of three instructions at the same time in most instances.
- It will be recognized that, as with all pipelines, instructions are executed in sequence as they move through the pipelines from left to right of the block diagram of FIG. 2. One set of instructions stored in the latches 81, 83 and 85 is processed by the adders 89, 91 and 93 in one operating cycle, with the results stored in the latches of the operand fetch stage. As operands are read from the memory 13 in a second operating cycle, a second set of instructions is loaded into the latches 81, 83 and 85. The first set is executed in a third operating cycle, while a third set enters the pipelines, and its results are either written into a register 39 or moved to the output latches 141, 143 and 145 for writing back into the data memory 13 in a fourth operating cycle, during which a fourth set of instructions is loaded into the latches 81, 83 and 85.
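- That staggered, four-cycle overlap can be sketched as follows. The model is software-only and ignores holds and dependencies; it simply reports which set of instructions occupies which stage in each cycle.

```python
# Which instruction set is in which stage, cycle by cycle (no stalls modeled).
STAGES = ["AG", "OF", "EX", "WB"]

def schedule(num_sets, cycles):
    for cycle in range(cycles):
        row = {}
        for stage_idx, stage in enumerate(STAGES):
            s = cycle - stage_idx            # set index occupying this stage
            if 0 <= s < num_sets:
                row[stage] = f"set{s + 1}"
        print(f"cycle {cycle + 1}: {row}")

schedule(num_sets=4, cycles=7)
# cycle 1: {'AG': 'set1'}
# cycle 2: {'AG': 'set2', 'OF': 'set1'}
# cycle 4: {'AG': 'set4', 'OF': 'set3', 'EX': 'set2', 'WB': 'set1'} ...
```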
- Although the architecture conceptually illustrated in FIG. 2 has been described as three distinct pipelines, it will be recognized that, because of the three sets of switches 101/103/105, 129/131/133 and 151/153, a given instruction can travel through one stage in one pipeline, and through a subsequent stage in a different pipeline. This, in effect, dynamically creates, in response to the control unit 87 decoding the instructions and knowing the resources that each instruction needs, a separate pipeline for that instruction made up of one of the three possibilities for each stage that is consistent with the requirements of the instruction.
- Some examples of the configuration of the various stages of FIG. 2 to process various types of instructions will now be described in general. An adder of the AG stage, and thus also the path taken in the OF stage, are selected for a given instruction independently of selecting the ALU in the EX stage. For example, if an instruction requires an arithmetic operation, one of the full capability ALUs 123 or 125 is selected. Whether one of the full capability adders 89 or 91, and their respective access to the ports of the data cache memory 13, are required depends on whether an operand to be used by a selected ALU is to come from the memory 13. In many cases, however, the operands used by the selected ALU will come from the instruction itself, and/or the registers 39. In this latter case, the instruction reaches the ALU through the latches 81, 83 and 85 without use of the adders 89 or 91.
- Another example is an instruction for a move of data, in which case the move unit 127 is selected in the EX stage, if available, thereby leaving the full capability ALUs 123 and 125 free to execute other instructions. If the data to be moved is in one of the registers 39, then the control unit 87 causes the instruction to be sent directly to the move unit 127 through the latches 81, 83 and 85. If the data to be moved is instead read from the data memory 13, then one of the adders 89 or 91, with its access to the memory interfaces 111 and 113, respectively, is used in order to provide that read data to the input of the move unit 127 through the switch 133. In this case, the instruction flows through one of the two major pipelines until data is read from the cache memory 13, at which time that data is then given to the move unit 127 of the third, reduced capability pipeline.
- Similarly, if data is to be written into the cache memory 13 as part of a move instruction, one of the two write back circuits 147 or 149 is utilized, all under the direction of the control unit 87 decoding the individual instructions and setting the switches appropriately. Yet another example is the processing of a jump instruction, which is processed almost entirely by the lesser capability adder 93.
- It will be noted, as mentioned earlier, that the instructions are loaded into the latches 81, 83 and 85 in their order of execution, the routing to appropriate resources then being made by the control unit 87 setting the various switches, as described. Alternatively, the control unit 87 could cause these instructions to be loaded into the latches 81, 83 and 85 out of that order, with a corresponding adjustment of the switches.
- The embodiment of a three pipeline microprocessor conceptually described in FIG. 2 is given in more detail with respect to FIGS. 3-7. An overview of that implementation is given in FIG. 3. The stages of the pipeline include initial instruction decode (ID) and instruction issue (IS) stages that are common to each of the three parallel pipelines. A set of three instructions at a time is provided to the address generation (AG) stage, which also receives data from one or more of the registers 39 if so designated by an instruction being processed. Outputs 157-164 of the AG stage are applied to the operand fetch (OF) stage, which in turn provides any read operands, instructions and other data to an execution (EX) stage through circuits 167-174. The execution stage also receives data from one or more of the registers 39 if designated by an instruction being processed. The results of the processing of each set of three instructions are provided to write back (WB) stages. The WB stages cause the results of the instruction processing either to be written back to the cache memory 13, or to be sent as a jump address in a circuit 185 back to the instruction fetch block 17, or some combination of these possibilities among the three instructions that have been processed. The results of the instruction processing of the EX stage could be written back to one or more of the registers 39 in the WB stage, but the implementation being described writes to the registers 39 in the EX stage.
- Further details of the structure and operation of the cache memories 11 and 13 are given in a copending patent application.
- Referring to FIGS. 4, 5A and 5B, the instruction decode (ID) stage of the FIG. 3 microprocessor is given in more detail. Instructions are serially read from the instruction cache 11 and into a queue register 201. The system being described provides for the instructions having a variable number of bytes, depending primarily upon whether an individual instruction includes one or more bytes of address and/or one or more bytes of operand. It is therefore necessary to separate the steady stream of bytes into individual instructions. This is accomplished by tagging the bytes within the queue register 201 and then decoding the stream of bytes by decoding circuitry 203 in order to group the bytes of each instruction together as a unit. An output 205 of the decoding circuitry 203 carries the bytes of individually identified instructions to the next pipeline stage.
- FIGS. 5A and 5B illustrate how this level of decoding is accomplished. One or more bytes of instruction 207 are inputted at a time into one end of a logically defined shift register 201 from the instruction cache memory 11. The instruction bytes are read out of the shift register 201, one or more bytes 209 at a time. As instruction bytes are read out of the register 201, other bytes in it are shifted up through the register and new ones added to the bottom from the instruction cache 11. The register 201 in FIG. 5A is shown to have a width sufficient to contain a word, illustrated in FIG. 5B, that includes a byte 211 of instruction, a validity bit 213 and several control bits 215. The control bits 215 identify the first byte of each instruction and designate the number of bytes in the instruction. As these bytes are individually read out of the register 201, the decoder 203 identifies the beginning and ending byte of each instruction.
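- A rough software analogue of this byte-tagging scheme is given below. The tuple encoding of the validity bit and control bits is an assumption made for the example; the actual bit layout is not specified in the text above.

```python
# Group a tagged byte stream into instructions. Each queued byte carries a
# validity bit, a first-byte marker, and (on first bytes) the instruction
# length -- an assumed software encoding of the bits of FIG. 5B.

def split_instructions(tagged_bytes):
    """tagged_bytes: iterable of (byte, valid, is_first, length) tuples."""
    instructions, current, remaining = [], [], 0
    for byte, valid, is_first, length in tagged_bytes:
        if not valid:
            continue
        if is_first:
            current, remaining = [], length
        current.append(byte)
        remaining -= 1
        if remaining == 0 and current:
            instructions.append(bytes(current))    # one complete instruction
    return instructions

stream = [
    (0x89, 1, True, 2), (0xC8, 1, False, 0),                      # 2 bytes
    (0x05, 1, True, 3), (0x10, 1, False, 0), (0x00, 1, False, 0), # 3 bytes
]
print(split_instructions(stream))   # [b'\x89\xc8', b'\x05\x10\x00']
```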
- Various specific alternative structures of the queue register 201, and their operation, are given in copending patent application entitled "Improved Instruction Buffering Mechanism," of Kenneth K. Munson et al., filed Sep. 4, 1998, which application is incorporated herein in its entirety by this reference.
- These instructions are then arranged by the instruction issue (IS) stage in their order of execution. Shown in the IS stage of FIG. 4 are six latches 217-222, each of which is capable of storing the maximum number of bytes forming any instruction that is expected to be received by the stage. The three latches 217-219 present one set of three decoded instructions at a time to respective output circuits. As instructions are received from the instruction decoder 203, they are first loaded into the latches 220-222 and then individually moved up into the latches 217-219 as instructions are sent from the latches 217-219 out along the remaining stages of the pipeline. This shifting of instructions upward among the latches 217-222, as instructions are moved out of the latches 217-219, is accomplished by a set of multiplexers 225-229.
- Although it is a goal to send a set of three instructions each cycle from all of the latches 217-219 along the pipeline, there will be situations where one or two instructions of a set may be held and sent down the pipeline in the next cycle. Thus, for example, if only one instruction, in the latch 217, is sent down the pipeline in one cycle, the instructions in each of the remaining latches 218-222 are moved upward as part of that same cycle in order to reside in the latches 217-221, respectively. A new set of three instructions is then readied for entry into the next stage of the pipelines, and another instruction is loaded into the now empty latch 222 through the circuit 205. In a case where all three instructions in the latches 217-219 are sent down the pipeline in a single cycle, the instructions residing in the remaining latches 220-222 are moved up into the respective latches 217-219 in position to be sent down the pipeline during the next cycle.
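- The six-latch arrangement behaves like a small shifting queue, modeled roughly below. The class is an illustrative software sketch, not the circuit of FIG. 4.

```python
# Software model of latches 217-222: the top three positions hold the
# current candidate set; issued slots are refilled by shifting upward.

class IssueQueue:
    def __init__(self):
        self.latches = [None] * 6   # indices 0-2 ~ 217-219, 3-5 ~ 220-222

    def refill(self, decoder):
        # Pull decoded instructions into any empty positions, bottom-up.
        for i in range(6):
            if self.latches[i] is None and decoder:
                self.latches[i] = decoder.pop(0)

    def issue(self, count):
        # Send 1-3 instructions down the pipeline; shift the rest upward.
        sent = self.latches[:count]
        self.latches = self.latches[count:] + [None] * count
        return sent

q = IssueQueue()
decoder = ["i1", "i2", "i3", "i4", "i5", "i6", "i7"]
q.refill(decoder)
print(q.issue(1))   # ['i1'] -- only one instruction issued this cycle
q.refill(decoder)   # 'i7' fills the now-empty bottom latch
print(q.latches)    # ['i2', 'i3', 'i4', 'i5', 'i6', 'i7']
```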
- Each set of three instructions that is poised in the latches 217-219 for being sent down the pipeline is also inputted to the control unit 87. The control unit decodes the instructions in order to ascertain how many of the three instructions may be sent down the pipeline at the same time and to determine the resources that must be allocated in the stages downstream of the IS stage for processing each instruction. This is possible since there is a known set of instructions, although the number of instructions is rather large. In determining the resources required to process each instruction, and thus routing them individually through the subsequent stages, the control unit also notes and takes into account whether the instruction includes any address and/or operand bytes.
- The set of three instructions in the latches 217-219 is made available to respective latches 231, 233 and 235 of the AG stage. A signal from the control unit 87 causes those individual instructions to be latched, and thus stored, within the individual ones of the latches 231-235 that are to be sent down the pipeline together during that cycle. Any remaining instructions not latched into the latches 231-235 are held in the IS stage for a subsequent cycle.
- The primary components of the AG stage are three adders: a four input port adder 237, another four input port adder 239 and a much simpler, two input port adder 241. The results of the address calculations of these adders occur in respective outputs 158, 161 and 164. Instructions are routed to the adders 237 and 239 by respective multiplexers 243 and 245. The multiplexer 243 selects, in response to a control signal from the control unit 87, the instruction in either of the latches 231 or 233 for the adder 237. The multiplexer 245 serves a similar function with respect to the adder 239, selecting the instruction in either of the latches 233 or 235. A further multiplexer selects from any of the three instructions stored in the latches 231-235 to provide an input 249 to the adder 241.
- Each of the adders 237 and 239 receives up to four inputs. A component 253 of a selected instruction operates a multiplexer 255 to present at one of the input ports 257 to the adder 237 the contents of one of many registers 251 that are part of a standard microprocessor. Each of these registers contains a base address for a segment of memory in which certain types of data are stored. For example, a "CS" register contains the base address for a block of memory containing code, a "DS" register designates a base address of a block of memory for data, an "SS" register contains a base address for a block of memory used for a stack, and so forth.
- A second input port 259 to the adder 237 receives a displacement component of the instruction, if there is such an address component to the instruction being processed during a given cycle. A third input port 261 receives the content of one of the eight registers 39 as selected by a multiplexer 263 in response to a base offset portion 265 of the instruction. Similarly, a fourth input port 267 to the adder 237 is connectable to another one of the registers 39 through a multiplexer 269 in response to an index pointer 271 component of the instruction.
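- Taken together, the four input ports compute the familiar segmented effective-address sum: address = segment base + displacement + base register + index register. A sketch with invented register contents follows; only the port numbers are taken from the description above.

```python
# Effective-address formation as performed by the four ports of adder 237.
segment_registers = {"CS": 0x1000, "DS": 0x4000, "SS": 0x8000}
registers = {"r1": 0x0020, "r2": 0x0004}   # stand-ins for registers 39

def generate_address(segment, displacement, base_reg=None, index_reg=None):
    address = segment_registers[segment]       # port 257 (via mux 255)
    address += displacement                    # port 259
    if base_reg:
        address += registers[base_reg]         # port 261 (via mux 263)
    if index_reg:
        address += registers[index_reg]        # port 267 (via mux 269)
    return address

# operand at DS-relative displacement 0x10, base r1, index r2:
print(hex(generate_address("DS", 0x10, "r1", "r2")))   # 0x4034
```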
- The result at the output 158 of the adder 237 is an address within the cache 13 where an operand is to be found that is required to execute the instruction. This address is stored in a latch 273 within the next stage, the operand fetch (OF) stage. The adder 239 receives the same four inputs, although for a different one of the set of three instructions that are in the AG stage at the time, and similarly calculates another address in an output 161 that is stored in a latch 275.
- Another adder (not shown) can optionally be included within the AG stage as an auxiliary address generator to assist the adders 237 and 239.
- The third adder 241 shown in FIG. 6 is, in this specific example, dedicated to calculating an address within the instruction cache memory 11 from a jump instruction. Thus, one of its input ports 277 receives the contents of the CS register within the group of registers 251, while a second input 249 receives a relative offset component of an address within the code segment of memory. A jump address calculated by the adder 241 appears at its output 164, which is then stored in a latch 279 at the beginning of the next OF stage.
- In addition, the AG stage selects, by a multiplexer 281, the data from one of the instructions stored in the latches 231-235 and passes the instructions and any associated data forward through further latches to the following stages, along with data read from the registers 39.
- The primary operation occurring in the OF stage is to read up to two operands from the data memory 13, located at the addresses stored in the latches 273 and 275. Memory interface circuits 295 and 297 provide such access respectively to the A and B ports of the data cache 13. The result is up to two operands read from the data memory 13, if indeed a given set of instructions present in the OF stage calls for two such operands. There may be cases where only one operand is fetched or, more unusually, where no operand is fetched by these stages.
- It will be noted that the address outputs of the principal adders 237 and 239 are respectively connected with the A and B ports of the data cache memory 13. No multiplexing is provided to alter this connection since that element of flexibility is not required. The entire data cache memory 13 may be accessed through either of its ports A or B. The third adder 241, of course, does not form an address for the memory 13.
- The next processing stage, the execution (EX) stage, has eight input latches 301-308 that store, in the next operational cycle, the contents of the circuits 167-174. This stored information is available for use by full capability ALUs 311 and 313, and by a specialized unit 315 that moves data between the registers 39 and the data cache 13, or between individual ones of the registers 39. The move unit 315, in effect, is a single input port, limited capability ALU. The ALU 311 has two input ports that receive data selected by respective multiplexers. The ALU 313 has a corresponding two input ports supplied by respective multiplexers, and is provided, in this particular example, with a third input port 333 that is connected to the output of a multiplexer 323, for reasons described below. The data move unit 315 has a single input port 335 from an output of a multiplexer 337.
- The inputs to each of the multiplexers supplying the ALU 311 are the same. A multiplexer 343 selects data from one of the registers 39 and provides it as one of the respective inputs to each of these multiplexers. The remaining four inputs to each of the multiplexers are taken from the latches 301-308.
- Each of the multiplexers supplying the ALU 313 is similarly connected in order to provide that ALU with a similar range of potential inputs. One input of each of these multiplexers receives data from one of the registers 39 that is selected, by one of the instructions within the latches 301-308, through a multiplexer 349. The remaining four inputs of each of these multiplexers are likewise taken from the latches 301-308.
- The multiplexer 337, which selects the input 335 to the move unit 315, similarly has an input connected to a multiplexer 351 that selects data from one of the registers 39, in response to the contents of any one of the three instructions stored in the latches 301-308. The remaining inputs to the multiplexer 337 are the same as those of the other multiplexers described above, namely, the contents of four of the latches 301-308.
- The data outputs of each of the ALUs 311 and 313, and of the move unit 315, are stored in a next cycle in individual ones of latches 361, 363 and 365, the connections between the units and the latches being made by respective multiplexers. The latch 361 may receive the data output of either the ALU 311 or the move unit 315. The latch 363 may receive the output of any of the three units 311, 313 and 315 through its multiplexer. The latch 365 receives the data output of either the ALU 313 or the move unit 315.
- Since the outputs of the ALUs and move unit can be directed to any of the latches 361, 363 and 365, an instruction that can be executed by the move unit 315 can be routed to the move unit 315 without tying up a more complex ALU 311 or 313, while its result is still directed to the appropriate latch, such as the latch 361, to take its place in the same order as when launched by the IS stage.
- In the last WB stage of the pipeline, one of the two executed results stored in the latches 361 and 363 is selected by a multiplexer 373 for writing back into the data cache memory 13 through its port A. Similarly, a multiplexer 375 can connect either of the executed results within the latches 363 and 365 to the cache memory 13 port B. Of course, the executed data results are sent to the memory 13 only when they are to be stored in it.
- If any of the data results are to be stored in the registers 39, this occurs within the EX stage, the resultant data being applied to the registers 39 through respective multiplexers.
- As previously noted, the ALU 313 is unusual in that it has a third input port 333 rather than the more conventional two input ports of the ALU 311. This added input port allows successive instructions to be processed together in parallel through two different pipelines when the second instruction requires data for its execution that is the result of executing the first instruction.
- For example, consider a first instruction that calls for adding the value of a number in register r1 to the value of a number at a given location in the data memory 13 and then writing the result back into the register r1, and a second instruction that requires reading that new result from the register r1 and then subtracting it from the value stored in register r4. Since the second instruction is dependent upon the first, the second instruction is typically held at the beginning of the pipeline for one operational cycle while the first instruction is processed. Enough time must elapse to allow the first instruction to write the new value into the register r1 before the second instruction causes it to be read.
- However, by providing the third port 333 to the ALU 313, and by allowing it to be connected through the multiplexer 323 to a data source different from those of its other two input ports, the second instruction can be executed at the same time by giving the ALU 313 the two operands that are specified to be used by the first instruction. That is, rather than the ALU receiving an input that is the result of execution of the first instruction, it receives in two inputs the operands which were used to generate that result. In the example given above, two of the inputs of the ALU 313 are given the original data in r1 and that in memory, which are called for by the first instruction, while the third input receives the data in the register r4. Both instructions are then executed at the same time by the ALU 313. This technique of using a three input port ALU provides these advantages with a microprocessor having only two pipelines as well as in the improved three pipeline architecture being described. This feature is described in more detail in copending patent application Ser. No. 09/128,164, filed Aug. 3, 1998, which application is expressly incorporated herein in its entirety by this reference.
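- A software analogue of the three-input technique, using the r1/r4 example above with invented values, shows both results being produced in a single pass:

```python
# Instead of serializing
#   r1 = r1 + mem_val      (first instruction)
#   r4 = r4 - r1           (second, dependent instruction)
# the three-input ALU receives the first instruction's source operands plus
# r4 and produces both results together. Software analogue only.

def fused_execute(r1_old, mem_val, r4_old):
    r1_new = r1_old + mem_val      # what the first instruction produces
    r4_new = r4_old - r1_new       # the second instruction, using that result
    return r1_new, r4_new

regs = {"r1": 5, "r4": 20}
regs["r1"], regs["r4"] = fused_execute(regs["r1"], 3, regs["r4"])
print(regs)   # {'r1': 8, 'r4': 12}
```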
- As can be seen from the foregoing description of a multi-pipeline microprocessor architecture, there is an extreme amount of flexibility available to the control unit 87 for routing instructions in order to maximize the throughput of the microprocessor. With reference to the flowchart of FIG. 8, a preferred operation of the microprocessor embodiment of FIGS. 3-7 is given. In a first step 401, the latches 217-219 of the IS stage (FIG. 4) are loaded with a set of three instructions that are candidates for being executed in parallel through three different pipelines of the microprocessor. The control unit 87 examines each of the three instructions, in a step 403, to determine whether any of the three instructions depends upon the results of any of the others in a manner that would prevent all three instructions from being executed in parallel. This is commonly done now with two pipeline microprocessors, so the same techniques are extended to examining three instructions at one time instead of just two. If there is any such dependency, the control unit 87 flags the dependent instruction so that it will not be loaded into the respective one of the latches 231-235 but is instead held for a later cycle, as indicated by a step 405 of FIG. 8. Of course, there will be fewer dependencies that can hold back parallel execution of instructions with the use of the three input port ALU 313 (FIG. 7) of one aspect of the present invention. If there are no unresolvable dependencies among the three instructions loaded in the latches 217-219, the step 405 is omitted.
- Regardless of the resolution of dependencies, there will at least be an instruction in the latch 217 that can be executed. A next step 407 designates that first instruction for examination, and a step 409 causes the control unit 87 to decode the instruction so that it may be determined what pipeline resources are necessary to execute it.
- A step 411 determines whether the instruction requires access to read an operand from the cache memory 13 and, if so, directs it to a full adder. If not, the reduced capability adder 241 may be used with the instruction. Details of this are shown in the flow diagram of FIG. 9, as described below.
- Another step 413 looks at the type of ALU that is required to execute the first instruction of the set that is stored in the latch 217, and assigns to it either a full capability ALU, the move unit 315, or nothing if an ALU is not required to execute the instruction. Details of the step 413 are provided in the flow diagram of FIG. 10, as described below. The steps 411 and 413 are performed for each instruction of the set.
- A next step 415 asks whether all three instructions of the set stored in the latches 217-219 (FIG. 4) have been assigned resources or held by the control unit 87. If not, a step 417 causes the steps 409, 411 and 413 to be repeated for the next instruction; the first pass examines the instruction in the latch 217, so the steps are then repeated for the instruction in the latch 218. Once each of the three instructions of the set has been assigned resources, or designated to be held for a cycle, a final step 419 indicates that the switching instructions to the various multiplexers in the several pipeline stages will be issued at the appropriate times for processing each of these three instructions as they work their way through the stages of the pipelines. After that is completed, the control unit 87 returns to the step 401 by causing the next three instructions to be loaded into the latches 217-219 in the manner previously described with respect to FIG. 4.
- It will be noted that at the time the control unit 87 is examining and assigning resources to the set of three instructions, other instructions earlier examined are being processed by other pipeline stages. Therefore, the resources that are allocated for a particular instruction are stored by the control unit 87 until that instruction has worked its way down to the stage where the resource must be provided. For example, an adder of the AG stage must be provided one cycle time after the assignment is made, so the multiplexers of the AG stage are appropriately switched at that next operational cycle. Similarly, the ALU or move unit that is assigned to a particular instruction is actually not connected to receive the instruction until at least three cycle times later, since the EX unit is three stages downstream from the IS stage.
- It will be noted from FIGS. 4-7 that the control circuit 87 provides control signals to the various multiplexers, latches and other components as the result of decoding the instructions being executed. One aspect of the control unit 87 is described in copending patent application Ser. No. 09/088,226, filed Jun. 1, 1998, which application is expressly incorporated herein in its entirety by this reference.
- Referring to FIG. 9, the algorithm for executing the step 411 of FIG. 8 is shown in more detail. A step 421 first determines whether the instruction being examined requires memory access, and thus one of the full capability adders 237 or 239. If so, a next step 423 determines whether a full capability adder is available. If this is the first or second of the set of three instructions to be examined, then a full capability adder will be available, but if it is the third instruction, it needs to be determined whether both full capability adders 237 and 239 have already been assigned. If so, a next step 425 shows that the instruction is flagged to be held for one operational cycle, in a manner described previously. If one of the full capability adders 237 or 239 is available, a next step 427 assigns the first available one to receive the instruction being examined.
- Returning to the initial step 421 of FIG. 9, if the instruction is such that it does not need a full capability adder, a next step 429 determines whether the instruction needs the reduced capability adder 241. If so, it is then asked whether the adder 241 is available, in a step 431. If not, the processing proceeds to the step 425 to hold that instruction for the next cycle. If the adder 241 is available, however, a next step 433 assigns it to the instruction being examined. Returning to the step 429, if the instruction does not need the adder 241, then the processing of the step 411 of FIG. 8 is completed.
- Referring to FIG. 10, a similar flowchart is provided for the step 413 of FIG. 8. A first step 441 of FIG. 10 asks whether the instruction being analyzed needs one of the full ALUs 311 or 313 to be executed. If so, a next step 443 asks whether one of them is available and, if so, one is assigned to this instruction by a step 445. If neither of the ALUs 311 and 313 is available, the processing proceeds to a step 447, and that instruction is held within the IS stage to be sent down the pipeline in the next execution cycle.
- Returning to the step 441, if the instruction does not need one of the full capability ALUs 311 or 313, a next step 449 determines whether the instruction requires the move unit 315 for execution. If not, the processing of the step 413 of FIG. 8 is completed. But if the instruction does need the move unit 315, a next step 451 asks whether it is available and, if so, assigns it to receive that instruction at the later time, in a step 453. However, if the move unit is determined in the step 451 not to be available, because it has been assigned to a previous instruction of the set, processing returns to the step 443 to ascertain whether one of the full capability ALUs 311 or 313 is available to execute the instruction. If so, one of them is assigned to it even though the instruction does not need that much capability, in order to increase the number of instructions that are being processed in parallel at all times.
- As one implementation detail of the microprocessor of FIGS. 3-7, techniques for distributing clock signals to various circuit portions are given in copending patent application entitled "Improved Clock Distribution System," of Sathyanandan Rajivan, filed Sep. 11, 1998, which application is incorporated herein in its entirety by this reference.
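- The adder and ALU assignment logic of FIGS. 9 and 10 can be condensed into the following software sketch. The resource names (adder A/B/C, ALU 1/2) and the instruction descriptor fields are illustrative only; the step numbers in the comments refer to the flowcharts described above.

```python
# Per-instruction resource assignment in the spirit of FIGS. 9 and 10.

def assign_resources(inst, free):
    """inst: dict with 'needs_memory', 'needs_simple_adder' and 'alu' keys.
    free: set of available units. Returns a list of units, or 'hold'."""
    assigned = []
    if inst["needs_memory"]:                        # FIG. 9, steps 421-427
        adder = next((a for a in ("adder A", "adder B") if a in free), None)
        if adder is None:
            return "hold"                           # step 425
        assigned.append(adder)
    elif inst["needs_simple_adder"]:                # steps 429-433
        if "adder C" not in free:
            return "hold"
        assigned.append("adder C")
    if inst["alu"] == "full":                       # FIG. 10, steps 441-447
        alu = next((a for a in ("ALU 1", "ALU 2") if a in free), None)
        if alu is None:
            return "hold"
        assigned.append(alu)
    elif inst["alu"] == "move":                     # steps 449-453, falling
        for a in ("move unit", "ALU 1", "ALU 2"):   # back to a free full ALU
            if a in free:
                assigned.append(a)
                break
        else:
            return "hold"
    for a in assigned:
        free.discard(a)
    return assigned

free = {"adder A", "adder B", "adder C", "ALU 1", "ALU 2", "move unit"}
load_add = {"needs_memory": True, "needs_simple_adder": False, "alu": "full"}
reg_move = {"needs_memory": False, "needs_simple_adder": False, "alu": "move"}
print(assign_resources(load_add, free))   # ['adder A', 'ALU 1']
print(assign_resources(reg_move, free))   # ['move unit']
```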
- Although the various aspects of the present invention have been described with respect to its preferred embodiments, it will be understood that the invention is entitled to protection within the full scope of the appended claims.
Claims (28)
1. A microprocessor comprising:
an instruction decoding stage that provides three sequences of decoded instructions, one set of three instructions at a time,
a data memory with only two ports,
three multi-staged pipelines receiving and processing in parallel the three sequences of decoded instructions provided by the instruction decoding stage, and
a control circuit responsive to an individual set of three instructions for dynamically connecting the two memory ports to any two of the pipelines to which instructions of the individual set requiring access to the memory are being sent while an instruction of the individual set not requiring access to the memory is sent through another of the pipelines.
2. The microprocessor of claim 1, which includes exactly three multi-staged pipelines, and wherein each set of instructions includes exactly three instructions.
3. The microprocessor of claim 1, wherein the instruction of the individual set not requiring access to the memory includes a jump instruction.
4. The microprocessor of claim 1, wherein the instruction of the individual set not requiring access to the memory includes an instruction to move data between two of a plurality of registers.
5. The microprocessor of claim 1, wherein the instruction of the individual set not requiring access to the memory includes an instruction to perform arithmetic or logic operations on data in two of a plurality of registers.
6. The microprocessor of claim 1, wherein each of the three pipelines includes an address generation stage and an instruction execution stage, the address generation and instruction execution stages of one of the three pipelines having significantly less capability than those of the other two of the three pipelines, whereby space and power are conserved by said one of the three pipelines.
7. The microprocessor of claim 1, additionally including a set of registers from which data is read and into which data is written by each of the three pipelines.
8. A microprocessor, comprising:
an instruction decoding stage that provides three sequences of decoded instructions, one set of three instructions at a time,
three multi-staged pipelines receiving and processing in parallel the three sequences of decoded instructions provided by the instruction decoding stage,
two arithmetic logic units,
a move unit, and
a control circuit responsive to an individual set of three instructions for dynamically connecting the two arithmetic logic units individually in any two of the three pipelines in order to accept instructions of the individual set requiring an arithmetic logic unit to execute while the move unit is connectable to another of the pipelines which accepts an instruction of the individual set not requiring an arithmetic logic unit to execute.
9. The microprocessor of claim 8, which includes exactly three multi-staged pipelines, and wherein each set of instructions includes exactly three instructions.
10. The microprocessor of claim 8, wherein the instruction of the individual set that is accepted by said another of the pipelines includes a jump instruction.
11. The microprocessor of claim 8, wherein the instruction of the individual set that is accepted by said another of the pipelines includes instructions to move data between two of a plurality of registers and instructions to move data between one of the plurality of registers and a memory.
12. The microprocessor of claim 8, additionally including a set of registers from which data is read and into which data is written by each of the three pipelines.
13. A microprocessor, comprising:
a number of pipelines in excess of two that are operated in parallel, each of the plurality of pipelines having a plurality of pipeline stages that executes instructions in steps along its stages,
a number of data memory access ports at least one less than the number of pipelines,
a switching circuit that individually connects the data memory ports with selected stages of any of a number of the plurality of pipelines at least one more than the number of data memory access ports at different times when necessary to execute instructions being processed by the pipelines, and
at least one remaining pipeline to which the data memory is not connected at one of said times being capable of executing instructions not requiring memory access.
14. The microprocessor of claim 13, additionally comprising:
a number of arithmetic logic units at least one less than the number of pipelines,
said switching circuit additionally individually connecting the arithmetic logic units into one of the stages of any of a number of the plurality of pipelines at least one more than the number of arithmetic logic units at different times when necessary to execute instructions being processed by the pipelines, and
at least one remaining pipeline to which an arithmetic logic unit is not connected at one of said times being capable of executing instructions not requiring an arithmetic logic unit.
15. The microprocessor of claim 14, which additionally comprises a move unit that is connectable into said remaining at least one pipeline for moving data between ones of a plurality of registers or between one of the registers and a memory.
16. A microprocessor, comprising:
a number of pipelines in excess of two that are operated in parallel, each of the plurality of pipelines having a plurality of pipeline stages that executes instructions in steps along its stages,
a number of arithmetic logic units at least one less than the number of pipelines,
a switching circuit that individually connects the arithmetic logic units into one of the stages of any of a number of the plurality of pipelines at least one more than the number of arithmetic logic units at different times when necessary to execute instructions being processed by the pipelines, and
at least one remaining pipeline to which an arithmetic logic unit is not connected at one of said times being capable of executing instructions not requiring an arithmetic logic unit.
17. The microprocessor of claim 16, which additionally comprises a move unit that is connectable into said remaining at least one pipeline for moving data between ones of a plurality of registers or between one of the registers and a memory.
18. A microprocessor formed on a single integrated circuit chip, comprising:
an instruction memory adapted to provide a sequence of instructions to be executed,
an instruction issuing stage coupled to the instruction memory for making a set of three instructions stored therein available in parallel during a common interval for processing,
a data memory having first and second ports for simultaneous access therethrough to read operands therefrom,
three address generation stages, two of said address generation stages having individual outputs connected to address the data memory respectively through said first and second ports thereof and read operands therefrom, a remaining one of the address generation stages not having access to read operands stored in the data memory,
three arithmetic logic unit (ALU) stages, one of said three ALUs having less processing capability than the other two of said three ALUs, and
an interconnection circuit responsive to each set of three instructions made available by the instruction issuing stage (a) for routing up to two of the three instructions needing operands from the data memory through the two address generation stages having outputs connected to address the data memory, (b) for connecting two operands read from the data memory to any two of the ALUs having sufficient processing capability to execute their associated instructions, and (c) for routing a remaining one of the three instructions not requiring an operand either to a remaining one of the address generation stages or a remaining one of the ALUs, thereby to process the set of three instructions in parallel.
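The three routing rules (a) through (c) of claim 18 can be illustrated with a short sketch. The function and field names below are assumptions made for this example only.

```python
# Illustrative model of the claim-18 interconnection circuit.
def route_set(triplet: list[tuple[str, bool]]) -> dict[int, str]:
    """triplet: (mnemonic, needs_memory_operand) per instruction.
    (a) up to two memory-operand instructions go to AGU0/AGU1, whose
    outputs address the data memory's two ports; (b) the operands read
    then feed any two sufficiently capable ALUs; (c) the remaining
    instruction routes to the limited third AGU or third ALU."""
    ported = ["AGU0->port1", "AGU1->port2"]
    routing = {}
    for i, (op, needs_mem) in enumerate(triplet):
        if needs_mem:
            if not ported:
                raise ValueError("at most two memory operands per set")
            routing[i] = ported.pop(0)
        else:
            routing[i] = "AGU2/ALU2 (limited, no memory access)"
    return routing

print(route_set([("add", True), ("jmp", False), ("sub", True)]))
# -> {0: 'AGU0->port1', 1: 'AGU2/ALU2 (limited, no memory access)',
#     2: 'AGU1->port2'}
```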
19. The microprocessor of claim 18, wherein the data memory and instruction memory are separate from each other.
20. The microprocessor of claim 18, additionally comprising a plurality of registers, the contents of which are readable by at least some of the address generation and ALU stages.
21. A method of processing a sequence of computer instructions with access to data stored in a memory through only a given number of parallel access ports, comprising:
reviewing in a single interval each of a set of a number of instructions at least one more than the given number,
calculating a memory address from each of no more than the given number of instructions in the set that require data from the memory,
reading data from the memory at the calculated addresses through the given number of ports,
executing those of the set of instructions having data that have been read from the memory, and
depending upon the type of at least one of the set of instructions in excess of the given number that does not need data from memory, either (a) concurrently with said address calculating operation, calculating from said excess instruction an address of another instruction, or (b) concurrently with executing those of the set of instructions having data read from the memory, executing said excess instruction.
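A worked instance of the claim-21 method, taking the given number to be two (as in claim 22 below), might look like the following sketch. The memory contents, instruction encoding, and printing are invented for illustration.

```python
# One interval of the claim-21 method: three instructions reviewed,
# two memory ports available.
MEM = {0x10: 7, 0x14: 5}  # hypothetical data memory

def process_interval(instrs):
    mem_ops = [i for i in instrs if i["kind"] == "mem"]
    excess = [i for i in instrs if i["kind"] != "mem"]
    assert len(mem_ops) <= 2, "no more than two memory instructions per set"
    # Address calculation and reads through the two ports.
    data = [MEM[i["addr"]] for i in mem_ops]
    # Option (a): a jump's target address is computed concurrently with
    # the operand-address calculation; option (b): a register-only
    # instruction executes concurrently with the memory instructions.
    for i in excess:
        if i["kind"] == "jump":
            print("next set fetched from", hex(i["target"]))
        else:
            print("executed without memory access:", i["op"])
    print("operands read:", data)

process_interval([
    {"kind": "mem", "addr": 0x10},
    {"kind": "jump", "target": 0x40},
    {"kind": "mem", "addr": 0x14},
])
```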
22. The method according to claim 21, wherein said given number is two.
23. The method according to claim 21, wherein the excess instruction is a jump instruction, and wherein the address of another instruction calculated from the excess instruction is subsequently used to designate another set of instructions that are reviewed in a subsequent interval.
24. The method according to claim 21, wherein the excess instruction is a move instruction that is executed to move data between individual ones of a plurality of registers.
25. The method according to claim 21, wherein the excess instruction is an instruction to perform arithmetic or logic operations on data in two of a plurality of registers.
26. A method of executing a sequence of computer instructions by a processor having a plurality of registers, a given number of arithmetic logic units (ALUs), and access to a memory, comprising:
reviewing in a single interval each of a set of a number of instructions at least one more than the given number,
executing a given number of said set of instructions during a subsequent interval by use of the given number of ALUs, thereby to leave at least one of the set of instructions that is not being executed by one of the ALUs during the subsequent interval, and
depending upon the type of said at least one instruction not being executed by one of the ALUs during the subsequent interval, either (a) executing a jump to a new set of instructions, or (b) moving data between two registers, or (c) moving data between one of the registers and the memory.
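The three-way dispatch of the leftover instruction in claim 26 can be sketched as below; the register file, memory, and opcode names are hypothetical.

```python
# Minimal dispatch for the claim-26 leftover instruction (the one not
# occupying an ALU in the subsequent interval); cases mirror (a)-(c).
regs = {"r1": 3, "r2": 0}
mem = {0x20: 9}

def execute_leftover(instr):
    if instr["op"] == "jump":       # (a) jump to a new set of instructions
        return ("fetch_from", instr["target"])
    if instr["op"] == "mov_rr":     # (b) register-to-register move
        regs[instr["dst"]] = regs[instr["src"]]
    elif instr["op"] == "load":     # (c) memory-to-register move
        regs[instr["dst"]] = mem[instr["addr"]]
    return ("regs", dict(regs))

print(execute_leftover({"op": "mov_rr", "dst": "r2", "src": "r1"}))
# -> ('regs', {'r1': 3, 'r2': 3})
```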
27. The method according to claim 21, wherein said given number is two.
28. A microprocessor on a single integrated circuit chip, comprising:
an instruction cache memory for storing instructions to be processed,
an instruction fetch stage that accesses the instruction cache memory to obtain instructions therefrom in a sequence in which the instructions are to be executed,
an instruction queue stage receiving instructions from the instruction fetch stage for storing three sequential instructions at a time for processing,
first, second and third address generating stages that each include adder circuits, the adder circuit of the third address generating stage having fewer input ports than the adder circuits of each of the first and second address generating stages,
a data cache memory for storing operands used in processing instructions and for storing results of processing instructions, the data cache memory having first and second parallel access ports that are connected to receive addresses calculated by the adders of the first and second address generating stages, respectively, and provide respective first and second operands from the data cache memory in response, the third address generating stage having no access to the data cache memory,
a circuit connecting an output of the adder of the third address generation stage to the instruction fetch stage for designating an address of an instruction to be read from the instruction cache memory,
first, second and third instruction execution stages that each include respective first, second and third arithmetic logic units (ALUs) with the third ALU having fewer input ports than either of the first or second ALUs,
circuits connected to outputs of the ALUs for writing results of instruction processing thereby into the registers or into the data cache memory through its said first and second ports,
a plurality of registers connected to provide data inputs to the adder circuits and each of the first, second and third ALUs, and to receive data from the writing circuits, and
a control circuit that routes instructions stored in the instruction queue stage into the first, second and third address generating stages and the first, second and third instruction execution stages in a manner that instructions requiring operands from the data cache memory are not routed to the third address generating stage and a limited set of instructions are routed to the third instruction execution stage.
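To make the claim-28 routing constraint concrete, the following check expresses the two restrictions the control circuit enforces. The particular contents of the limited instruction set are not specified by the patent, so the `LIMITED_OPS` set here is an assumption.

```python
# Legality check mirroring the claim-28 routing rule: memory-operand
# instructions never reach the third address-generating stage, and only
# a limited instruction set reaches the third execution stage.
LIMITED_OPS = {"mov", "jmp", "nop"}  # hypothetical "limited set"

def legal_route(instr_op: str, needs_mem: bool, agu: int, exu: int) -> bool:
    if agu == 2 and needs_mem:       # third AGU cannot address the data cache
        return False
    if exu == 2 and instr_op not in LIMITED_OPS:
        return False                 # third ALU has fewer input ports
    return True

print(legal_route("add", True, agu=0, exu=0))   # True
print(legal_route("add", True, agu=2, exu=2))   # False
```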
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/842,108 US20010014940A1 (en) | 1998-04-20 | 2001-04-26 | Dynamic allocation of resources in multiple microprocessor pipelines |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US6280498A | 1998-04-20 | 1998-04-20 | |
US09/151,634 US6304954B1 (en) | 1998-04-20 | 1998-09-11 | Executing multiple instructions in multi-pipelined processor by dynamically switching memory ports of fewer number than the pipeline |
US09/842,108 US20010014940A1 (en) | 1998-04-20 | 2001-04-26 | Dynamic allocation of resources in multiple microprocessor pipelines |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/151,634 Division US6304954B1 (en) | 1998-04-20 | 1998-09-11 | Executing multiple instructions in multi-pipelined processor by dynamically switching memory ports of fewer number than the pipeline |
Publications (1)
Publication Number | Publication Date |
---|---|
US20010014940A1 true US20010014940A1 (en) | 2001-08-16 |
Family
ID=26742710
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/151,634 Expired - Fee Related US6304954B1 (en) | 1998-04-20 | 1998-09-11 | Executing multiple instructions in multi-pipelined processor by dynamically switching memory ports of fewer number than the pipeline |
US09/842,026 Expired - Fee Related US6408377B2 (en) | 1998-04-20 | 2001-04-26 | Dynamic allocation of resources in multiple microprocessor pipelines |
US09/842,108 Abandoned US20010014940A1 (en) | 1998-04-20 | 2001-04-26 | Dynamic allocation of resources in multiple microprocessor pipelines |
US09/842,107 Expired - Fee Related US6341343B2 (en) | 1998-04-20 | 2001-04-26 | Parallel processing instructions routed through plural differing capacity units of operand address generators coupled to multi-ported memory and ALUs |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/151,634 Expired - Fee Related US6304954B1 (en) | 1998-04-20 | 1998-09-11 | Executing multiple instructions in multi-pipelined processor by dynamically switching memory ports of fewer number than the pipeline |
US09/842,026 Expired - Fee Related US6408377B2 (en) | 1998-04-20 | 2001-04-26 | Dynamic allocation of resources in multiple microprocessor pipelines |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/842,107 Expired - Fee Related US6341343B2 (en) | 1998-04-20 | 2001-04-26 | Parallel processing instructions routed through plural differing capacity units of operand address generators coupled to multi-ported memory and ALUs |
Country Status (1)
Country | Link |
---|---|
US (4) | US6304954B1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040225868A1 (en) * | 2003-05-07 | 2004-11-11 | International Business Machines Corporation | An integrated circuit having parallel execution units with differing execution latencies |
CN107168827A (en) * | 2017-07-05 | 2017-09-15 | 首都师范大学 | Dual redundant streamline and fault-tolerance approach based on checkpoint technology |
CN111860805A (en) * | 2019-04-27 | 2020-10-30 | 中科寒武纪科技股份有限公司 | Fractal calculation device and method, integrated circuit and board card |
CN112379928A (en) * | 2020-11-11 | 2021-02-19 | 海光信息技术股份有限公司 | Instruction scheduling method and processor comprising instruction scheduling unit |
US11841822B2 (en) | 2019-04-27 | 2023-12-12 | Cambricon Technologies Corporation Limited | Fractal calculating device and method, integrated circuit and board card |
Families Citing this family (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6295592B1 (en) * | 1998-07-31 | 2001-09-25 | Micron Technology, Inc. | Method of processing memory requests in a pipelined memory controller |
US6272609B1 (en) * | 1998-07-31 | 2001-08-07 | Micron Electronics, Inc. | Pipelined memory controller |
US7254696B2 (en) * | 2002-12-12 | 2007-08-07 | Alacritech, Inc. | Functional-level instruction-set computer architecture for processing application-layer content-service requests such as file-access requests |
US7581210B2 (en) * | 2003-09-10 | 2009-08-25 | Hewlett-Packard Development Company, L.P. | Compiler-scheduled CPU functional testing |
US7206969B2 (en) * | 2003-09-10 | 2007-04-17 | Hewlett-Packard Development Company, L.P. | Opportunistic pattern-based CPU functional testing |
US7213170B2 (en) * | 2003-09-10 | 2007-05-01 | Hewlett-Packard Development Company, L.P. | Opportunistic CPU functional testing with hardware compare |
US7613961B2 (en) * | 2003-10-14 | 2009-11-03 | Hewlett-Packard Development Company, L.P. | CPU register diagnostic testing |
US7415700B2 (en) * | 2003-10-14 | 2008-08-19 | Hewlett-Packard Development Company, L.P. | Runtime quality verification of execution units |
US7206966B2 (en) * | 2003-10-22 | 2007-04-17 | Hewlett-Packard Development Company, L.P. | Fault-tolerant multi-core microprocessing |
US20050223172A1 (en) * | 2004-03-31 | 2005-10-06 | Ulrich Bortfeld | Instruction-word addressable L0 instruction cache |
US7840953B2 (en) * | 2004-12-22 | 2010-11-23 | Intel Corporation | Method and system for reducing program code size |
US20060200651A1 (en) * | 2005-03-03 | 2006-09-07 | Collopy Thomas K | Method and apparatus for power reduction utilizing heterogeneously-multi-pipelined processor |
US20080159145A1 (en) * | 2006-12-29 | 2008-07-03 | Raman Muthukrishnan | Weighted bandwidth switching device |
US8230410B2 (en) * | 2009-10-26 | 2012-07-24 | International Business Machines Corporation | Utilizing a bidding model in a microparallel processor architecture to allocate additional registers and execution units for short to intermediate stretches of code identified as opportunities for microparallelization |
US8639884B2 (en) * | 2011-02-28 | 2014-01-28 | Freescale Semiconductor, Inc. | Systems and methods for configuring load/store execution units |
US9547593B2 (en) | 2011-02-28 | 2017-01-17 | Nxp Usa, Inc. | Systems and methods for reconfiguring cache memory |
JP5922353B2 (en) * | 2011-08-22 | 2016-05-24 | サイプレス セミコンダクター コーポレーション | Processor |
JP6225554B2 (en) * | 2013-08-14 | 2017-11-08 | 富士通株式会社 | Arithmetic processing device and control method of arithmetic processing device |
US9003109B1 (en) * | 2014-05-29 | 2015-04-07 | SanDisk Technologies, Inc. | System and method for distributed computing in non-volatile memory |
US11010166B2 (en) * | 2016-03-31 | 2021-05-18 | Intel Corporation | Arithmetic logic unit with normal and accelerated performance modes using differing numbers of computational circuits |
GB2593740A (en) * | 2020-03-31 | 2021-10-06 | Cmr Surgical Ltd | Control system of a surgical robot |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4295193A (en) * | 1979-06-29 | 1981-10-13 | International Business Machines Corporation | Machine for multiple instruction execution |
EP0239081B1 (en) * | 1986-03-26 | 1995-09-06 | Hitachi, Ltd. | Pipelined data processor capable of decoding and executing plural instructions in parallel |
US5206940A (en) * | 1987-06-05 | 1993-04-27 | Mitsubishi Denki Kabushiki Kaisha | Address control and generating system for digital signal-processor |
US5692139A (en) * | 1988-01-11 | 1997-11-25 | North American Philips Corporation, Signetics Div. | VLIW processing device including improved memory for avoiding collisions without an excessive number of ports |
US5333280A (en) * | 1990-04-06 | 1994-07-26 | Nec Corporation | Parallel pipelined instruction processing system for very long instruction word |
JP3544214B2 (en) | 1992-04-29 | 2004-07-21 | サン・マイクロシステムズ・インコーポレイテッド | Method and system for monitoring processor status |
US5696935A (en) * | 1992-07-16 | 1997-12-09 | Intel Corporation | Multiported cache and systems |
EP0950946B1 (en) * | 1993-11-05 | 2001-08-16 | Intergraph Corporation | Software scheduled superscaler computer architecture |
US6216200B1 (en) * | 1994-10-14 | 2001-04-10 | Mips Technologies, Inc. | Address queue |
US5761475A (en) * | 1994-12-15 | 1998-06-02 | Sun Microsystems, Inc. | Computer processor having a register file with reduced read and/or write port bandwidth |
US5924117A (en) * | 1996-12-16 | 1999-07-13 | International Business Machines Corporation | Multi-ported and interleaved cache memory supporting multiple simultaneous accesses thereto |
US5809514A (en) | 1997-02-26 | 1998-09-15 | Texas Instruments Incorporated | Microprocessor burst mode data transfer ordering circuitry and method |
US5913049A (en) * | 1997-07-31 | 1999-06-15 | Texas Instruments Incorporated | Multi-stream complex instruction set microprocessor |
US6263424B1 (en) * | 1998-08-03 | 2001-07-17 | Rise Technology Company | Execution of data dependent arithmetic instructions in multi-pipeline processors |
- 1998
  - 1998-09-11: US application US09/151,634 filed; granted as US6304954B1 (status: not active, Expired - Fee Related)
- 2001
  - 2001-04-26: US application US09/842,026 filed; granted as US6408377B2 (status: not active, Expired - Fee Related)
  - 2001-04-26: US application US09/842,108 filed; published as US20010014940A1 (status: not active, Abandoned)
  - 2001-04-26: US application US09/842,107 filed; granted as US6341343B2 (status: not active, Expired - Fee Related)
Also Published As
Publication number | Publication date |
---|---|
US6304954B1 (en) | 2001-10-16 |
US20010016900A1 (en) | 2001-08-23 |
US20010014939A1 (en) | 2001-08-16 |
US6408377B2 (en) | 2002-06-18 |
US6341343B2 (en) | 2002-01-22 |
Similar Documents
Publication | Title
---|---
US6408377B2 (en) | Dynamic allocation of resources in multiple microprocessor pipelines
US6256726B1 (en) | Data processor for the parallel processing of a plurality of instructions
US5333280A (en) | Parallel pipelined instruction processing system for very long instruction word
US5357617A (en) | Method and apparatus for substantially concurrent multiple instruction thread processing by a single pipeline processor
US4734852A (en) | Mechanism for performing data references to storage in parallel with instruction execution on a reduced instruction-set processor
US6044451A (en) | VLIW processor with write control unit for allowing less write buses than functional units
EP0968463B1 (en) | VLIW processor processes commands of different widths
KR19990087940A | A method and system for fetching noncontiguous instructions in a single clock cycle
JPH1165844A | Data processor with pipeline bypass function
US4739470A (en) | Data processing system
US5459847A (en) | Program counter mechanism having selector for selecting up-to-date instruction prefetch address based upon carry signal of adder which adds instruction size and LSB portion of address register
US6292845B1 (en) | Processing unit having independent execution units for parallel execution of instructions of different category with instructions having specific bits indicating instruction size and category respectively
US20060095746A1 (en) | Branch predictor, processor and branch prediction method
US6026486A (en) | General purpose processor having a variable bitwidth
US5617549A (en) | System and method for selecting and buffering even and odd instructions for simultaneous execution in a computer
KR100431975B1 (en) | Multi-instruction dispatch system for pipelined microprocessors with no branch interruption
US6263424B1 (en) | Execution of data dependent arithmetic instructions in multi-pipeline processors
EP0496407A2 (en) | Parallel pipelined instruction processing system for very long instruction word
US7134000B2 (en) | Methods and apparatus for instruction alignment including current instruction pointer logic responsive to instruction length information
US4707783A (en) | Ancillary execution unit for a pipelined data processing system
US6119220A (en) | Method of and apparatus for supplying multiple instruction strings whose addresses are discontinued by branch instructions
EP0251716A2 (en) | Instruction decoding microengines
US5862399A (en) | Write control unit
US5828861A (en) | System and method for reducing the critical path in memory control unit and input/output control unit operations
JPH06131180A (en) | Instruction processing system and instruction processor
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION