CN101944012B - Instruction processing method and superscalar pipelined microprocessor - Google Patents


Info

Publication number
CN101944012B
CN101944012B CN201010243151.1A
Authority
CN
China
Prior art keywords
mentioned
instruction
source operand
computing
alu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010243151.1A
Other languages
Chinese (zh)
Other versions
CN101944012A (en)
Inventor
Gerard M. Col
Colin Eddy
Rodney E. Hooker
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Via Technologies Inc
Original Assignee
Via Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 12/609,193 (US9952875B2)
Application filed by Via Technologies Inc filed Critical Via Technologies Inc
Publication of CN101944012A
Application granted
Publication of CN101944012B
Legal status: Active
Anticipated expiration

Abstract

A superscalar pipelined microprocessor includes a register set defined by an instruction set architecture of the microprocessor, a cache memory, execution units, and a store unit coupled to the cache memory and distinct from the other execution units of the microprocessor. The store unit comprises an ALU. The store unit receives an instruction that specifies a source register of the register set and an operation to be performed on a source operand to generate a result. The store unit reads the source operand from the source register. The ALU performs the operation on the source operand to generate the result, rather than forwarding the source operand to any of the other execution units of the microprocessor to perform the operation. The store unit also writes the result to the cache memory.

Description

Instruction processing method and superscalar pipelined microprocessor applying the same
Technical field
The present invention relates generally to the field of microprocessors, and more particularly to the microarchitecture of a microprocessor.
Background technology
A typical reduced instruction set computer (RISC) architecture processor employs a load/store architecture; that is, the processor includes a load instruction that loads an operand from memory into a register of the processor, and a store instruction that stores an operand from a register of the processor into memory. In the typical case, the load and store instructions are the only instructions that access memory: the other instructions, which perform arithmetic/logical operations, receive their operands from registers and write their results to registers. That is, instructions other than loads and stores are not allowed to specify an operand in memory. This enables most non-load/store instructions to complete in a single clock cycle, whereas a load instruction may require several clock cycles to access memory (whether cache or system memory). Consequently, a typical instruction sequence might include a load instruction that fetches an operand from memory into a first register; followed by an arithmetic/logical instruction that performs an arithmetic/logical operation (e.g., add, subtract, increment, multiply, shift/rotate, Boolean AND, Boolean OR, Boolean NOT, and so on) on the operand in the first register and writes the result to a second register; followed in turn by a store instruction that writes the result in the second register to memory. The advantages of this prominent example of the load/store architecture are well known.
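The load/store discipline described above can be sketched in a few lines (a hypothetical Python model; the register names, addresses, and values are illustrative, not from the patent):

```python
# Minimal model of a load/store architecture: only load and store touch
# memory; the arithmetic instruction works register-to-register only.
regs = {f"r{i}": 0 for i in range(8)}
mem = {0x100: 7, 0x104: 0}

def load(dst, addr):
    """Load instruction: memory -> register."""
    regs[dst] = mem[addr]

def add_imm(dst, src, imm):
    """Arithmetic instruction: registers only, no memory operand allowed."""
    regs[dst] = regs[src] + imm

def store(src, addr):
    """Store instruction: register -> memory."""
    mem[addr] = regs[src]

# The typical three-instruction sequence from the text: load, ALU op, store.
load("r1", 0x100)
add_imm("r2", "r1", 1)
store("r2", 0x104)
```

The point of the sketch is the restriction itself: `add_imm` never reads `mem`, so any memory operand must first pass through a load.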
One result of the load/store architecture is that many processors include separate load and store units, distinct from the execution units that perform the arithmetic/logical operations. That is, a load unit only loads data from memory into a register, a store unit only stores data from a register to memory, and an arithmetic logic unit (ALU) performs arithmetic/logical operations on operands from source registers and writes the result to a destination register. Thus, in the instruction-sequence example above, the load unit executes the load instruction to fetch the operand from memory into the first register; an ALU executes the arithmetic/logical instruction, performing the operation on the operand in the first register (perhaps together with a second operand from another register) and writing the result to the second register; and finally, the store unit executes the store instruction that writes the result in the second register to memory.
The advantage of separate load/store units and ALUs is simplicity of design and speed. The disadvantage, however, is that forwarding results between the units via registers consumes significant time. Part of this problem can be addressed with forwarding buses, which transfer a result directly from one execution unit to another without going through a register; still, forwarding itself consumes time and introduces delay into the process. The time consumed is mainly a function of distance and of resistance-capacitance (RC) circuit time constants: the distance a signal must travel on the forwarding buses between the different execution units, and the RC time constant of the signal traces involved. The total forwarding delay for a result can amount to one or more clock cycles, depending on the layout of the execution units in a given design and the process technology used.
Summary of the invention
One embodiment of the invention provides a superscalar pipelined microprocessor. The superscalar pipelined microprocessor comprises a register set defined by an instruction set architecture of the microprocessor, a cache memory, a plurality of execution units, and a store unit coupled to the cache memory. The store unit is distinct from the other execution units of the microprocessor, and the store unit comprises an ALU. The store unit receives a first instruction that specifies a first source register of the register set and a first operation to be performed on a first source operand to generate a result. The store unit also reads the first source operand from the first source register. The ALU performs the first operation on the first source operand to generate the result, rather than forwarding the first source operand to any of the other execution units to perform the first operation on the first source operand to generate the result. The store unit further writes the result to the cache memory.
Another embodiment of the invention provides an instruction processing method for a superscalar pipelined microprocessor having a register set defined by an instruction set architecture of the microprocessor, a cache memory, a plurality of execution units, and a store unit distinct from the other execution units of the microprocessor. The method comprises: receiving, by the store unit, a first instruction that specifies a first source register of the register set and a first operation to be performed on a first source operand to generate a result; reading, by the store unit, the first source operand from the first source register; performing, by an ALU of the store unit, the first operation on the first source operand to generate the result, rather than forwarding the first source operand to any of the other execution units to perform the first operation; and writing, by the store unit, the result to the cache memory.
Brief description of the drawings
Fig. 1 is a block diagram of a superscalar pipelined microprocessor according to the present invention.
Fig. 2 is a block diagram of the load unit 124 of Fig. 1 according to the present invention.
Fig. 3 is a flowchart of the operation of the superscalar pipelined microprocessor 100 of Fig. 1 according to the present invention.
Fig. 4 is a flowchart of the operation of a conventional microprocessor, for comparison with the superscalar pipelined microprocessor 100 of the present invention.
Fig. 5 is a timing diagram illustrating the benefit described in one embodiment of the invention.
Fig. 6 is a block diagram of the load unit according to another embodiment of the invention.
Fig. 7 is a block diagram of the load unit according to yet another embodiment of the invention.
Fig. 8 is a timing diagram illustrating the benefit described in another embodiment of the invention.
Fig. 9 is a block diagram of the store unit 126 of Fig. 1 according to the present invention.
Fig. 10 is a flowchart of the operation of the superscalar pipelined microprocessor 100 of Fig. 1 according to the present invention.
Fig. 11 is a flowchart of the operation of a conventional microprocessor, for comparison with the superscalar pipelined microprocessor 100 of the present invention.
Fig. 12 is a timing diagram illustrating the benefit described in another embodiment of the invention.
[Description of reference numerals]
100~superscalar pipelined microprocessor;
102~instruction cache;
104~instruction translator;
106~register alias table;
108~reservation station;
112~general-purpose register set;
114~reorder buffer;
116~memory subsystem;
122~other execution units;
124~load unit;
126~store unit;
132~macroinstruction;
134~microinstruction;
142, 162~ALU;
144, 146~bus;
148~forwarding bus;
152~result bus;
154, 156~ALU result;
202~address generator;
204~translation lookaside buffer;
206~cache tag array;
208~cache data array;
212~control logic;
214~multiplexer;
222~virtual load address;
224~physical address;
226~status;
228~cache line;
232~load data;
234~hit/miss;
652, 952~second operand;
662~storage;
946~store data.
Embodiment
The present inventors have observed that in a pipelined load unit design, part of the clock cycle in the final stage may go unused; that is, the delay through the circuitry of the load unit's last stage consumes only a small fraction of the cycle time. Therefore, in one embodiment the present invention advantageously incorporates an ALU into the last stage of the load unit, so that the load unit performs an arithmetic/logical operation on the load data fetched from memory before the data is written to the destination register. This advantageous design saves the time otherwise consumed in forwarding the load data to a separate arithmetic/logical execution unit to perform the operation. The microprocessor of the present invention employs a load/store microarchitecture that implements a non-load/store macroarchitecture, namely the x86 architecture. The instruction translator generates a special type of load microinstruction (referred to herein as an ldalu microinstruction) that directs the load unit both to perform the load from memory and to perform the appropriate ALU operation on the load data. This enables the microprocessor to implement a complex macroinstruction, which requires both a memory read and an ALU operation, entirely within one execution unit, so that no other execution unit is needed to perform the ALU operation, thereby avoiding the delay of forwarding the result.
Fig. 1 is a block diagram of a superscalar pipelined microprocessor according to the present invention. The superscalar pipelined microprocessor 100 includes an instruction cache 102 that caches macroinstructions of an instruction set architecture (e.g., the x86 instruction set architecture). The macroinstructions 132 include instructions that require both a memory access and an ALU function. For example, an x86 MOVZX reg/mem (move with zero-extend) instruction directs the microprocessor 100 to copy the contents of a source operand in memory to a destination register and to zero-extend the value. Zero extension matters when the size of the destination register is larger than the effective size of the memory operand. Other examples include x86 instructions that involve a memory operand and an ALU function such as add (ADD), subtract (SUB), increment (INC), decrement (DEC), multiply (MUL), shift (SAL/SAR/SHL/SHR), rotate (RCL/RCR/ROL/ROR), AND, OR, NOT, and exclusive-OR (XOR).
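As a concrete illustration of the zero extension just described, a minimal Python model (the widths, names, and defaults are assumptions for illustration, not definitions from the patent):

```python
# Model of MOVZX semantics: an 8-bit memory operand copied into a 32-bit
# destination with the upper 24 bits cleared (the sign bit is NOT propagated).
def movzx(mem_operand: int, src_bits: int = 8, dst_bits: int = 32) -> int:
    mask = (1 << src_bits) - 1      # keep only the operand's own bits
    value = mem_operand & mask      # high (dst_bits - src_bits) bits become 0
    assert value < (1 << dst_bits)  # result fits the destination register
    return value
```

So `movzx(0x80)` yields `0x00000080`, whereas a sign-extending move would yield `0xFFFFFF80`; that difference is exactly why the operation is significant when the destination is wider than the memory operand.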
The superscalar pipelined microprocessor 100 includes an instruction translator 104 coupled to the instruction cache 102. The instruction translator 104 translates macroinstructions 132 into microinstructions 134, for example an ldalu microinstruction, which directs the load unit 124 (described further below) to read data from memory and to perform an ALU operation on the loaded data. In another embodiment, the instruction translator 104 translates macroinstructions 132 into microinstructions 134 such as a stalu microinstruction, which directs the store unit 126 (described further below) to perform an ALU operation on the store data and to store the result to memory.
The superscalar pipelined microprocessor 100 includes a register alias table (RAT) 106, which generates microinstruction dependencies and dispatches the microinstructions 134 in program order to reservation stations 108, from which the microinstructions 134 are issued to the execution units (the load unit 124, the store unit 126, and the other execution units 122) for execution. In one embodiment, the reservation stations 108 issue the microinstructions 134 out of program order. The other execution units 122 may include, for example, integer ALUs, floating-point units, and single-instruction-multiple-data (SIMD) execution units (e.g., MultiMedia eXtension (MMX) units or Streaming SIMD Extensions (SSE) units). The execution units 122/124/126 each provide their results 152/154/156 to a reorder buffer (ROB) 114, which ensures in-order retirement of instructions to architectural state. The microprocessor 100 also includes a memory subsystem 116 coupled to the load unit 124 and the store unit 126; the memory subsystem 116 includes cache memories, load buffers, store buffers, and a bus interface unit.
The execution units 122/124/126 receive operands from a general purpose register (GPR) set 112. The execution units 122/124/126 also receive one another's results 152/154/156 as operands on forwarding buses 148. In particular, the load unit 124 receives operands on a bus 144, and the store unit 126 receives operands on a bus 146. The load unit 124 includes an ALU 142, and the store unit 126 includes an ALU 162, whose operation is described further below.
Integrating an ALU function into the load instruction
Fig. 2 is a block diagram of the load unit 124 of Fig. 1 according to the present invention. The load unit 124 includes an address generator 202 that uses the source operands 144 of Fig. 1 to generate a virtual load address 222 (i.e., the memory address from which the data will be loaded). The load unit 124 accesses a translation lookaside buffer (TLB) 204 of the memory subsystem 116, which looks up the virtual address 222 and provides the translated physical address 224. A cache tag array 206 looks up a tag portion of the physical address 224 and provides a status 226 for each way of the cache. An index portion of the physical address 224 indexes a cache data array 208, which outputs a cache line 228 for each way of the cache. Control logic 212 examines the status 226 to determine whether the physical address 224 hits or misses (234) in the cache. In addition, the control logic 212 controls a multiplexer 214 that selects the appropriate cache line 228 output by the data array 208, and then selects the data within the cache line specified by the load instruction or ldalu microinstruction (which, depending on the embodiment, may be 1, 2, 4, 8, 16, 32, or 64 bytes) and provides it as the load data 232.
In a conventional load unit, the load data 232 would simply be provided as the result of a conventional load instruction. The load unit 124 of the present invention, however, advantageously also includes the ALU 142 of Fig. 1, which receives the load data 232 and may perform an ALU operation on it to generate an ALU result (alu_result) 154. (If the instruction is a regular load instruction, the ALU 142 simply passes the load data 232 through as the ALU result 154.) The ALU 142 performs various operations according to different embodiments.
In one embodiment, the ALU 142 performs a zero-extend operation, and includes AND gates that mask off the high-order bits of the load data 232 beyond the size of the memory operand specified by the ldalu microinstruction.
In other embodiments, the ALU 142 additionally performs one or more single-operand operations, including but not limited to the following:
1. Boolean NOT: the ALU result 154 is the load data 232 with each bit inverted.
2. Negate (NEG): the ALU result 154 is the two's complement negation of the load data 232.
3. Increment: the ALU result 154 is the load data 232 plus one.
4. Decrement: the ALU result 154 is the load data 232 minus one.
5. Sign-extend: the ALU result 154 is the sign-extended load data 232.
6. Zero detect: the ALU result 154 is true if the load data 232 is zero; otherwise, if the load data 232 is nonzero, the ALU result 154 is false.
7. Ones detect: the ALU result 154 is true if all bits of the load data 232 are binary one; otherwise, the ALU result 154 is false.
8. Data format conversion: the ALU result 154 is the load data 232 formatted into a specified data format, e.g., a floating-point format or a SIMD format.
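The single-operand operations listed above can be sketched in software (an illustrative 32-bit Python model; the operation names and width are assumptions, not the patent's hardware):

```python
# Illustrative 32-bit model of the single-operand ALU operations on load data.
WIDTH = 32
MASK = (1 << WIDTH) - 1

def alu_single(op, load_data):
    if op == "not":                      # 1. Boolean NOT: invert every bit
        return ~load_data & MASK
    if op == "neg":                      # 2. two's-complement negation
        return -load_data & MASK
    if op == "inc":                      # 3. increment
        return (load_data + 1) & MASK
    if op == "dec":                      # 4. decrement
        return (load_data - 1) & MASK
    if op == "sext8":                    # 5. sign-extend (from 8 bits here)
        b = load_data & 0xFF
        return (b | (MASK ^ 0xFF)) if b & 0x80 else b
    if op == "is_zero":                  # 6. zero detect
        return load_data == 0
    if op == "is_ones":                  # 7. all-ones detect
        return load_data == MASK
    raise ValueError(f"unknown op {op!r}")
```

Each branch is a one-gate-delay-class operation, consistent with the idea that it can fit in the unused portion of the load unit's final stage.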
In another embodiment, shown in Fig. 6, the ALU 142 receives a second operand 652 and performs a two-operand ALU function on the second operand 652 and the load data 232. The ALU 142 may additionally perform one or more two-operand operations, including but not limited to the following:
9. Boolean logic (AND, OR, XOR, NOR): the ALU 142 performs the specified Boolean operation on the second operand 652 and the load data 232 to generate the ALU result 154.
10. Arithmetic (ADD, SUB, MUL): the ALU 142 performs the specified arithmetic operation on the second operand 652 and the load data 232 to generate the ALU result 154.
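Likewise, the two-operand operations can be sketched (an illustrative 32-bit Python model with assumed operation names):

```python
# Illustrative 32-bit model of the two-operand ALU operations: the second
# operand is combined with the load data inside the load unit.
MASK32 = (1 << 32) - 1

def alu_dual(op, load_data, second_operand):
    ops = {
        "and": lambda a, b: a & b,              # Boolean logic operations (9)
        "or":  lambda a, b: a | b,
        "xor": lambda a, b: a ^ b,
        "nor": lambda a, b: ~(a | b) & MASK32,
        "add": lambda a, b: (a + b) & MASK32,   # arithmetic operations (10)
        "sub": lambda a, b: (a - b) & MASK32,
        "mul": lambda a, b: (a * b) & MASK32,
    }
    return ops[op](load_data, second_operand)
```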
In another embodiment, shown in Fig. 6, the load unit 124 includes storage 662 used to hold the second operand 652 when the load address misses in the data cache; the miss requires the load unit 124 to fetch the load data from system memory and to re-execute the ldalu microinstruction.
Fig. 3 is a flowchart of the operation of the superscalar pipelined microprocessor 100 of Fig. 1 according to the present invention. Flow begins at block 302.
At block 302, the instruction translator 104 decodes a macroinstruction 132 and translates it into a single ldalu microinstruction 134. The operation specified by the macroinstruction 132 includes generating a memory address from which data is to be fetched. The macroinstruction 132 also specifies an ALU operation to be performed on the data fetched from memory to generate a result, and specifies a general-purpose register 112 as the destination register for the result. The ldalu microinstruction specifies the same address operands as the macroinstruction 132, the same ALU operation, and the same general-purpose register 112 as its destination operand. For example, the macroinstruction 132 may be an x86 MOVZX reg/mem instruction or a PMOVZX reg/mem instruction, in which case the instruction translator 104 translates the macroinstruction 132 into a single ldalu microinstruction whose specified ALU operation is a zero-extend. Flow proceeds to block 304.
At block 304, the reservation station 108 issues the ldalu microinstruction to the load unit 124. Flow proceeds to block 306.
At block 306, the load unit 124 generates the virtual load address 222 from the source operands 144 specified by the ldalu microinstruction. Flow proceeds to block 308.
At block 308, the load unit 124 looks up the virtual load address 222 in the translation lookaside buffer 204 to obtain the physical load address 224. Flow proceeds to block 312.
At block 312, the load unit 124 uses the physical load address 224 to access the cache tag array 206 and the cache data array 208 to obtain the status 226 and cache lines 228, and the multiplexer 214 selects the load data 232 specified by the ldalu microinstruction. Flow proceeds to block 322.
At block 322, the load unit 124 performs the ALU operation specified by the ldalu microinstruction on the load data 232 to generate the ALU result 154. Flow proceeds to block 324.
At block 324, the load unit 124 outputs the ALU result 154 on its result bus. Notably, having the ALU 142 perform the required ALU operation advantageously avoids the need to forward the load data 232 to another execution unit 122 to perform the ALU operation, and the attendant forwarding delay. Flow proceeds to block 326.
At block 326, the reorder buffer 114 receives and stores the ALU result 154 from the result bus of the load unit 124. Flow proceeds to block 328.
At block 328, the reorder buffer 114 retires the stored ALU result 154 to the destination general-purpose register 112. Flow ends.
Fig. 4 is a flowchart of the operation of a conventional microprocessor, for comparison with the superscalar pipelined microprocessor 100 of the present invention. Although the elements of the microprocessor 100 of Fig. 1 appear in the description of Fig. 4, it should be understood that the load unit of the microprocessor of Fig. 4 does not include an ALU for performing an ALU operation on the load data, and its instruction translator does not generate the special ldalu microinstruction. Flow begins at block 402.
At block 402, the instruction translator 104 decodes a macroinstruction 132 and translates it into two microinstructions 134: a load microinstruction and an alu microinstruction. For example, the macroinstruction 132 may be an x86 MOVZX reg/mem instruction or a PMOVZX reg/mem instruction, in which case the instruction translator 104 translates the macroinstruction 132 into a load microinstruction and an alu microinstruction whose ALU function is a zero-extend. The register alias table 106 then generates, for the alu microinstruction, a dependency on the load microinstruction. Flow proceeds to block 404.
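The contrast between this two-microinstruction translation and the single ldalu translation of block 302 can be sketched as follows (the micro-op tuples are hypothetical notation for illustration, not the actual microinstruction encoding):

```python
# Conventional translation: a load micro-op plus a dependent alu micro-op.
def translate_conventional(dst_reg, addr_operands):
    tmp = "tmp0"                                  # result of the load micro-op
    return [("load", tmp, addr_operands),
            ("alu.zero_extend", dst_reg, tmp)]    # RAT marks this dependent on tmp

# ldalu translation: one fused micro-op does the load and the ALU operation.
def translate_ldalu(dst_reg, addr_operands):
    return [("ldalu.zero_extend", dst_reg, addr_operands)]

conventional = translate_conventional("eax", ("ebx", 0x10))
fused = translate_ldalu("eax", ("ebx", 0x10))
```

The dependency in the conventional pair is what forces the forwarding step that the fused form avoids.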
At block 404, the reservation station 108 issues the load microinstruction to the load unit 124. Flow proceeds to block 406.
At block 406, the load unit 124 generates the virtual load address 222 from the source operands 144 specified by the load microinstruction. Flow proceeds to block 408.
At block 408, the load unit 124 looks up the virtual load address 222 in the translation lookaside buffer 204 to obtain the physical load address 224. Flow proceeds to block 412.
At block 412, the load unit 124 uses the physical load address 224 to access the cache tag array 206 and the cache data array 208 to obtain the status 226 and cache lines 228, and the multiplexer 214 selects the load data 232 specified by the load microinstruction. Flow proceeds to blocks 414 and 416.
At block 414, the load unit 124 outputs the fetched load data 232 on its result bus. Flow proceeds to block 418.
At block 416, because the load data 232 is now available as a source operand, the reservation station 108 issues the alu microinstruction to an execution unit 122 (for example, an integer unit). Flow proceeds to block 418.
At block 418, the integer unit 122 receives the load data 232 from the result bus of the load unit 124 as a source operand. Flow proceeds to block 422.
At block 422, the integer unit 122 performs the ALU operation specified by the alu microinstruction on the load data 232 received from the load unit 124 to generate an ALU result. Flow proceeds to block 424.
At block 424, the integer unit 122 outputs the ALU result on its result bus 152. Flow proceeds to block 426.
At block 426, the reorder buffer 114 receives and stores the ALU result from the result bus 152 of the integer unit 122. Flow proceeds to block 428.
At block 428, the reorder buffer 114 retires the stored ALU result to the destination general-purpose register 112. Flow ends.
Comparing Fig. 3 with Fig. 4 reveals that having the instruction translator 104 generate a single ldalu microinstruction, and having the load unit 124 include an ALU 142 to perform the ALU operation specified by the ldalu microinstruction, advantageously avoids the forwarding required by the conventional microprocessor, as illustrated in Fig. 5.
Fig. 5 is a timing diagram illustrating the benefit described in one embodiment of the invention. Six clock cycles are shown. The left side of the figure shows the pipeline stages, separated by registers, of the conventional microprocessor; the right side shows the pipeline stages, separated by registers, of the superscalar pipelined microprocessor 100 according to one embodiment of the invention. The example of Fig. 5 assumes that the load unit 124 includes four pipeline stages, denoted A, B, C, and D. It should be noted, however, that in other embodiments the load unit 124 may have a different number of pipeline stages. The example of Fig. 5 also assumes that the integer ALU of the conventional microprocessor comprises a single stage.
In the conventional microprocessor, a load instruction proceeds through the load unit pipeline stages A, B, C, and D during clock cycles 1, 2, 3, and 4, respectively. The load data is then forwarded to the integer unit, which performs an ALU operation on the load data during cycle 5. Finally, during cycle 6, the ALU result generated by the integer unit is written to the reorder buffer 114 and forwarded to the other execution units 122.
In the superscalar pipelined microprocessor 100 of Fig. 1, much as in the conventional microprocessor, an ldalu microinstruction proceeds through the load unit pipeline stages A, B, C, and D during clock cycles 1, 2, 3, and 4, respectively. Unlike the conventional microprocessor, however, in the load unit's pipeline stage D during cycle 4, the ALU 142 performs the ALU operation specified by the ldalu microinstruction on the load data 232 to generate the ALU result 154. During cycle 5, the ALU result 154 generated by the load unit 124 is written to the reorder buffer 114 and forwarded to the other execution units 122. Thus, the superscalar pipelined microprocessor 100 of Fig. 1 generates the ALU result 154, and makes it available to other instructions, at least one clock cycle earlier than the conventional microprocessor. Moreover, as noted above, the savings grows as the forwarding delay grows, i.e., as the distances the signals must travel on the forwarding buses between the different execution units, and the RC time constants of the signal traces, increase.
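The one-cycle saving can be tallied from the figure's stated assumptions (a back-of-envelope sketch; the cycle accounting is illustrative):

```python
# Cycle counts for the Fig. 5 scenario: four load-unit stages A-D, a
# single-stage integer ALU, and writeback the cycle after a result is produced.
LOAD_STAGES = 4                               # stages A, B, C, D -> cycles 1..4

# Conventional: forward the load data to the integer unit and run the ALU
# (cycle 5), then write the ALU result to the ROB (cycle 6).
conventional_writeback = LOAD_STAGES + 1 + 1  # cycle 6

# ldalu: ALU 142 runs inside stage D (cycle 4); writeback occurs in cycle 5.
ldalu_writeback = LOAD_STAGES + 1             # cycle 5

cycles_saved = conventional_writeback - ldalu_writeback
```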
Fig. 7 is a block diagram of the load unit according to another embodiment of the present invention. The load unit 124 of this figure is similar to the load unit 124 of Fig. 1; however, the load unit 124 of Fig. 7 also forwards the ALU result 154 back internally to itself for use as a source operand 144 in computing the load address 222 of a subsequent load instruction (or ldalu microinstruction). In some designs this internal forwarding path may be shorter than the external forwarding path of a conventional microprocessor, in which another execution unit performs the ALU operation and forwards the result to the load unit 124 as a source operand 144. The advantage of the internal forwarding path is illustrated in Fig. 8.
Fig. 8 is a timing diagram similar to that of Fig. 5. The example of Fig. 8, however, assumes that in the conventional microprocessor a load instruction follows an alu microinstruction and uses the ALU result of the alu microinstruction as a source operand for computing its load address. Similarly, it assumes that in the superscalar pipelined microprocessor 100 a load instruction follows an ldalu microinstruction and uses the ALU result 154 of the ldalu microinstruction as a source operand for computing its load address. In addition, the example of Fig. 8 assumes that the conventional microprocessor requires an extra clock cycle (cycle 5) to forward the result from the load unit 124 to the integer unit 122, and another extra clock cycle (cycle 7) to forward the result from the integer unit 122 back to the load unit 124. As shown in Fig. 8, the load unit 124 of the present invention performs the ALU operation specified by the ldalu microinstruction in pipeline stage D of clock cycle 4 and forwards the ALU result 154 back internally to itself in clock cycle 5, so that the load unit 124 can use the ALU result 154 to produce the load address 222 without the ALU result having to be forwarded to the load unit 124 from another execution unit. Consequently, in this example the superscalar pipelined microprocessor 100 having the load unit 124 of Fig. 7 can advantageously process the ldalu-and-load microinstruction sequence in three fewer clock cycles than the conventional microprocessor.
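The dependent-address scenario of Fig. 8 can be modeled the same way. The one-cycle forwarding-bus cost and the stage counts are the figure's assumptions:

```python
# Sketch of the Fig. 8 scenario: a load whose address depends on the
# previous instruction's ALU result (hypothetical cycle counts).
LOAD_STAGES = 4   # load unit stages A-D (assumption)
FORWARD = 1       # one cycle to cross the forwarding bus between units (assumption)

def conventional_dependent_load_start():
    load_done = LOAD_STAGES              # first load: cycles 1-4
    to_integer = load_done + FORWARD     # cycle 5: forward to integer unit
    alu_done = to_integer + 1            # cycle 6: integer unit ALU op
    back_to_load = alu_done + FORWARD    # cycle 7: forward back to load unit
    return back_to_load + 1              # cycle 8: dependent load can begin

def fused_dependent_load_start():
    ldalu_done = LOAD_STAGES             # ALU op performed in stage D, cycle 4
    return ldalu_done + 1                # cycle 5: internal loop-back, load begins

print(conventional_dependent_load_start() - fused_dependent_load_start())  # 3
```

The difference of three cycles corresponds to the three-cycle saving stated for the load unit of Fig. 7.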
Fig. 9 is a block diagram of the store unit 126 of Fig. 1 according to the present invention. The store unit 126 includes the ALU 162 of Fig. 1, which receives the store data 946 either from the general-purpose register set 112 via bus 146 or from the execution units 122/124/126 via the forwarding buses 148. The ALU 162 performs an ALU operation on the store data 946 to produce an ALU result 156, which is provided to a store buffer in the memory subsystem 116, to the reorder buffer 114, and to the execution units 122/124/126 via the forwarding buses 148. The store buffer ultimately writes the ALU result 156 to memory. The ALU operation may be any of the single-source-operand ALU operations performed by the load unit 124 as described with respect to Fig. 2. In addition, in one embodiment the ALU 162 may receive a second operand 952, so that the ALU operation may be any of the two-source-operand ALU operations performed by the load unit 124 as described with respect to Fig. 6.
As described below with respect to Figs. 10 through 12, by incorporating the ALU 162 into the store unit 126 so that the ALU operation is performed on the store data 946 before the store data 946 is written to memory, the forwarding delay incurred by a conventional microprocessor can advantageously be avoided.
In one embodiment, the superscalar pipelined microprocessor 100 breaks a store operation into two distinct microinstructions, a store-address microinstruction and a store-data microinstruction, and includes two separate units, a store address unit and a store data unit, to execute the store-address microinstruction and the store-data microinstruction, respectively. The store address unit includes an address generator (similar to the address generator 202 of the load unit 124) that generates a virtual store address from the source operands specified by the store-address microinstruction. The store address unit then looks up the virtual store address in the translation lookaside buffer 204 to obtain the translated physical store address, which the store address unit writes to a store buffer of the memory subsystem 116 that has been allocated to the store operation. The physical store address in the store buffer is eventually written to the cache tag array 206 and cache data array 208, or to system memory. In a conventional microprocessor, the store unit merely receives the store data (no execution unit other than the store unit performs an ALU operation on it) and writes the store data to the store buffer; the store buffer ultimately writes the store data received from the store data unit to the store address produced by the store address unit. In one embodiment, the store address unit is not shown, and the store data unit is the store unit 126 of Fig. 9.
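The split into store-address and store-data microinstructions that meet in a shared store buffer entry can be sketched as follows. The class and function names, and the dictionary standing in for the translation lookaside buffer, are illustrative only:

```python
# Sketch of splitting one store operation into a store-address microinstruction
# and a store-data microinstruction that fill the same store buffer entry.

class StoreBuffer:
    def __init__(self):
        self.entries = {}                    # store-buffer id -> {"addr", "data"}

    def allocate(self, sbid):
        self.entries[sbid] = {"addr": None, "data": None}

def store_address_uop(sb, sbid, base, offset, tlb):
    virtual = base + offset                  # address generator (cf. 202)
    sb.entries[sbid]["addr"] = tlb[virtual]  # TLB lookup -> physical address

def store_data_uop(sb, sbid, data):
    sb.entries[sbid]["data"] = data          # store data arrives independently

sb = StoreBuffer()
sb.allocate(0)
tlb = {0x1010: 0x9010}                       # toy virtual-to-physical mapping
store_address_uop(sb, 0, base=0x1000, offset=0x10, tlb=tlb)
store_data_uop(sb, 0, data=0xAB)
print(sb.entries[0])                         # {'addr': 36880, 'data': 171}
```

The point of the split is visible in the sketch: the address and the data paths are independent, and the store buffer entry is complete only when both microinstructions have executed.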
Figure 10 is a flowchart illustrating operation of the superscalar pipelined microprocessor 100 of Fig. 1 according to the present invention. Flow begins at block 1002.
At block 1002, the instruction translator 104 decodes a macroinstruction 132 and translates it into a single stalu microinstruction 134. The macroinstruction 132 specifies a general-purpose register 112 that holds an operand, an ALU operation to be performed on the operand to produce a result, and that the result is to be written to memory. The stalu microinstruction specifies as its source operand the same general-purpose register 112 specified by the macroinstruction 132. Furthermore, the stalu microinstruction also specifies the same ALU operation specified by the macroinstruction 132. Flow proceeds to block 1004.
At block 1004, the reservation station 108 issues the stalu microinstruction to the store unit 126, and flow proceeds to block 1006.
At block 1006, the store unit 126 receives the store data 946 from the general-purpose register 112 specified by the stalu microinstruction (or from the forwarding buses 148). If the stalu microinstruction specifies a two-operand ALU operation, the store unit 126 also receives a second operand 952 from a second general-purpose register 112 (or from the forwarding buses 148). The ALU 162 of the store unit 126 performs the ALU operation specified by the stalu microinstruction on the store data 946 (and, when specified, on the second operand 952) to produce the ALU result 156. Flow proceeds to block 1008.
At block 1008, the store unit 126 writes the ALU result 156 to a store buffer of the memory subsystem 116. As mentioned above, in one embodiment the physical memory address to which the ALU result 156 will be written is also written to the store buffer by the store address unit, in response to a store-address microinstruction. Flow proceeds to block 1012.
At block 1012, the store buffer writes the ALU result 156 to memory, and flow ends.
Figure 11 is a flowchart illustrating operation of a conventional microprocessor, for comparison with the operation of the superscalar pipelined microprocessor 100 of the present invention. Although components of the superscalar pipelined microprocessor 100 of Fig. 1 appear in the description of Fig. 11, it should be understood that the store unit of the conventional microprocessor does not include an ALU for performing an ALU operation on the store data, and that the instruction translator of the conventional microprocessor does not generate a special stalu microinstruction. Flow begins at block 1102.
At block 1102, the instruction translator 104 decodes a macroinstruction 132 and translates it into two microinstructions 134. The macroinstruction 132 specifies a general-purpose register 112 that holds an operand, an ALU operation to be performed on the operand to produce a result, and that the result is to be written to memory. The first of the translated microinstructions is an alu microinstruction, which specifies as its source operand the same general-purpose register 112 specified by the macroinstruction 132, and which also specifies the same ALU operation specified by the macroinstruction 132. The alu microinstruction specifies a temporary register as its destination operand. The second of the translated microinstructions is a store microinstruction, which specifies the temporary register as its source operand (that is, as its store data). Flow proceeds to block 1104.
At block 1104, the reservation station 108 issues the alu microinstruction to the integer unit 122, and flow proceeds to block 1106.
At block 1106, the integer unit 122 receives the source operand from the general-purpose register 112 specified by the alu microinstruction and performs the ALU operation specified by the alu microinstruction on the source operand to produce a result. Flow proceeds to blocks 1108 and 1112.
At block 1108, the integer unit 122 outputs the result onto the result bus 152, and flow proceeds to block 1114.
At block 1112, the reservation station 108 issues the store microinstruction to the store unit 126, and flow proceeds to block 1114.
At block 1114, the store unit 126 receives the result from the integer unit 122 via the result bus 152, and flow proceeds to block 1116.
At block 1116, the store unit 126 writes the result to the store buffer, and flow proceeds to block 1118.
At block 1118, the store buffer writes the result to memory, and flow ends.
Comparing Fig. 10 with Fig. 11, it can be seen that because the instruction translator 104 generates a single stalu microinstruction, and because the store unit 126 includes an ALU 162 to perform the ALU operation specified by the stalu microinstruction, the forwarding incurred by the conventional microprocessor is advantageously avoided, as shown in Fig. 12.
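The two translations being compared can be sketched side by side. The tuple encoding of microinstructions and the register/address names are illustrative, not the patent's actual microinstruction format:

```python
# Sketch contrasting the translations of a "compute, then store to memory"
# macroinstruction: Fig. 11 (conventional) versus Fig. 10 (stalu).

def translate_conventional(macro):
    op, src_reg, mem_addr = macro
    tmp = "tmp0"                               # temporary register (block 1102)
    return [("alu", op, src_reg, tmp),         # executed by the integer unit
            ("store", tmp, mem_addr)]          # store unit, data forwarded to it

def translate_with_stalu(macro):
    op, src_reg, mem_addr = macro
    return [("stalu", op, src_reg, mem_addr)]  # store unit's ALU does both

macro = ("inc", "eax", 0x2000)
print(len(translate_conventional(macro)))  # 2
print(len(translate_with_stalu(macro)))    # 1
```

The single-microinstruction translation is what eliminates the integer-unit-to-store-unit forwarding step in Fig. 12.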
Figure 12 is a timing diagram illustrating the effect described in one embodiment of the invention. Three clock cycles are shown; the left side of the figure depicts the pipeline stages, separated by registers, of a conventional microprocessor, and the right side depicts the pipeline stages, separated by registers, of the superscalar pipelined microprocessor 100 of one embodiment of the invention. The example of Fig. 12 assumes that the store unit 126 comprises a single pipeline stage; it should be noted, however, that in other embodiments the store unit 126 may have a different number of pipeline stages. The example of Fig. 12 also assumes that the integer arithmetic logic unit of the conventional microprocessor comprises a single stage.
In the conventional microprocessor, an alu microinstruction proceeds through the pipeline stage of the integer unit that performs the specified ALU operation, producing a result during clock cycle 1. The result is then forwarded from the integer unit to the store unit over the forwarding buses, and the store unit receives the result as its store data during clock cycle 2. Finally, during clock cycle 3, the store data is written to the store buffer.
In the superscalar pipelined microprocessor 100 of Fig. 1, a stalu microinstruction proceeds through the pipeline stage of the store unit 126 during clock cycle 1. In contrast to the conventional microprocessor, during clock cycle 1 the ALU 162 of the store unit 126 performs the ALU operation specified by the stalu microinstruction on the store data 946 (and, when specified, on the second operand 952) to produce the ALU result 156. During clock cycle 2, the ALU result 156 produced by the store unit 126 is written to the store buffer. Thus, the superscalar pipelined microprocessor 100 of Fig. 1 produces the ALU result 156, and makes it available to the store buffer and to other instructions, at least one clock cycle earlier than the conventional microprocessor. Moreover, as noted above, as the distance a signal must travel between the different execution units over the forwarding buses increases, and the resistance-capacitance time constant of the signal lines increases, that is, as the forwarding delay grows, the time saved by the present invention grows accordingly.
It should be noted that, although in the embodiment of Fig. 10 the macroinstruction 132 specifies a general-purpose register 112 holding an operand upon which an ALU operation is performed to produce a result that is written to memory, the instruction translator 104 may also generate a stalu microinstruction, along with other microinstructions (including ldalu microinstructions), to implement other macroinstructions. For example, some macroinstructions 132 specify a read-modify-write type operation on a memory operand; that is, the macroinstruction specifies an ALU operation and a memory address, the memory address is the address of the operand upon which the ALU operation is performed, and the result is written back to that same memory address. For such a macroinstruction, the instruction translator 104 of the present invention may generate a conventional load microinstruction followed by a stalu microinstruction, or an ldalu microinstruction followed by a conventional store microinstruction.
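The two translation choices for a read-modify-write macroinstruction can be sketched as follows. The x86 instruction `not byte ptr [mem]` is one real example of such a macroinstruction; the tuple encoding and the temporary-register name here are illustrative:

```python
# Sketch of the two ways to translate a read-modify-write macroinstruction,
# e.g. a memory-operand NOT: either load + stalu, or ldalu + store.

def translate_rmw_load_then_stalu(op, mem_addr):
    return [("load", mem_addr, "tmp0"),        # conventional load microinstruction
            ("stalu", op, "tmp0", mem_addr)]   # ALU op performed in the store unit

def translate_rmw_ldalu_then_store(op, mem_addr):
    return [("ldalu", op, mem_addr, "tmp0"),   # ALU op performed in the load unit
            ("store", "tmp0", mem_addr)]       # conventional store microinstruction

for seq in (translate_rmw_load_then_stalu("not", 0x3000),
            translate_rmw_ldalu_then_store("not", 0x3000)):
    assert len(seq) == 2                       # either way, only two microinstructions
print("both translations use two microinstructions")
```

A purely conventional translator would need three microinstructions (load, alu, store) for the same macroinstruction; folding the ALU operation into either memory unit removes one.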
Another advantage of the present invention is that by merging a load microinstruction and an alu microinstruction into a single ldalu microinstruction (or an alu microinstruction and a store microinstruction into a single stalu microinstruction), only one instruction slot, rather than two, is consumed within the superscalar pipelined microprocessor 100. For example, an ldalu (or stalu) microinstruction occupies only one entry each in the register alias table 116, the reservation stations 108 and the reorder buffer 114, rather than the two entries each that a load-and-alu (or alu-and-store) microinstruction sequence would occupy. In particular, by freeing up space in the reorder buffer 114 for additional microinstructions, the ldalu microinstruction may create a larger pool, or window, of microinstructions from which to issue instructions to the execution units 122/124/126, thereby potentially increasing the lookahead capability of the superscalar pipelined microprocessor 100, better exploiting the instruction-level parallelism of the executing program, and improving the utilization of the execution units 122/124/126. Furthermore, a single ldalu microinstruction generates only two accesses to the general-purpose registers 112 (a read of the source operand and a write of the result), whereas a load-and-alu microinstruction sequence generates four. The present invention may therefore reduce congestion at the general-purpose registers 112, and may enable the superscalar pipelined microprocessor 100 to be designed with a smaller, faster, lower-power and less complex general-purpose register set 112. Finally, the number of microinstructions the instruction translator 104 can generate per clock cycle is limited (in one embodiment the limit is three; in another, four). According to one embodiment, in order to reduce the complexity of the instruction translator 104, the instruction translator 104 must generate all of the microinstructions required to implement a given macroinstruction within the same clock cycle, which means that in some clock cycles a portion of the limited number of instruction slots goes empty. Enabling the instruction translator 104 to generate fewer microinstructions to implement some macroinstructions therefore allows it to make fuller use of the limited instruction slots and to translate macroinstructions at a faster rate.
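The resource accounting behind the single-slot advantage can be sketched numerically. The per-microinstruction entry and register-access counts below follow the paragraph's assumptions (one RAT/RS/ROB entry per microinstruction; two register accesses for a load or alu microinstruction, one read and one write):

```python
# Sketch of the instruction-slot and register-port accounting for a fused
# ldalu microinstruction versus a load + alu microinstruction sequence.

GPR_ACCESSES = {"ldalu": 2,   # read source operand + write result (assumed)
                "load": 2,    # read address operand + write destination
                "alu": 2,     # read source + write destination
                "stalu": 1,   # read store data, no register write
                "store": 1}

def resources(uop_sequence):
    slots = len(uop_sequence)  # one RAT / reservation-station / ROB entry each
    return slots, sum(GPR_ACCESSES[u] for u in uop_sequence)

print(resources(["ldalu"]))         # (1, 2)
print(resources(["load", "alu"]))   # (2, 4)
```

Halving both the occupied slots and the register-file traffic is the mechanism by which the fused microinstruction enlarges the effective out-of-order window and relieves register-port pressure.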
Although the embodiments described above pertain to a microprocessor of the x86 architecture, the present invention is not limited to x86 microprocessors. Rather, the concept of incorporating one or more ALUs into the load and/or store unit pipelines of a superscalar pipelined microprocessor may be applied to microprocessors of other architectures.
In addition, although in the embodiments described above the instruction translator generates an ldalu microinstruction (or a stalu microinstruction), for example at block 302 of Fig. 3, in response to a complex macroinstruction that requires a read from memory followed by an ALU operation (or an ALU operation followed by a write to memory), in other embodiments the instruction translator may recognize a sequence of macroinstructions: a first macroinstruction that moves an operand from memory into a register, followed by a second macroinstruction that performs an ALU operation on the operand in that register and writes it to a destination register (or a first macroinstruction that performs an ALU operation on an operand in a register and writes the operand to a destination register, followed by a second macroinstruction that moves the operand from the destination register to memory). The instruction translator merges the two macroinstructions into a single ldalu microinstruction that directs the load unit to perform the ALU operation on the load data before writing the load data to the destination register (or into a single stalu microinstruction that directs the store unit to perform the ALU operation on the store data before writing the store data to memory), thereby avoiding the result-forwarding delay. In other words, the ldalu and stalu microinstructions may be employed to advantage in various situations, not merely in response to the translation of a single macroinstruction. In another example embodiment, the superscalar pipelined microprocessor 100 includes a microcode unit and a microsequencer; the microcode instructions of the microcode routines included in the microcode unit are stored in a microcode memory, and the microsequencer sequences the microinstructions into the pipeline of the superscalar pipelined microprocessor 100. The microcode may be employed by the instruction translator 104, for example, to implement complex macroinstructions or to perform other functions, such as built-in self-test (BIST) or other initialization of the superscalar pipelined microprocessor 100. The microcode may advantageously employ ldalu and stalu microinstructions wherever appropriate to reduce program execution time on the superscalar pipelined microprocessor 100 and/or program code size.
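The alternative embodiment, in which the translator recognizes a two-macroinstruction sequence rather than a single complex macroinstruction, amounts to a peephole fusion pass. The pattern matching and tuple encoding below are an illustrative sketch, not the patent's actual mechanism:

```python
# Sketch of a translator peephole that fuses "mov reg, [mem]" followed by a
# dependent ALU macroinstruction into a single ldalu microinstruction.

def fuse(macros):
    out, i = [], 0
    while i < len(macros):
        cur = macros[i]
        nxt = macros[i + 1] if i + 1 < len(macros) else None
        if (cur[0] == "mov_from_mem" and nxt is not None
                and nxt[0] == "alu" and nxt[2] == cur[1]):
            # ("mov_from_mem", dst_reg, addr) + ("alu", op, src_reg, dst_reg)
            # fuse into ("ldalu", op, addr, final_dst)
            out.append(("ldalu", nxt[1], cur[2], nxt[3]))
            i += 2
        else:
            out.append(cur)     # no fusion opportunity; emit unchanged
            i += 1
    return out

seq = [("mov_from_mem", "ebx", 0x4000), ("alu", "inc", "ebx", "ecx")]
print(fuse(seq))  # [('ldalu', 'inc', 16384, 'ecx')]
```

The dependence check (`nxt[2] == cur[1]`, the ALU's source register is the load's destination) is what licenses the fusion, since the ALU operation can then run on the load data before it is ever written to a register.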
In addition, the embodiments described above assume that the ALU located in the load or store unit requires less than one clock cycle to perform its ALU operation (that is, the ALU operation is performed within the clock cycle corresponding to the last pipeline stage of the load or store unit), so that all load/store instructions require the same number of clock cycles to execute, whether they are regular load/store instructions or ALU-merged load/store instructions. Nevertheless, in other embodiments the time consumed by the ALU operation may exceed the time available in the last pipeline stage of the load or store unit. Depending upon the complexity of the ALU operation, this may cause an ALU-merged load/store instruction to consume more clock cycles than a conventional load/store instruction, and/or cause some ALU-merged load/store instructions to consume more clock cycles than others. In such an embodiment, the instruction scheduler in the reservation stations must account for the variable number of clock cycles required to execute a load/store instruction.
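The scheduler consideration in this variable-latency embodiment can be sketched as a latency table consulted when waking up dependent instructions. The latency values are hypothetical:

```python
# Sketch of a reservation-station scheduler accounting for variable-latency
# ALU-merged loads (latencies in cycles are illustrative assumptions).

LATENCY = {"load": 4,            # regular load: four pipeline stages
           "ldalu_simple": 4,    # simple ALU op fits inside stage D
           "ldalu_complex": 5}   # complex ALU op spills into an extra cycle

def earliest_dependent_issue(uop, issue_cycle):
    """Cycle at which an instruction consuming this uop's result may issue."""
    return issue_cycle + LATENCY[uop]

print(earliest_dependent_issue("ldalu_simple", 1))   # 5
print(earliest_dependent_issue("ldalu_complex", 1))  # 6
```

In the fixed-latency embodiments of the main description, all three entries would be equal and the scheduler could treat every load uniformly; the table only becomes necessary once the ALU operation may overflow the last pipeline stage.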
While the present invention has been disclosed above by way of various embodiments, they are offered by way of example and reference, not of limitation, and those skilled in the art may make changes and modifications without departing from the spirit and scope of the invention. For instance, software can realize the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods of the present invention. Such software may be implemented in a general-purpose programming language (e.g., C, C++), a hardware description language (e.g., Verilog HDL, VHDL) or any other available program, and may be disposed in any known computer-usable medium, for example semiconductor, magnetic disk or optical disc (e.g., Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc Read-Only Memory (DVD-ROM)). Embodiments of the apparatus and methods of the present invention may be included in a semiconductor intellectual property core, such as a microprocessor core (e.g., embodied in a hardware description language), and transformed into hardware in the production of integrated circuits. Additionally, the apparatus and methods of the present invention may be embodied as a combination of hardware and software. Thus, the embodiments described above are not intended to limit the scope of the present invention; rather, the scope of protection of the present invention is defined by the appended claims. In particular, the present invention may be implemented within a microprocessor device that may be used in a general-purpose computer. Finally, those skilled in the art should appreciate that, without departing from the spirit and scope of the present invention, they can readily use the disclosed embodiments and concepts as a basis for designing other architectures that accomplish the same purposes of the present invention.

Claims (18)

1. A superscalar pipelined microprocessor, comprising:
a register set, defined by an instruction set architecture of the superscalar pipelined microprocessor;
a cache memory;
a plurality of execution units; and
a store unit, coupled to the cache memory, wherein the store unit is distinct from the other execution units of the superscalar pipelined microprocessor and comprises an arithmetic logic unit (ALU);
wherein the store unit is configured to receive a first instruction, the first instruction specifying a first source register of the register set and a first operation to be performed on a first source operand to generate a result;
wherein the store unit is configured to read the first source operand from the first source register and to provide the first source operand directly to the ALU, rather than forwarding the first source operand to any of the other execution units;
wherein the ALU is configured to perform the first operation on the first source operand to generate the result; and
wherein the store unit is further configured to write the result to the cache memory.
2. The superscalar pipelined microprocessor of claim 1, wherein the store unit is configured to write the result to the cache memory indirectly via a store buffer.
3. The superscalar pipelined microprocessor of claim 1, wherein the store unit is further configured to receive a second instruction, the second instruction specifying a second source register of the register set from which a second source operand is received, without specifying a second operation to be performed on the second source operand, and wherein the store unit executes the first instruction and the second instruction in the same number of clock cycles.
4. The superscalar pipelined microprocessor of claim 1, wherein none of the other execution units of the superscalar pipelined microprocessor is configured to write to the cache memory.
5. The superscalar pipelined microprocessor of claim 1, wherein at least one of the other execution units has an ALU configured to perform the first operation specified by the first instruction, and wherein the store unit does not forward the first source operand to any of the at least one of the other execution units to perform the first operation on the first source operand to generate the result.
6. The superscalar pipelined microprocessor of claim 1, wherein the store unit is configured to execute all instructions that write to the cache memory, and none of the other execution units is configured to execute instructions that write to the cache memory.
7. The superscalar pipelined microprocessor of claim 1, wherein the first instruction further specifies a second source operand, wherein the first operation is performed on the first source operand and the second source operand to generate the result, and wherein the second source operand is provided to the store unit by a register of the register set.
8. The superscalar pipelined microprocessor of claim 1, wherein the store unit requires only a single access of the register set to execute the first instruction.
9. The superscalar pipelined microprocessor of claim 1, further comprising:
an instruction translator, configured to translate a first macroinstruction into the first instruction executed by the store unit, wherein the first macroinstruction is defined by the instruction set architecture.
10. The superscalar pipelined microprocessor of claim 9, wherein the instruction translator is further configured to translate a second macroinstruction, defined by the instruction set architecture, into a pair of instructions comprising the first instruction and a second instruction, wherein the second instruction is executed by one of the other execution units, which loads the first source operand from the cache memory into the first source register, and wherein the first instruction reads the first source operand from the first source register.
11. The superscalar pipelined microprocessor of claim 1, wherein the first operation comprises at least one of the following operations:
a zero-extend operation, which zero-extends the first source operand from its own size to the size of its destination in the cache memory;
a Boolean NOT operation, which inverts each bit of the first source operand;
a negate operation, which generates the two's complement negative of the first source operand;
an increment operation, which increments the first source operand;
a decrement operation, which decrements the first source operand;
a sign-extend operation, which sign-extends the first source operand;
a zero-detect operation, which generates a true result when the first source operand is zero and a false result when the first source operand is non-zero;
a ones-detect operation, which generates a true result when all bits of the first source operand are binary one and a false result otherwise;
a data format conversion operation, which formats the first source operand into a data format specified by the first instruction that is different from the data format the first source operand had when it was read from the cache memory;
a Boolean operation, wherein the ALU performs the Boolean operation on the first source operand and a second source operand to generate the result; and
an arithmetic operation, wherein the ALU performs the arithmetic operation on the first source operand and a second source operand to generate the result.
12. An instruction processing method, applicable to a superscalar pipelined microprocessor, the superscalar pipelined microprocessor having a register set defined by an instruction set architecture of the superscalar pipelined microprocessor, a cache memory, a plurality of execution units, and a store unit that is distinct from the other execution units of the superscalar pipelined microprocessor and comprises an ALU, the instruction processing method comprising:
receiving, by the store unit, a first instruction, wherein the first instruction specifies a first source register of the register set and a first operation to be performed on a first source operand to generate a result;
reading, by the store unit, the first source operand from the first source register and providing the first source operand directly to the ALU;
performing, by the ALU of the store unit, the first operation on the first source operand to generate the result, rather than forwarding the first source operand to any of the other execution units to perform the first operation on the first source operand to generate the result; and
writing, by the store unit, the result to the cache memory.
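The method of claim 12 can be illustrated with a minimal behavioral sketch: the store unit itself reads the source operand, applies the specified operation with its internal ALU, and writes the result to the cache, with no forwarding to another execution unit. All class, method, and register names below are illustrative assumptions, not taken from the patent.

```python
# Behavioral sketch of claim 12 (names are hypothetical, for illustration only).
class StoreUnitWithALU:
    def __init__(self, register_set, cache):
        self.registers = register_set  # register set defined by the ISA
        self.cache = cache             # cache modeled as a dict: address -> value

    def execute(self, src_reg, operation, address):
        operand = self.registers[src_reg]  # read the first source operand
        result = operation(operand)        # internal ALU performs the first operation
        self.cache[address] = result       # store unit writes the result to the cache
        return result

registers = {"r1": 0x00FF}
cache = {}
store_unit = StoreUnitWithALU(registers, cache)
# e.g. an increment operation fused into the store (16-bit wraparound assumed)
store_unit.execute("r1", lambda x: (x + 1) & 0xFFFF, address=0x100)
print(hex(cache[0x100]))  # 0x100
```

The point of the sketch is the data path: the operand never leaves the store unit between the register read and the cache write.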
13. The instruction processing method as claimed in claim 12, wherein the store unit writes the result to the cache memory indirectly through a store buffer.
14. The instruction processing method as claimed in claim 12, wherein the store unit further receives a second instruction, the second instruction specifying a second source register of the register set from which a second source operand is received, wherein receiving the second source operand does not require specifying a second operation to be performed on the second source operand, and wherein the store unit executes the first instruction and the second instruction in an equal number of clock cycles.
15. The instruction processing method as claimed in claim 14, wherein the first operation is performed on the first source operand and the second source operand to generate the result.
16. The instruction processing method as claimed in claim 12, further comprising:
translating a first macroinstruction into the first instruction executed by the store unit, wherein the first macroinstruction is defined by the instruction set architecture.
17. The instruction processing method as claimed in claim 16, further comprising:
translating a second macroinstruction defined by the instruction set architecture into a pair of instructions comprising the first instruction and a second instruction, wherein the second instruction is executed by one of the other execution units, the one of the other execution units loads the first source operand from the cache memory into the first source register, and the first instruction reads the first source operand from the first source register.
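Claim 17 describes splitting one read-modify-write macroinstruction into two microinstructions: a load executed by another execution unit, then an ALU-store executed by the store unit. A minimal sketch of that translation step follows; the tuple encoding and operation names are assumptions for illustration, not the patent's actual microinstruction format.

```python
# Illustrative sketch of claim 17's translation: one macroinstruction becomes
# a (load, alu_store) microinstruction pair. Encodings are hypothetical.
def translate(macro_op, mem_addr, temp_reg):
    """Translate a read-modify-write macroinstruction into two microinstructions."""
    load_uop = ("load", temp_reg, mem_addr)                   # other unit: cache -> register
    store_uop = ("alu_store", macro_op, temp_reg, mem_addr)   # store unit: op + store
    return [load_uop, store_uop]

# e.g. a memory-destination NOT, split into a load followed by an ALU-store
uops = translate("not", mem_addr=0x200, temp_reg="t0")
print(uops)  # [('load', 't0', 512), ('alu_store', 'not', 't0', 512)]
```

The load writes the first source register; the ALU-store then reads it, applies the operation, and stores the result back, matching the two-instruction pair the claim recites.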
18. The instruction processing method as claimed in claim 17, wherein the first operation comprises at least one of the following operations:
a zero-extend operation, a Boolean NOT operation, a NAND operation, an increment operation, a decrement operation, a sign-extend operation, a zero-detect operation, a ones-detect operation, a data format conversion operation, a Boolean operation, and an arithmetic operation;
wherein the ALU performs the Boolean operation on the first source operand and a second source operand to generate the result; and
wherein the ALU performs the arithmetic operation on the first source operand and a second source operand to generate the result.
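The operation repertoire listed in claim 18 is simple enough to sketch as a table of bit-level functions. The sketch below assumes an 8-bit datapath and a 4-bit source field for sign extension; both widths, and the dictionary encoding itself, are illustrative assumptions rather than anything specified by the patent.

```python
# Hypothetical 8-bit sketch of the claim-18 operation set for a store-unit ALU.
WIDTH = 8
MASK = (1 << WIDTH) - 1

ALU_OPS = {
    "zero_extend": lambda a, b=0: a & MASK,            # zero-extend to WIDTH bits
    "boolean_not": lambda a, b=0: ~a & MASK,           # Boolean NOT (one's complement)
    "nand":        lambda a, b: ~(a & b) & MASK,       # NAND of two source operands
    "increment":   lambda a, b=0: (a + 1) & MASK,      # wraps at 2**WIDTH
    "decrement":   lambda a, b=0: (a - 1) & MASK,
    "sign_extend": lambda a, b=0: (a | 0xF0) if a & 0x08 else a,  # from 4 bits (assumed)
    "detect_zero": lambda a, b=0: int(a == 0),         # 1 if all bits clear
    "detect_ones": lambda a, b=0: int(a == MASK),      # 1 if all bits set
}

print(hex(ALU_OPS["nand"](0xF0, 0xFF)))      # 0xf
print(hex(ALU_OPS["increment"](0xFF)))       # 0x0
print(hex(ALU_OPS["sign_extend"](0x0C)))     # 0xfc
print(ALU_OPS["detect_ones"](0xFF))          # 1
```

Each entry is a pure function of one or two source operands, which is what lets the claim fold these operations into the store unit without forwarding operands elsewhere.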
CN201010243151.1A 2009-08-07 2010-07-28 Instruction processing method and super-pure pipeline microprocessor Active CN101944012B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US23225409P 2009-08-07 2009-08-07
US61/232,254 2009-08-07
US12/609,193 2009-10-30
US12/609,193 US9952875B2 (en) 2009-08-07 2009-10-30 Microprocessor with ALU integrated into store unit

Publications (2)

Publication Number Publication Date
CN101944012A CN101944012A (en) 2011-01-12
CN101944012B true CN101944012B (en) 2014-04-23

Family

ID=43436014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010243151.1A Active CN101944012B (en) 2009-08-07 2010-07-28 Instruction processing method and super-pure pipeline microprocessor

Country Status (1)

Country Link
CN (1) CN101944012B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106485321B (en) * 2015-10-08 2019-02-12 上海兆芯集成电路有限公司 Processor with framework neural network execution unit
US10438115B2 (en) * 2016-12-01 2019-10-08 Via Alliance Semiconductor Co., Ltd. Neural network unit with memory layout to perform efficient 3-dimensional convolutions
KR20190086669A (en) * 2016-12-12 2019-07-23 인텔 코포레이션 Devices and methods for processor architecture

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1514345A (en) * 2003-02-11 2004-07-21 智慧第一公司 Device and method used for reducing continuous bit correlation in random number producer
CN101329622A (en) * 2008-02-08 2008-12-24 威盛电子股份有限公司 Microprocessor and method for implementing macro instructions

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7415597B2 (en) * 2004-09-08 2008-08-19 Advanced Micro Devices, Inc. Processor with dependence mechanism to predict whether a load is dependent on older store


Also Published As

Publication number Publication date
CN101944012A (en) 2011-01-12

Similar Documents

Publication Publication Date Title
TWI423127B (en) Instruction process methods, and superscalar pipelined microprocessors
US9495159B2 (en) Two level re-order buffer
US6675376B2 (en) System and method for fusing instructions
TWI599949B (en) Method and apparatus for implementing a dynamic out-of-order processor pipeline
US6105129A (en) Converting register data from a first format type to a second format type if a second type instruction consumes data produced by a first type instruction
CN107077321B (en) Instruction and logic to perform fused single cycle increment-compare-jump
US8769539B2 (en) Scheduling scheme for load/store operations
TWI567751B (en) Multiple register memory access instructions, processors, methods, and systems
US6393555B1 (en) Rapid execution of FCMOV following FCOMI by storing comparison result in temporary register in floating point unit
CN114356417A (en) System and method for implementing 16-bit floating-point matrix dot-product instruction
TWI506539B (en) Method and apparatus for decimal floating-point data logical extraction
Furber et al. AMULET3: A high-performance self-timed ARM microprocessor
KR102478874B1 (en) Method and apparatus for implementing and maintaining a stack of predicate values with stack synchronization instructions in an out of order hardware software co-designed processor
EP3166014B1 (en) Processors supporting endian agnostic simd instructions and methods
CN101907984B (en) Command processing method and its applicable super-scale pipeline microprocessor
US9652234B2 (en) Instruction and logic to control transfer in a partial binary translation system
JP2003523573A (en) System and method for reducing write traffic in a processor
JP2014182817A (en) Converting conditional short forward branches to computationally equivalent predicated instructions
US9626185B2 (en) IT instruction pre-decode
KR20170036036A (en) Instruction and logic for a vector format for processing computations
CN104133748A (en) Method and system to combine corresponding half word units from multiple register units within a microprocessor
CN101944012B (en) Instruction processing method and super-pure pipeline microprocessor
JPS6014338A (en) Branch mechanism for computer system
Shum et al. Design and microarchitecture of the IBM System z10 microprocessor
CN104615408A (en) Microprocessor, integrated circuit, computer program product, and method for providing microcode instruction storage

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant