CN100578444C - CPU architecture of enhancing transfer prediction - Google Patents


Info

Publication number
CN100578444C
CN100578444C CN 200510088652 CN200510088652A
Authority
CN
Grant status
Grant
Patent type
Prior art keywords
instruction
prediction
cpu
branch
execution
Prior art date
Application number
CN 200510088652
Other languages
Chinese (zh)
Other versions
CN1904822A (en)
Inventor
郭建成
Original Assignee
辉达公司
Priority date
Filing date
Publication date
Grant date

Links

Abstract

The invention relates to a CPU architecture that enhances branch prediction. The architecture includes at least two pipelines, each of which has a plurality of stages. When a predicted instruction in one branch fails, the other pipeline executes the correct instruction, so the number of instruction executions is effectively reduced.

Description

CPU Architecture with Enhanced Branch Prediction

Technical Field

The present invention relates to a CPU architecture with enhanced branch prediction, and in particular to one in which at least two pipelines are provided inside a CPU, each pipeline having a plurality of stages. When a predicted instruction in one branch fails, the correct instruction can be executed via the other branch, so the number of executed instructions is effectively reduced, and the effect is even greater in pipelines with more stages.

Background

The integrated circuit design (IC design) industry represents a major contribution of this country to intellectual property. Its products are the subject of invention patents and demand a high degree of creativity; without top-tier talent engaged in research the industry cannot develop smoothly, and it is a field that developing or undeveloped countries cannot easily enter. IC design therefore occupies a pivotal position in the competitiveness of this country's electronics industry.

In the field of computer central processing units (CPUs), the pipeline is the principal architecture of high-speed CPUs, because it is impossible to complete within a single cycle every instruction that the CPU must execute. An instruction to be executed must therefore be divided into multiple stages, each of which can be executed in one cycle; a high-speed CPU thus requires multi-stage processing to obtain better efficiency.

However, the problem of branch prediction arises when a pipeline is provided in the CPU. Suppose a program is set up as follows:

If the predicted state is A, one execution path is program B and the other is program C.

If the CPU is to execute program B or C, it must wait for state A to produce a result. A pipelined CPU, however, cannot simply wait for the result of state A before feeding programs B and C into the multi-stage pipeline. The correct approach is for the pipelined CPU to predict the result of A and fetch the instructions of either B or C into the pipeline; otherwise the CPU with its multi-stage pipeline sits idle waiting for state A. If both the predicted state A and the fetched program B are correct, the CPU quickly obtains the correct result; but if the predicted state A is wrong, the CPU must discard its result, stop program B, and then pipeline and execute program C instead, wasting much processing time. Although this known CPU architecture is high-speed, it cannot raise its computational efficiency; this is the problem the present invention seeks to solve.

Summary of the Invention

To overcome the above drawbacks of the known art, the present invention provides a CPU architecture with enhanced branch prediction. The main object of the present invention is to provide at least two pipelines inside a CPU, each pipeline having a plurality of stages, so that when a predicted instruction in one branch fails, the correct instruction can be executed via the other branch. The number of executed instructions is thereby effectively reduced, and the effect is even greater in pipelines with more stages.

Another object of the present invention is that the identically functioning stages of the two or more pipelines can be merged into a combined architecture according to design requirements, reducing the cost of the CPU and the complexity of its design.

The present invention adopts the following technical solution: a method of executing instructions on a CPU architecture with enhanced branch prediction. The architecture comprises a memory (1), a memory controller (2), and a CPU (3) connected to them. The CPU internally comprises a cache (31), at least two pipelines A and B, and four stages, wherein pipeline A comprises two stages, A-fetch (32) and A-decode (33); pipeline B comprises B-fetch (34) and B-decode (35); and the stages shared by the two pipelines comprise execute (36) and write-back (37). The method is characterized in that, if the branch prediction result is correct, and the CPU normally uses pipeline A to execute instructions, then:

in cycle 0, A-fetch performs the fetch of "CMP R1, R2", while B-fetch performs no fetch;

in cycle 1, A-decode decodes "CMP R1, R2" while A-fetch fetches "JB process_A"; B-decode performs no operation;

in cycle 2, "CMP R1, R2" executes, while A-fetch fetches "MOV R3, R1" and A-decode decodes "JB process_A";

in cycle 3, "CMP R1, R2" performs write-back, "JB process_A" executes, A-decode decodes "MOV R3, R1" while A-fetch performs no fetch, and B-decode performs no decode but B-fetch fetches "MOV R3, R2";

in cycle 4, "JB process_A" performs write-back, "MOV R3, R1" executes, and B-decode decodes "MOV R3, R2"; no fetch is performed by either pipeline;

in cycle 5, "MOV R3, R1" performs write-back; no execute, decode, or fetch operation is performed.

The method is further characterized in that, if the branch prediction result turns out to be wrong, then:

in cycle 0, A-fetch performs the fetch of "CMP R1, R2", while B-fetch performs no fetch;

in cycle 1, A-decode decodes "CMP R1, R2" while A-fetch fetches "JB process_A"; B-decode performs no operation;

in cycle 2, "CMP R1, R2" executes, A-fetch fetches "MOV R3, R1", and A-decode decodes "JB process_A"; pipeline B performs no operation;

in cycle 3, "CMP R1, R2" performs write-back while "JB process_A" executes; A-decode decodes "MOV R3, R1", B-decode performs no decode, A-fetch performs no fetch, and B-fetch fetches "MOV R3, R2";

in cycle 4, "JB process_A" performs write-back and B-decode decodes "MOV R3, R2"; A-decode does not decode, and neither A-fetch nor B-fetch performs a fetch; in cycle 5, only "MOV R3, R2" executes; in cycle 6, only "MOV R3, R2" performs write-back. The present invention is further described below in conjunction with the accompanying drawings and the detailed description.
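As a check on the counts above, the wrong-prediction schedule can be tabulated cycle by cycle. The following Python sketch is an illustrative reconstruction only (the stage and instruction labels are taken from the description; it is not part of the patent):

```python
# Cycle-by-cycle table of the dual-pipeline, wrong-prediction case.
# Keys are cycle numbers; values map each active stage to its instruction.
schedule = {
    0: {"A-fetch": "CMP R1,R2"},
    1: {"A-decode": "CMP R1,R2", "A-fetch": "JB process_A"},
    2: {"execute": "CMP R1,R2", "A-fetch": "MOV R3,R1",
        "A-decode": "JB process_A"},
    3: {"writeback": "CMP R1,R2", "execute": "JB process_A",
        "A-decode": "MOV R3,R1", "B-fetch": "MOV R3,R2"},
    4: {"writeback": "JB process_A", "B-decode": "MOV R3,R2"},
    5: {"execute": "MOV R3,R2"},
    6: {"writeback": "MOV R3,R2"},
}
print(len(schedule))  # 7 cycles in total (cycle 0 through cycle 6)
```

The table makes visible why the method needs only seven cycles: the not-taken-path instruction "MOV R3, R2" is already in B-fetch at cycle 3, before the branch outcome is known.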

Brief Description of the Drawings

FIG. 1 is a functional block diagram of a CPU architecture equipped with multiple pipelines according to the present invention;

FIG. 2 is a first flow chart of executing several instructions with a single-pipeline architecture;

FIG. 3 is a second flow chart of executing several instructions with a single-pipeline architecture; FIG. 4 is a first flow chart of executing several instructions with the architecture of FIG. 1; FIG. 5 is a second flow chart of executing several instructions with the architecture of FIG. 1. Description of reference numerals: 1 memory; 2 memory controller; 3 CPU; 31 cache; 32 A-fetch; 33 A-decode; 34 B-fetch; 35 B-decode; 36 execute; 37 write-back.

Detailed Description

The detailed structure of the present invention and the relationships among its parts are described below with reference to the accompanying drawings, so that each may be understood.

Referring to FIG. 1, which is a functional block diagram of a CPU architecture equipped with multiple pipelines according to the present invention, the architecture includes a memory 1, a memory controller 2, and a CPU 3 connected to them. The CPU 3 internally includes a cache 31, two pipelines, and four stages. Pipeline A includes two stages, A-fetch 32 and A-decode 33; pipeline B includes two stages, B-fetch 34 and B-decode 35; and the stages used in common by the two pipelines include execute 36 and write-back 37. Of course, more pipelines and stages may be provided for different CPU designs.

Referring to FIG. 2, which is a first flow chart of executing several instructions with a single-pipeline architecture, suppose the CPU is executing the following instructions:

(1) CMP R1, R2 (compare the values of R1 and R2)

(2) JB process_A (if R1 is greater than R2, jump to process A)

(3) MOV R3, R2 (move the value of R2 into R3)

Process A:

(4) MOV R3, R1 (move the value of R1 into R3)

If a CPU equipped with a single pipeline executes these instructions with the branch prediction that the value of R1 is greater than that of R2, and in reality R1 is indeed greater than R2, the execution proceeds as shown in the chart:

Each instruction must pass through four stages, in order: fetch, decode, execute, and write-back, before all operations are complete.

In cycle 0, the instruction "CMP R1, R2" is fetched. In cycle 1, "CMP R1, R2" is decoded while the next instruction, "JB process_A", is fetched. In cycle 2, "CMP R1, R2" executes, "JB process_A" is decoded, and the next instruction, "MOV R3, R1", is fetched (because the value of R1 is indeed greater than that of R2, process A is taken). In cycle 3, "CMP R1, R2" performs write-back, "JB process_A" executes, and "MOV R3, R1" is decoded. In cycle 4, "JB process_A" performs write-back and "MOV R3, R1" executes. In cycle 5, "MOV R3, R1" performs write-back.

As the chart shows, when the branch prediction is correct, only six cycles (cycle 0 through cycle 5) are needed to complete these instructions.
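The six-cycle figure follows directly from the pipeline shape. A minimal sketch (illustrative only, not from the patent) that schedules each instruction one stage per cycle, one cycle behind its predecessor:

```python
# Illustrative timing of FIG. 2: each instruction advances one stage per
# cycle and enters fetch one cycle after its predecessor, with no stalls.
instrs = ["CMP R1,R2", "JB process_A", "MOV R3,R1"]
stages = ["fetch", "decode", "execute", "writeback"]

schedule = {}  # (cycle, stage) -> instruction
for i, instr in enumerate(instrs):
    for s, stage in enumerate(stages):
        schedule[(i + s, stage)] = instr

total_cycles = 1 + max(cycle for cycle, _ in schedule)
print(total_cycles)  # 6 (cycles 0 through 5)
```

With three instructions and four stages, the last write-back lands in cycle 5, giving six cycles in total.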

Referring to FIG. 3, which is a second flow chart of executing several instructions with a single-pipeline architecture, suppose the CPU is executing the following instructions:

(1) CMP R1, R2 (compare the values of R1 and R2)

(2) JB process_A (if R1 is greater than R2, jump to process A)

(3) MOV R3, R2 (move the value of R2 into R3)

Process A:

(4) MOV R3, R1 (move the value of R1 into R3)

If a CPU equipped with a single pipeline executes these instructions with the branch prediction that the value of R1 is greater than that of R2, but in reality R1 is not greater than R2, the branch prediction is wrong, and the execution proceeds as shown in the chart:

In cycle 0, the instruction "CMP R1, R2" is fetched. In cycle 1, "CMP R1, R2" is decoded while the next instruction, "JB process_A", is fetched. In cycle 2, "CMP R1, R2" executes, "JB process_A" is decoded, and the next instruction, "MOV R3, R1", is fetched. In cycle 3, "CMP R1, R2" performs write-back, "JB process_A" executes, and "MOV R3, R1" is decoded. In cycle 4, "JB process_A" performs write-back; "MOV R3, R1" does not match the actual outcome and therefore cannot execute, so the instruction "MOV R3, R2" is fetched anew. In cycle 5, "MOV R3, R2" is decoded. In cycle 6, "MOV R3, R2" executes. In cycle 7, "MOV R3, R2" performs write-back. When the branch prediction does not match the actual outcome, eight cycles must be spent to complete the instructions, because when the CPU executes the jump to process A in the third cycle the result is wrong: the next instruction should have been "move the value of R2 into R3", not process A's "move the value of R1 into R3".

Only if the CPU's prediction had been true could the fetched "move the value of R1 into R3" proceed through the pipeline stages. The CPU must therefore discard the "MOV R3, R1" instruction and re-fetch the "MOV R3, R2" instruction, wasting two cycles in processing these instructions.
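The two wasted cycles can be made explicit with a small sketch (an illustration under the assumptions of the text: the branch resolves in its execute stage, and the correct target is fetched one cycle later):

```python
# Illustrative timing of FIG. 3 (single pipeline, wrong prediction).
STAGES = 4  # fetch, decode, execute, write-back

# Cycle in which each instruction enters the fetch stage.
issue = {"CMP R1,R2": 0, "JB process_A": 1, "MOV R3,R1": 2}  # R3,R1 squashed
# The branch executes in its third stage (cycle 3); the refetch of the
# correct target starts one cycle later.
refetch = issue["JB process_A"] + 3          # cycle 4: fetch "MOV R3,R2"
last_writeback = refetch + STAGES - 1        # cycle 7
total_cycles = last_writeback + 1
print(total_cycles)  # 8, two cycles more than the correct-prediction case
```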

Referring to FIG. 4, which is a first flow chart of executing several instructions with the architecture of FIG. 1, a CPU equipped with two pipelines (pipeline A and pipeline B) executes the above instructions, where pipeline A includes two stages, A-fetch 32 and A-decode 33; pipeline B includes two stages, B-fetch 34 and B-decode 35; and the stages used in common by the two pipelines include execute 36 and write-back 37. Here the branch prediction is correct, and the CPU normally uses pipeline A to execute instructions. In cycle 2, after decoding the "JB process_A" instruction, the CPU learns that another branch exists; in cycle 3 it therefore fetches the instructions following "JB process_A" on the other path into pipeline B. After cycle 3, on executing the "JB process_A" instruction the CPU finds that the prediction is true, so it executes the next valid instruction, "move the value of R1 into R3", in pipeline A and at the same time abandons the instruction in pipeline B. The CPU completes these instructions within six cycles.

Referring to FIG. 5, which is a second flow chart of executing several instructions with the architecture of FIG. 1, a CPU equipped with two pipelines (pipeline A and pipeline B) executes the above instructions, with the same arrangement of stages: pipeline A includes A-fetch 32 and A-decode 33; pipeline B includes B-fetch 34 and B-decode 35; and the shared stages are execute 36 and write-back 37. The branch prediction is that the value of R1 is greater than that of R2, but in reality R1 is not greater than R2, so the branch prediction is wrong. In cycle 2, after decoding the "JB process_A" instruction, the CPU learns that another branch exists; likewise, in cycle 3 it fetches the instructions following "JB process_A" on the other path into pipeline B. After the third cycle, on executing the "JB process_A" instruction the CPU finds that the result is wrong, so the next instruction in pipeline A, "move the value of R1 into R3", is invalid and is abandoned, and the valid instruction in pipeline B is executed instead. Seven cycles are used in total, saving one cycle compared with the flow disclosed in FIG. 3.
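The cycle counts stated for FIGS. 2 through 5 can be collected into a short comparison (illustrative; the numbers are those given in the text, not computed from a full pipeline model):

```python
# Cycle counts stated in the description for the four scenarios.
def total_cycles(dual_pipeline, prediction_correct):
    if prediction_correct:
        return 6                      # FIGS. 2 and 4: six cycles either way
    # Single pipeline must refetch the correct target (FIG. 3); the dual
    # pipeline already holds it in pipeline B (FIG. 5).
    return 7 if dual_pipeline else 8

saved = total_cycles(dual_pipeline=False, prediction_correct=False) \
        - total_cycles(dual_pipeline=True, prediction_correct=False)
print(saved)  # 1: the dual pipeline saves one cycle on a misprediction
```

The advantage appears only on mispredictions; with deeper pipelines the refetch penalty grows, which is why the description claims a greater benefit for multi-stage pipelines.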

Moreover, the identically functioning stages of the two or more pipelines can be merged into a combined architecture according to design requirements, reducing the cost of the CPU and the complexity of its design.

From the disclosure of FIGS. 1 to 5 above, it can be understood that the essence of the present invention is to provide at least two pipelines inside a CPU, each pipeline having a plurality of stages, so that when a predicted instruction in one branch fails, the correct instruction can be executed via the other branch. The number of executed instructions is thereby effectively reduced, with an even greater effect in multi-stage pipelines, and the identically functioning stages of the two or more pipelines can be merged into a combined architecture according to design requirements, reducing the cost of the CPU and the complexity of its design. The above technical means have excellent commercial value in the market, and this patent application is therefore filed to seek the protection of patent rights.

The above are merely preferred embodiments of the present invention and do not limit the scope of the invention as practiced; all equivalent changes and modifications made in accordance with the claims of the present invention shall remain within the scope covered by the claims.

Claims (2)

  1. A pipelined CPU architecture with enhanced branch execution that reduces the number of cycles, wherein the result of a branch instruction is either correct or incorrect, the pipelined CPU architecture comprising: a CPU having two pipelines, each pipeline provided with a plurality of stages, through which the instructions directly following each of the correct and incorrect outcomes of the branch instruction are fetched by the first and second pipelines respectively, so that the instructions are executed without stalling immediately after the branch instruction is executed, reducing the number of cycles required to execute the instructions; wherein the branch instruction is executed in each of the two pipelines; if the result of executing the branch instruction in the first pipeline is correct, execution of the instruction directly following the branch instruction in the second pipeline is stopped, and only the first pipeline processes the instructions following the branch instruction until the next branch executes incorrectly; and the pipelined CPU is further configured to discard the execution result of the first pipeline when that result is incorrect, whereupon the instructions following the branch instruction are subsequently executed only in the second pipeline.
  2. The CPU architecture of claim 1, wherein each of the pipelines includes a fetch stage, a decode stage, an execute stage, and a write-back stage, and the execute stage and the write-back stage are shared by the two pipelines.
CN 200510088652 2005-07-29 2005-07-29 CPU architecture of enhancing transfer prediction CN100578444C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510088652 CN100578444C (en) 2005-07-29 2005-07-29 CPU architecture of enhancing transfer prediction


Publications (2)

Publication Number Publication Date
CN1904822A true CN1904822A (en) 2007-01-31
CN100578444C true CN100578444C (en) 2010-01-06

Family

ID=37674093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510088652 CN100578444C (en) 2005-07-29 2005-07-29 CPU architecture of enhancing transfer prediction

Country Status (1)

Country Link
CN (1) CN100578444C (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807144B (en) * 2010-03-17 2014-05-14 上海大学 Prospective multi-threaded parallel execution optimization method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6065115A (en) 1996-06-28 2000-05-16 Intel Corporation Processor and method for speculatively executing instructions from multiple instruction streams indicated by a branch instruction


Also Published As

Publication number Publication date Type
CN1904822A (en) 2007-01-31 application


Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C41 Transfer of the right of patent application or the patent right
ASS Succession or assignment of patent right

Owner name: HUIDA COMPANY

Free format text: FORMER OWNER: SWIDA VIRGIN CO., LTD., BRITISH VIRGIN ISLANDS

Effective date: 20071130

Owner name: SWIDA VIRGIN CO., LTD., BRITISH VIRGIN ISLANDS

Free format text: FORMER OWNER: YULI ELECTRONICS INC.

Effective date: 20071130

C14 Granted