KR20010046738A - A branch prediction method using address trace

Info

Publication number: KR20010046738A
Application number: KR1019990050627A
Authority: KR
Inventors: 박성배
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1999-11-15
Filing date: 1999-11-15
Publication date: 2001-06-15
Also published as: KR100347865B1; US6988190B1

Abstract

PURPOSE: A branch prediction method using an address trace is provided to reduce chip size and production cost by shortening address decoding time and by accurately predicting branches with a small trace cache. CONSTITUTION: An address trace cache (220) comprises a start address, which stores the address at which each routine starts; an end address, which indicates the address at which each routine ends; a current access loop counter, which counts the current number of accesses of the relevant routine; and an old access loop counter, which counts the total number of accesses of the routine.

Description

A branch prediction method using address trace

The present invention relates to a branch prediction method and, more particularly, to a branch prediction method using an address trace.

FIG. 1 illustrates four generations of microarchitecture. The figure is based on FIG. 1 of the article "Trace Processors: Moving to Fourth-Generation Microarchitectures" by James E. Smith and Sriram Vajapeyam, published in IEEE Computer, pp. 68-74, in September 1997. Referring to the figure, (a) shows a serial processor, the first-generation microarchitecture, which was applied to the first digital computers in the 1940s and used until the early 1960s. A serial processor fetches and executes each instruction before moving on to the next one. The second-generation microarchitecture, which uses the pipeline shown in FIG. 1(b), was the standard for high-performance processors until the end of the 1980s. The third- and fourth-generation microarchitectures shown in FIG. 1(c) and FIG. 1(d) are characterized by superscalar processors, which first began to appear in commercially available processors in the late 1980s. As shown in FIG. 1, the next generation of high-performance processors will use multiple superscalar pipelines under a higher level of control that dispatches groups of instructions to the individual superscalar pipelines. A superscalar processor, which fetches multiple instructions at once as a cache block and processes them in parallel, suffers performance degradation caused less by the prediction error itself than by pipeline stalls and the invalidation of wrongly fetched instructions. Accurate branch prediction is therefore all the more important for superscalar processors.

As described above, the key to a high-performance superscalar processor is optimizing the use of the pipeline, and the most important design factor there is how branches are handled. Several techniques can be used to predict whether a branch will be taken; the most widely used is bimodal branch prediction. More recently, more accurate predictions have been obtained by using more branch history. One such method is local branch prediction, which considers the history of each branch independently and takes advantage of repetitive patterns. Another is global branch prediction, which uses the combined history of all recent branches to make a prediction. Each of these methods has its own advantages. Bimodal branch prediction works well when each branch is strongly biased in one direction, and local branch prediction works well for branches with simple repetitive patterns. Global branch prediction performs particularly well when the directions taken by sequentially executed branches are highly correlated. These methods are described in the article "Combining Branch Predictors" by Scott McFarling, WRL Technical Note TN-36, pp. 1-19, June 1993. WRL (Western Research Laboratory) is a computer systems research group founded by Digital Equipment Corporation in 1982. In addition to the methods above, there is also combined branch prediction, which combines local and global branch prediction.
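The behavior attributed to bimodal prediction above can be illustrated with a minimal sketch. It assumes the common formulation of a bimodal predictor as a table of 2-bit saturating counters indexed by the low bits of the branch address; the class name, the table size and the helper `accuracy` are illustrative choices, not details taken from this document.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters (assumed formulation, for illustration)."""

    def __init__(self, index_bits: int = 10) -> None:
        self.mask = (1 << index_bits) - 1
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * (1 << index_bits)

    def predict(self, branch_addr: int) -> bool:
        return self.counters[branch_addr & self.mask] >= 2

    def update(self, branch_addr: int, taken: bool) -> None:
        i = branch_addr & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(outcomes: list[bool], branch_addr: int = 0x40) -> float:
    """Fraction of correct predictions for one branch over a sequence of outcomes."""
    predictor = BimodalPredictor()
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(branch_addr) == taken
        predictor.update(branch_addr, taken)
    return hits / len(outcomes)


biased = accuracy([True] * 99 + [False])   # strongly biased branch
alternating = accuracy([True, False] * 50)  # simple repetitive pattern
print(biased, alternating)
```

In this sketch the strongly biased branch is predicted well, while the strictly alternating branch defeats the counter entirely, which is the kind of pattern the text assigns to local branch prediction instead.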

A branch predictor using one of the above methods, on encountering a branch instruction, predicts the outcome of that instruction's condition check from the results of previous branch instructions before the condition check itself completes. The CPU (Central Processing Unit) then fetches and executes the next instruction according to the predicted outcome. As a result, pipeline stalls, which degrade performance in recent pipelined CPUs that require fast instruction fetch, can be avoided. If the prediction turns out to be wrong, however, the instructions that have already been fetched and are in flight must be stopped, and the actual next instruction must be fetched and executed. An incorrect prediction thus stalls the pipeline until it is refilled with the correct instructions. This is called the mispredicted branch penalty.

A processor with a five-stage pipeline typically has a branch penalty of two cycles, so a four-way superscalar design, for example, loses eight instructions. As pipelines become deeper, more instructions are lost and the branch penalty grows. Because programs typically branch every four to six instructions, inaccurate branch prediction causes severe performance degradation in highly superscalar or deeply pipelined designs.
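The eight-instruction figure follows from multiplying the penalty in cycles by the issue width. A minimal sketch of that arithmetic, where the function name `lost_instructions` is a hypothetical helper introduced only for illustration:

```python
def lost_instructions(penalty_cycles: int, issue_width: int) -> int:
    """Issue slots wasted by one mispredicted branch."""
    return penalty_cycles * issue_width


# The example from the text: two-cycle penalty, four-way superscalar.
print(lost_instructions(penalty_cycles=2, issue_width=4))
```

The same product grows with either a longer penalty (a deeper pipeline) or a wider issue width, which is why the text singles out highly superscalar and deeply pipelined designs.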

Much effort has gone into reducing this branch penalty, and trace processors that use a trace cache have recently come into use. Such a trace processor is disclosed in the above-mentioned article "Trace Processors: Moving to Fourth-Generation Microarchitectures" by James E. Smith and Sriram Vajapeyam.

FIG. 2A shows a dynamic sequence of basic blocks stored in a conventional instruction cache 21, and FIG. 2B shows a conventional trace cache 22. In FIG. 2A, the arrows indicate taken branches (jumps to a branch target address). Because the instruction cache 21 stores instructions in noncontiguous cache locations, fetching the instructions in basic blocks 'ABCDE' takes four cycles even if multiple branch predictions are made every cycle.

For this reason, several researchers have proposed special instruction caches for capturing long dynamic instruction sequences. As shown in FIG. 2B, each line of such a cache stores a snapshot, or trace, of the dynamic instruction stream; a cache with this structure is therefore called a trace cache 22. The trace cache 22 is disclosed in: J. Johnson, "Expansion Caches for Superscalar Processors," Technical Report CSL-TR-94-630, Computer Science Laboratory, Stanford Univ., June 1994; U.S. Pat. No. 5,381,533, "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line," issued to A. Peleg and U. Weiser in January 1995; and E. Rotenberg, S. Bennett and J. Smith, "Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching," Proc. 29th Int'l Symp. Microarchitecture, pp. 24-34, December 1996.

The trace cache 22 is also disclosed in: Sanjay Jeram Patel, Marius Evers and Yale N. Patt, "Improving Trace Cache Effectiveness with Branch Promotion and Trace Packing," Proc. 25th Int'l Symp. Computer Architecture, pp. 262-271, June 1998; Sanjay Jeram Patel, Daniel Holmes Friendly and Yale N. Patt, "Evaluation of Design Options for the Trace Cache Fetch Mechanism," IEEE Transactions on Computers, vol. 48, no. 2, pp. 193-204, February 1999; and Eric Rotenberg, Steve Bennett and James E. Smith, "A Trace Cache Microarchitecture and Evaluation," IEEE Transactions on Computers, vol. 48, no. 2, pp. 111-120, February 1999.

As described above, the dynamic sequence whose blocks appear noncontiguously in the instruction cache 21 of FIG. 2A appears contiguously in the trace cache 22 of FIG. 2B. The instructions stored in the trace cache 22 can therefore be executed sequentially, without repeatedly branching to instruction addresses according to the programmed routine as in conventional branch prediction methods. This not only avoids the branch penalty incurred by conventional branch prediction methods but also allows more advanced parallel processing, because instructions stored at noncontiguous locations in the instruction cache 21 are stored contiguously in the trace cache 22.

However, because the trace cache 22 stores the instructions themselves, each instruction must be decoded into its corresponding address. In addition, the trace cache 22 stores even repeatedly executed instructions over and over in their order of execution, so the cache becomes too large. For a trace cache 22 that stores instructions in this way to be large enough to hold all the instructions, chip size and production cost must increase. What is needed, therefore, is a branch prediction method that reduces the address decoding time of the trace cache 22 and reduces chip size and production cost by using a modest cache capacity.

Accordingly, an object of the present invention, which is proposed to solve the problems described above, is to provide a branch prediction method using a trace cache that reduces address decoding time and reduces chip size and production cost by performing accurate branch prediction with a small trace cache.

FIG. 1 is a diagram showing four generations of microarchitecture;

FIG. 2A is a diagram showing a dynamic sequence of basic blocks stored in a conventional instruction cache;

FIG. 2B is a diagram showing a conventional trace cache;

FIG. 3 is a diagram showing an example of a repeated instruction pattern;

FIG. 4 is a diagram showing the pattern in which the instructions shown in FIG. 3 are stored in a trace cache according to the prior art;

FIG. 5 is a diagram showing the configuration of an address trace cache according to the present invention; and

FIG. 6 is a diagram showing the pattern in which the instructions shown in FIG. 3 are stored in the trace cache according to the present invention.

* Description of the symbols for the main parts of the drawings *

22: trace cache 220: address trace cache

According to a feature of the present invention for achieving the above object, a branch prediction method using a trace cache includes: when a routine consisting of non-repeated instructions is executed, storing the address corresponding to each instruction in the trace cache in the order in which the instructions are executed; and, when a routine consisting of repeated instructions is executed, storing the address at which the routine starts and the address at which the routine ends, and counting and storing the current number of accesses of the routine and the total number of accesses of the routine.

The trace cache includes loop counters for counting the current number of accesses and the total number of accesses of the routine when the routine consisting of repeated instructions is executed.

The branch prediction method includes comparing the counts held by the loop counters and, when the two counts are equal, addressing the start address of the routine to be executed next. The method further includes reconfiguring the loop counter when a branch misprediction occurs, the loop counter using the most recently updated loop count.

(Embodiment)

An embodiment of the present invention is described in detail below with reference to the accompanying FIGS. 3 to 6.

The novel trace cache of the present invention stores the address trace corresponding to the instructions in decoded form, which reduces the address decoding time for each instruction. It also needs only a small trace cache to store the address trace of a repeatedly executed routine, which reduces chip size and production cost.

FIG. 3 shows an example of a repeated instruction pattern.

Referring first to routine ① in FIG. 3, operation A and operation B are performed repeatedly 30 times. When routine ① finishes, routine ② is executed, in which operations C, D and E are performed in sequence, 20 times over. When routine ② finishes, routine ③ is executed, in which operations F and G are performed repeatedly 40 times.

Assuming that the routines shown in FIG. 3 are executed, the instructions stored in the conventional trace cache 22 are as shown in FIG. 4.

Referring to FIG. 4, the conventional trace cache 22 stores instructions simply in the order in which they are executed, regardless of whether they are executed repeatedly. The trace cache 22 therefore requires a data storage area for 60 instructions for routine ① (2 instructions × 30 repetitions = 60), a data storage area for 60 instructions for routine ② (3 instructions × 20 repetitions = 60), and a data storage area for 80 instructions for routine ③ (2 instructions × 40 repetitions = 80). That is, storing routines ① to ③ of FIG. 3 requires 200 data storage areas in total. If each instruction takes 32 bits to store, storing routines ① to ③ requires a total of 6,400 bits (that is, 800 bytes) of data storage.
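The storage figures for the conventional trace cache can be reproduced with a short sketch. The routine contents and the 32-bit width come from the description of FIG. 3 and FIG. 4; the names `ROUTINES` and `conventional_trace` are illustrative only.

```python
# (operations per iteration, repetitions) for routines 1 to 3 of FIG. 3
ROUTINES = [("AB", 30), ("CDE", 20), ("FG", 40)]
BITS_PER_INSTRUCTION = 32  # per-instruction width assumed in the text


def conventional_trace(routines):
    """Unroll every iteration, as the conventional trace cache stores it."""
    trace = []
    for ops, repetitions in routines:
        trace.extend(list(ops) * repetitions)
    return trace


trace = conventional_trace(ROUTINES)
total_bits = len(trace) * BITS_PER_INSTRUCTION
print(len(trace), total_bits, total_bits // 8)  # 200 6400 800
```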

FIG. 5 shows the configuration of the address trace cache 220 according to the present invention. As shown in FIG. 5, the address trace cache 220 consists of a start address for storing the address at which each routine starts, an end address for indicating the address at which each routine ends, a current access loop counter for counting the current number of accesses of the routine, and an old access loop counter for indicating the total number of accesses of the routine.

For example, when the execution information of the instructions executed in routine ① shown in FIG. 5 is represented in the address trace cache 220 according to the present invention, the start address and end address of routine ① are first stored in the trace cache 220. The current access loop counter then holds the current number of accesses of routine ①, and the old access loop counter holds the total number of accesses of routine ① (for example, 30). As routine ① is accessed repeatedly, the value of the current access loop counter increases. When it becomes equal to the value of the old access loop counter, execution of routine ① ends and the start address of routine ② is stored as the NFP (next fetch point).
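The counter comparison described above can be sketched as follows. The four fields mirror the entry layout of FIG. 5; the concrete addresses, the function name `next_fetch_point`, and the decision to reset the current counter once the loop exits are assumptions made for illustration and are not specified in the text.

```python
from dataclasses import dataclass


@dataclass
class AddressTraceEntry:
    start_address: int
    end_address: int
    old_access_loop_counter: int          # total number of accesses
    current_access_loop_counter: int = 0  # accesses so far


def next_fetch_point(entry: AddressTraceEntry, next_routine_start: int) -> int:
    """Call each time the routine's end address is reached; returns the NFP."""
    entry.current_access_loop_counter += 1
    if entry.current_access_loop_counter == entry.old_access_loop_counter:
        entry.current_access_loop_counter = 0  # assumption: reset for reuse
        return next_routine_start              # loop done: go to next routine
    return entry.start_address                 # loop continues


# Hypothetical addresses: routine 1 spans 0x100-0x104, routine 2 starts at 0x200.
routine_1 = AddressTraceEntry(0x100, 0x104, old_access_loop_counter=30)
fetch_points = [next_fetch_point(routine_1, 0x200) for _ in range(30)]
print(hex(fetch_points[0]), hex(fetch_points[-1]))
```

In this sketch the first 29 calls return the routine's own start address and the 30th returns the start address of the next routine, matching the point at which the two counters become equal.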

As shown in FIGS. 3 and 5, routines ① to ③, which are executed one after another, can each be stored in the address trace cache 220 by the method described above. For a routine that, unlike routines ① to ③, is not executed repeatedly, the address of each instruction making up the routine is written in sequence. When a branch misprediction occurs, the loop counter is reconfigured using the most recently updated loop count.
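The text does not spell out how the loop counter is reconfigured after a misprediction. One possible reading, sketched below purely as an assumption, is that the stored total is replaced by the count the loop had actually reached when it exited. The class `LoopEntry`, the function `reconfigure_on_misprediction` and the addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopEntry:
    start_address: int
    end_address: int
    old_access_loop_counter: int
    current_access_loop_counter: int = 0


def reconfigure_on_misprediction(entry: LoopEntry) -> None:
    """Assumed policy: adopt the most recently updated count as the new total."""
    entry.old_access_loop_counter = entry.current_access_loop_counter
    entry.current_access_loop_counter = 0


# The entry predicted 30 iterations, but this time the loop exited after 25.
entry = LoopEntry(0x100, 0x104, old_access_loop_counter=30,
                  current_access_loop_counter=25)
reconfigure_on_misprediction(entry)
print(entry.old_access_loop_counter)
```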

FIG. 6 shows the pattern in which the instructions shown in FIG. 3 are stored in the address trace cache 220 according to the present invention.

Referring to FIG. 6, when routine ① of FIG. 3 is stored in the address trace cache 220 according to the present invention, the address of instruction A, which is executed first, is stored as the start address of routine ①, and the address of instruction B, which is executed last, is stored as the end address of routine ①. Because routine ① is repeated 30 times in total, the old access loop counter is stored as 30, and the current access loop counter is incremented by 1 each time routine ① is executed again. Information about routine ② and routine ③ is stored in the address trace cache 220 in the same way.

As shown in FIG. 6, the address trace cache 220 according to the present invention consists of the address at which each routine starts, the address at which each routine ends, the current access loop counter and the old access loop counter, so only four data storage areas are needed to store the information for a repeated routine. Storing routines ① to ③ of FIG. 3 in the address trace cache 220 therefore requires 12 data storage areas in total. If each item takes 32 bits to store, storing routines ① to ③ requires a total of 384 bits (that is, 48 bytes). Compared with the conventional trace cache shown in FIG. 4, this reduces the required data storage by a factor of about 16.7.
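The comparison can be checked with a few lines of arithmetic. All numbers come from the text (three routines, four fields per routine, 32 bits per item, 200 instructions in the conventional cache); the variable names are illustrative only.

```python
BITS_PER_ITEM = 32
FIELDS_PER_ROUTINE = 4  # start address, end address, current counter, old counter
NUM_ROUTINES = 3

# Conventional trace cache: every executed instruction is stored.
conventional_bits = (2 * 30 + 3 * 20 + 2 * 40) * BITS_PER_ITEM

# Address trace cache: one four-field entry per repeated routine.
address_trace_bits = NUM_ROUTINES * FIELDS_PER_ROUTINE * BITS_PER_ITEM

ratio = conventional_bits / address_trace_bits
print(address_trace_bits, address_trace_bits // 8, round(ratio, 1))  # 384 48 16.7
```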

This effect grows as a routine contains more repeated instructions and as the instructions are repeated more times. Because the address trace cache 220 according to the present invention uses a data storage area that is markedly smaller than that of the conventional trace cache, chip size and production cost can be reduced. Moreover, because the address trace cache 220 stores the address trace for each instruction in decoded form, the address decoding time of the instructions can be reduced.

Although the configuration and operation of the circuit according to the present invention have been illustrated in the above description and drawings, they are merely examples, and various changes and modifications are of course possible without departing from the technical spirit of the present invention.

According to the present invention as described above, the address trace corresponding to the instructions is stored in decoded form, which reduces the address decoding time for each instruction, and only a small trace cache is used, which reduces chip size and production cost.

Claims

1. A branch prediction method using an address trace, in a branch prediction method that uses a trace cache, the method comprising:

when a routine consisting of non-repeated instructions is executed, storing the address corresponding to each instruction in the trace cache in the order in which the instructions are executed; and

when a routine consisting of repeated instructions is executed, storing the address at which the routine starts and the address at which the routine ends, and counting and storing the current number of accesses of the routine and the total number of accesses of the routine.

2. The method of claim 1, wherein the trace cache includes loop counters for counting the current number of accesses of the routine and the total number of accesses of the routine when the routine consisting of repeated instructions is executed.

3. The method of claim 2, further comprising comparing the counts held by the loop counters and, when the two counts are equal, addressing the start address of the routine to be executed after the routine.

4. The method of claim 2, further comprising reconfiguring the loop counter when a branch misprediction occurs, wherein the loop counter uses the most recently updated loop count.