KR20030007425A - Processor having replay architecture with fast and slow replay paths

Info

Publication number: KR20030007425A
Application number: KR1020027010573A
Authority: KR
Inventors: Michael D. Upton; David A. Sager; Darrell D. Boggs; Glenn J. Hinton
Original assignee: Intel Corporation
Priority date: 2000-02-14
Filing date: 2000-12-29
Publication date: 2003-01-23
Also published as: GB2376328A; GB2376328B; GB0221325D0; CN1452736A; DE10085438B4; CN1208716C; HK1048872A1; KR100508320B1; HK1048872B; WO2001061480A1; DE10085438T1; AU2001224640A1

Abstract

According to one aspect of the present invention, a microprocessor is provided that includes an execution core, a first replay mechanism, and a second replay mechanism. The execution core performs data speculation in executing a first instruction. The first replay mechanism is used to replay the first instruction via a first replay path when an error of a first type, indicating that the data speculation is erroneous, is detected. The second replay mechanism is used to replay the first instruction via a second replay path when an error of a second type, indicating that the data speculation is erroneous, is detected.

Description

PROCESSOR HAVING REPLAY ARCHITECTURE WITH FAST AND SLOW REPLAY PATHS

FIG. 1 shows a block diagram of one embodiment of a processor 100 disclosed in U.S. Pat. No. 5,966,544. The processor 100 shown in FIG. 1 includes an I/O ring 111 operating at a first clock frequency (the I/O clock), a latency-tolerant execution core 121 operating at a second clock frequency (e.g., a slow clock), a latency-intolerant execution sub-core 131 operating at a third clock frequency (e.g., a medium clock), and a latency-critical execution sub-core 141 operating at a fourth clock frequency (e.g., a fast clock). The processor 100 also includes clock multiplication and/or division units 110, 120, and 130 configured, as taught in the prior application, to provide appropriate clocking to the various portions of the processor 100, i.e., the sub-cores. The part of the prior application's teaching most relevant to the present application is that the execution core may include two or more portions (sub-cores) operating at different clock rates.

In operation, the I/O ring 111 communicates with the rest of the computer system (not shown) by performing various I/O operations, such as memory reads and writes, at the I/O clock frequency. For example, the processor 100 may read data from an external memory device by performing an I/O operation through the I/O ring 111 at the I/O clock frequency. The execution sub-cores 121, 131, and 141 may perform various functions or operations on input instructions and/or input data at their respective clock frequencies. For example, the latency-tolerant execution sub-core 121 may perform an execution operation on input data to produce a first result. The latency-intolerant sub-core 131 may perform an execution operation on the first result to produce a second result. Likewise, the latency-critical execution sub-core 141 may perform another execution operation on the second result to produce a third result. The operations performed by the execution sub-cores may include arithmetic operations, logical operations, and other operations. Those skilled in the art will understand that the order in which these operations are performed need not follow the hierarchical order of the execution sub-cores. For example, input data may go immediately and directly to the innermost sub-core, and the result obtained there may go from the innermost sub-core to another sub-core or back to the I/O ring 111 for write-back. In addition, as disclosed and taught in the prior application, the on-chip cache structure may be split across two or more portions of the processor 100. As a result, certain operations and/or functions may be performed at one clock frequency on one aspect of the data stored in the on-chip cache, while other operations and/or functions are performed at a different clock frequency on another aspect of that data. For example, way predictor miss detection for the on-chip cache may be performed in one sub-core at one clock frequency, while TLB hit/miss detection and/or page fault detection may be performed in another sub-core at a different frequency. As a result, certain errors and conditions can be detected earlier in the execution process than others.

FIG. 2 shows a block diagram of one embodiment of a processor 200, disclosed in the prior application, that includes a generalized replay architecture that facilitates data speculation. In this embodiment, the processor 200 includes a scheduler 231 coupled to a multiplexer 241 to provide instructions received from an instruction cache (I-cache) 231 to an execution core 251 for execution. The execution core 251 may perform data speculation in executing the instructions received from the multiplexer 241. The processor 200 shown in FIG. 2 includes a checker unit 281 that sends a copy of an executed instruction back to the execution core 251 for re-execution when the data speculation is determined to be erroneous. In this generalized replay architecture, however, the checker unit 281 is located after the execution core 251, the TLB and tag logic 261, and the cache hit/miss logic 271. Some instructions may already be known to have executed incorrectly (because the data speculation was erroneous) before the checker unit detects this. In particular, there are cases in which certain errors and conditions indicating erroneous data speculation can be detected even before the TLB/tag logic 261 and the hit/miss logic 271 have run. Unfortunately, because of where the checker unit 281 is positioned, instructions that executed incorrectly due to erroneous data speculation are not sent back to the execution core 251 for re-execution, i.e., replay, until they reach the checker unit 281. There is therefore an unnecessary delay between the time an instruction is known to have executed incorrectly and the time it is actually sent back for re-execution. System performance is consequently not as good as it could be if incorrectly executed instructions were re-executed, i.e., replayed, earlier in processing.

TECHNICAL FIELD The present invention relates generally to the field of processors, and more particularly to a replay architecture having fast and slow replay paths that facilitates data-speculating operations.

Cross-Reference to Related Applications

This application is a continuation-in-part of U.S. application Ser. No. 09/222,805, filed December 30, 1998, which is a continuation-in-part of U.S. patent application Ser. No. 08/746,547, filed November 23, 1996, now U.S. Pat. No. 5,966,544. This application and the above applications are all assigned to Intel Corporation of Santa Clara, California.

The features and advantages of the present invention will be better understood with reference to the accompanying drawings.

FIG. 1 is a block diagram of one embodiment of a processor including various sub-cores operating at different frequencies.

FIG. 2 is a block diagram of one embodiment of a processor having a generalized replay architecture.

FIG. 3 is a flow diagram of one embodiment of a processor pipeline in which the teachings of the present invention are implemented.

FIG. 4 is a block diagram of one embodiment of a processor having first and second replay mechanisms.

FIG. 5 is a detailed block diagram of one embodiment of a processor having first and second replay mechanisms.

FIG. 6 is a flow diagram of one embodiment of a method in accordance with the teachings of the present invention.

According to one aspect of the present invention, a microprocessor is provided that includes an execution core, a first replay mechanism, and a second replay mechanism. The execution core performs data speculation in executing a first instruction. The first replay mechanism is used to replay the first instruction via a first replay path when an error of a first type, indicating that the data speculation is erroneous, is detected. The second replay mechanism is used to replay the first instruction via a second replay path when an error of a second type, indicating that the data speculation is erroneous, is detected.

In the following detailed description of the embodiments, numerous specific details are set forth in order to provide a thorough understanding of the present invention. Those skilled in the art will appreciate, however, that the present invention may be practiced without these specific details.

In the following description, the teachings of the present invention are used to implement a method, an apparatus, and a system that facilitate data speculation in executing input instructions. To shorten execution time, the execution unit performs data speculation when executing an input instruction. If the data speculation is erroneous, the input instruction is re-executed by the execution unit until the correct result is obtained. In one embodiment, the data speculation is determined to be erroneous if an error of a first type or an error of a second type is detected. An error of the first type can be detected earlier than an error of the second type. In one embodiment, a first checker is responsible for sending a first copy of the input instruction back to the execution unit for re-execution, i.e., replay, when an error of the first type is detected in the execution of that instruction. A second checker is responsible for sending a second copy of the input instruction back to the execution unit for re-execution, i.e., replay, when an error of the second type is detected in the execution of that instruction. In one embodiment, a selector is used to provide one of the next input instruction, the first copy of the incorrectly executed instruction, or the second copy of the incorrectly executed instruction to the execution unit for execution, according to a predetermined priority scheme. The teachings of the present invention are applicable to any processor or machine that performs data speculation in executing instructions. The present invention is not, however, limited to processors or machines that perform data speculation; it can also be applied to processors and machines that require multiple levels of replay mechanisms.

FIG. 3 is a block diagram of one embodiment of a processor pipeline 300 in which the present invention may be implemented. In this specification, the term "processor" refers to any device capable of executing a sequence of instructions and includes, but is not limited to, general-purpose microprocessors, special-purpose microprocessors, graphics controllers, audio processors, video processors, multimedia controllers, and microcontrollers. The processor pipeline 300 includes several processing stages, beginning with a fetch stage 310. In the fetch stage, instructions are retrieved and fed into the pipeline 300. For example, a microinstruction may be retrieved from a cache memory that is integral to or closely associated with the processor, or from external memory via a system bus. The instructions retrieved in the fetch stage 310 are then input to a decode stage 320, where the instructions or microinstructions are decoded into microinstructions or micro-operations (referred to herein as UOPs or uops) for execution by the processor. In an allocation stage 330, the processor resources needed to execute the microinstructions are allocated. The next stage in the pipeline is a rename stage 340, in which references to external registers are converted into references to internal registers to remove false dependencies caused by register reuse. In a schedule/dispatch stage 350, each microinstruction or UOP is scheduled and dispatched to an execute stage 360. The microinstruction or UOP is then executed in the execute stage 360. After execution, the microinstruction is retired in a retire stage 370.

In one embodiment, the stages described above can be organized into three phases. The first phase can be referred to as the in-order front end and includes the fetch stage 310, the decode stage 320, the allocation stage 330, and the rename stage 340. During the in-order front end phase, instructions proceed through the pipeline 300 in their original program order. The second phase can be referred to as the out-of-order execution phase and includes the schedule/dispatch stage 350 and the execute stage 360. During this phase, each instruction is scheduled, dispatched, and executed as soon as its data dependencies are resolved and an appropriate execution unit is available, regardless of its sequential position in the original program. The third phase, referred to as the in-order retirement phase, includes the retire stage 370, in which instructions are retired in their original sequential program order to preserve the integrity and semantics of the program.

In a processor having a replay architecture, certain liberties can be taken in scheduling and executing input instructions. For example, an input UOP may be dispatched to an execution unit and executed even though its source data is not ready or not known. If the data speculation in executing an input UOP is determined to be erroneous, that UOP can be sent back to the execution unit and re-executed (replayed) until the correct result is obtained. Because replayed UOPs consume available resources and can degrade overall system performance, it is of course desirable to limit the amount of replay, i.e., re-execution. Nevertheless, performing such re-execution (replay) can yield a net gain in overall performance. For example, if most UOPs execute correctly in a small number of cycles and only a very small number of UOPs have to be replayed, overall efficiency will be better than in the lowest-common-denominator case, in which every UOP must wait as long as the worst case could require.
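The trade-off described above can be illustrated with a back-of-the-envelope model. The sketch below is not taken from the patent: the cycle counts, the miss rate, and the function names are hypothetical, and it assumes for simplicity that a single replay always produces the correct result.

```python
def expected_cycles_speculative(fast_cycles, replay_penalty, miss_rate):
    """Average cycles per UOP when every UOP is dispatched speculatively.

    Each UOP costs `fast_cycles`; a fraction `miss_rate` of UOPs additionally
    pays `replay_penalty` cycles for one replay (simplifying assumption).
    """
    return fast_cycles + miss_rate * replay_penalty


def expected_cycles_conservative(worst_case_cycles):
    """Average cycles per UOP when every UOP waits for the worst case."""
    return worst_case_cycles


# Hypothetical numbers, chosen only to illustrate the argument.
speculative = expected_cycles_speculative(fast_cycles=2, replay_penalty=6, miss_rate=0.05)
conservative = expected_cycles_conservative(worst_case_cycles=8)

print(f"speculative:  {speculative:.2f} cycles per UOP")
print(f"conservative: {conservative:.2f} cycles per UOP")
```

Under these assumed numbers the speculative policy averages well below the conservative one; the gain shrinks as the miss rate or the replay penalty grows.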

As taught in the prior application, the on-chip data cache (also called the level zero or L0 data cache) may be partitioned such that its data storage array resides in a higher clock domain than the logic that provides the hit/miss determination for that array. The TLB and tag logic may also reside in a slower clock domain than the data storage array. The TLB and tag logic may, but need not, reside in the same clock domain as the hit/miss logic.

One case in which an overall performance gain can be obtained is when the execution of a UOP depends on or uses data from the L0 data cache. Rather than making every UOP wait until its source data is determined to be valid, it is better to speculatively dispatch and execute some UOPs early in the process, even though their source data is only presumed, and not known for certain, to be present in the L0 data cache. In most cases the L0 data cache will hit and valid data will be used as the source. Only in a small number of cases will the data speculation be erroneous and the UOP be replayed. As a result, most UOPs execute correctly in a small number of cycles, which improves overall performance.

FIG. 4 is a block diagram of one embodiment of a processor 400 having a first (also called fast or early) replay path and a second (also called slow or late) replay path to facilitate data speculation in executing instructions. As shown in FIG. 4, the processor 400 includes a scheduler/dispatcher 411 coupled to an instruction cache (not shown) to schedule and dispatch a first instruction received from the instruction cache to an execution core 431 for execution via a selector (e.g., a multiplexer) 421. In one embodiment, the execution core 431 performs data speculation in executing the input instruction. As described above, an input instruction can be dispatched and executed even though its source data is not ready or not known. For example, executing the input instruction may require source data that may or may not be present in the L0 data cache. As described above, however, an overall performance gain can be obtained by speculating that the source data required to execute the input instruction is present in the L0 data cache. The processor 400 further includes a first replay mechanism 441 that replays the input instruction when an error of a first type, indicating that the data speculation is erroneous, is detected. In one embodiment, an error of the first type can be detected within a first period. The processor 400 also includes a second replay mechanism 451 that replays the input instruction when an error of a second type, indicating that the data speculation is erroneous, is detected. In one embodiment, an error of the second type can be detected within a second period that is longer than the first period. Thus, when an error of the first type has been detected, the present invention allows the incorrectly executed instruction to be replayed much sooner than it could be if it had to wait until an error of the second type could be detected. As shown in FIG. 4, when an error of the first type is detected and execution of the input instruction is therefore determined to have been incorrect, the first replay mechanism (the fast or early checker) 441 sends that instruction back to the execution core 431 via the multiplexer 421 for re-execution (replay). Likewise, when an error of the second type is detected, indicating that the data speculation is erroneous or that some other error condition exists, and execution of the input instruction is therefore determined to have been incorrect, the second replay mechanism (the slow or late checker) 451 sends that instruction back to the execution core 431 via the multiplexer 421 for re-execution (replay). The functions and operation of the first and second replay mechanisms shown in FIG. 4 are described in more detail below.
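The benefit of having two replay paths can be sketched as a small timing model. This is an illustration only, not the patent's implementation: the cycle counts `EARLY_CHECK_CYCLES` and `LATE_CHECK_CYCLES`, the `ErrorKind` enum, and the function name are all hypothetical.

```python
from enum import Enum


class ErrorKind(Enum):
    NONE = 0
    FIRST = 1   # first type: detectable within the first (shorter) period
    SECOND = 2  # second type: detectable only within the second (longer) period


# Hypothetical detection latencies, in cycles after dispatch.
EARLY_CHECK_CYCLES = 3
LATE_CHECK_CYCLES = 9


def replay_start_cycle(error, has_early_checker):
    """Return the cycle at which the instruction is sent back for replay.

    Returns None when no error was detected and no replay is needed.
    Without an early checker (the generalized architecture of FIG. 2),
    every error waits for the late checker.
    """
    if error is ErrorKind.NONE:
        return None
    if error is ErrorKind.FIRST and has_early_checker:
        return EARLY_CHECK_CYCLES
    return LATE_CHECK_CYCLES


for kind in ErrorKind:
    single = replay_start_cycle(kind, has_early_checker=False)
    dual = replay_start_cycle(kind, has_early_checker=True)
    print(f"{kind.name:6s} single-checker={single} dual-checker={dual}")
```

In this model only errors of the first type benefit from the fast path; errors of the second type are replayed at the same point either way.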

도 5는 도 4와 관련하여 상술된 제1 및 제2 리플레이 경로를 가진 프로세서(500)의 일 실시예의 상세 블록도이다. 도 5에 도시된 바와 같이, 프로세서(500)는 명령어 캐시(미도시)로부터 멀티플렉서(521)를 통해 실행을 위한 실행 코어(531)로 수신된 명령어들을 스케쥴하고 디스패치하는 스케쥴러(511)를 포함한다. 멀티플렉서(521)의 기능과 동작에 대해 다음에 상세히 설명한다. 일 실시예에서 실행 코어(531)는 멀티플렉서(521)로부터 수신된 입력 명령어의 실행시에 데이터 추론을 수행한다. 프로세서(500)는 입력 명령어의 제1 카피를 만들고 이 제1 카피를 제1 클록 도메인에서 적어도 하나의 클록 사이클 동안에 보유하는 제1 지연 유니트(541)를 더 포함한다. 프로세서(500)는 또한 제1 지연 유니트(541)와 실행 코어(531)에 결합된 제1 체커(545)를 포함한다. 일 실시예에서 제1 체커(545)는 제1 에러 종류 집합에 대해서 데이터 추론에 오류가 있는지 여부를 판단하고, 오류가 있다고 판단되면 그 입력 명령어의 제1 카피를 재실행을 위해 제1 버퍼(547)를 통해 실행 코어로 재전송하도록 구성된다. 도 5에 도시된 바와 같이, 일 실시예에서 프로세서(500)는 제1 지연 유니트에 결합되어 입력 명령어의 제2 카피를 만들고 그 제2 카피를 제2 클록 도메인에서 적어도 하나의 클록 싸이클 동안 보유하는 제2 지연 유니트(551)를 더 포함한다. 프로세서(500)는 제2 지연 유니트(551)와 제1 체커(545)에 결합된 제2 체커(555)를 포함한다. 일 실시예에서 제2 체커는 제2 에러 종류 집합에 대해서 명령어 실행에 오류가 있는지 여부를 판단하고, 오류가 있다고 판단되면 그 입력 명령어의 제2 카피를 재실행을 위해 제2 버퍼(557)를 통해 실행 코어(531)로 재전송하도록 구성된다. 도 5에 도시된 바와 같이, 멀티플렉서(521)는 스케쥴러(511), 실행 코어(531), 제1 지연 유니트(541), 제1 체커(545), 제2 체커(555), 제1 버퍼(547), 및 제2 버퍼(557)에 결합되어 있다. 일 실시예에서 멀티플렉서(521)는 명령어 캐시로부터는 입력 명령어와 그 후속 명령어를, 제1 체커로부터는 입력 명령어의 제1 카피를, 그리고 제2 체커로부터는 입력 명령어의 제2 카피를 수신하도록 구성된다. 일 실시예에서 멀티플렉서(521)는 소정의 우선순위 일람표에 따라서 후속 명령어, 입력 명령어의 제1 카피, 또는 입력 명령어의 제2 카피 중 어느 하나를 선택적으로 실행 코어(531)에 제공하도록 더 구성된다. 일 실시예에서 입력 명령어의 제2 카피에는 제1 실행 우선순위가 입력 명령어의 제1 카피에는 제2 실행 우선순위가, 후속 명령어에는 제3 실행 우선순위가 주어진다. 일 실시예에서 제1 실행 우선순위는 제2 실행 우선순위보다 높고, 제2 실행 우선순위는 제3 실행 우선순위보다 높다. 일 실시예에서 제1 에러 종류 집합은 제2 에러 종류 집합의 부분집합이다. 다른 실시예에서 제1 에러 종류 집합은 제2 에러 종류 집합에 우선하는(complimentary) 것이다. 일 실시예에서 제1 에러 종류 집합은 레벨 제로 캐시 경로 예측기가 미스된 것을 표시하는 에러, 레벨 제로 캐시 CAM 확장이 미스매치된 것을 표시하는 에러, 및 저장 전송(store forwarding) 버퍼 데이터가 알려지지 않음을 표시하는 에러를 포함한다. 일 실시예에서 제2 에러 종류 집합은 TLB 미스를 표시하는 에러, 페이지 폴트를 표시하는 에러, 또는 명령어가 잘못 실행되었다는 것과 각 명령어가 리플레이될 필요가 있다는 것 등을 표시하는 기타 다른 에러를 포함한다. 일 실시예에서 제1 지연 유니트(541)는 제1 클록 도메인에서 소정 수의 클록 싸이클 후에 제1 체커(545)에 입력 명령의 제1 카피를 제공하도록 구성된다. 일 실시예에서 제1 클록 도메인에서의 소정 수의 클록 싸이클은 실행 코어를 통한 입력 명령의 시간 지연에 대략적으로 해당한다.FIG. 
5 is a detailed block diagram of one embodiment of a processor 500 having the first and second replay paths described above with respect to FIG. 4. As shown in FIG. 5, the processor 500 includes a scheduler 511 that schedules instructions received from an instruction cache (not shown) and dispatches them, via a multiplexer 521, to an execution core 531 for execution. The function and operation of the multiplexer 521 are described in detail below. In one embodiment, the execution core 531 performs data speculation when executing an input instruction received from the multiplexer 521. The processor 500 further includes a first delay unit 541 that makes a first copy of the input instruction and holds the first copy for at least one clock cycle in a first clock domain. The processor 500 also includes a first checker 545 coupled to the first delay unit 541 and the execution core 531. In one embodiment, the first checker 545 determines whether there is an error in the data speculation with respect to a first set of error types and, if such an error is found, sends the first copy of the input instruction back to the execution core 531 through a first buffer 547 for re-execution. As shown in FIG. 5, in one embodiment the processor 500 further includes a second delay unit 551, coupled to the first delay unit 541, that makes a second copy of the input instruction and holds the second copy for at least one clock cycle in a second clock domain. The processor 500 also includes a second checker 555 coupled to the second delay unit 551 and the first checker 545. In one embodiment, the second checker 555 determines whether there is an error in the execution of the instruction with respect to a second set of error types and, if such an error is found, sends the second copy of the input instruction back to the execution core 531 through a second buffer 557 for re-execution.

As illustrated in FIG. 5, the multiplexer 521 is coupled to the scheduler 511, the execution core 531, the first delay unit 541, the first checker 545, the second checker 555, the first buffer 547, and the second buffer 557. In one embodiment, the multiplexer 521 is configured to receive the input instruction and its subsequent instructions from the instruction cache, the first copy of the input instruction from the first checker, and the second copy of the input instruction from the second checker. In one embodiment, the multiplexer 521 is further configured to selectively provide to the execution core 531 one of a subsequent instruction, the first copy of the input instruction, or the second copy of the input instruction, according to a predetermined priority scheme. In one embodiment, a first execution priority is given to the second copy of the input instruction, a second execution priority is given to the first copy of the input instruction, and a third execution priority is given to subsequent instructions, where the first execution priority is higher than the second and the second is higher than the third.

In one embodiment, the first set of error types is a subset of the second set of error types. In another embodiment, the first set of error types is complementary to the second set of error types. In one embodiment, the first set of error types includes an error indicating a level zero (L0) cache way predictor miss, an error indicating an L0 cache CAM extension mismatch, and an error indicating that store forwarding buffer data is unknown. In one embodiment, the second set of error types includes an error indicating a TLB miss, an error indicating a page fault, or another error indicating that the instruction was executed incorrectly and needs to be replayed. In one embodiment, the first delay unit 541 is configured to provide the first copy of the input instruction to the first checker 545 after a predetermined number of clock cycles in the first clock domain. In one embodiment, this predetermined number of clock cycles corresponds approximately to the time the input instruction takes to pass through the execution core.
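The timing role of a delay unit such as the first delay unit 541 can be pictured as a fixed-length delay line. The following Python sketch is only an illustration under assumptions: the class name `DelayUnit`, the method `tick`, and the three-cycle latency are hypothetical and do not come from the specification. It shows a copy entering the unit and emerging after a preset number of cycles, chosen to match the assumed latency of the execution core.

```python
from collections import deque


class DelayUnit:
    """Hypothetical model of a delay unit: holds instruction copies for a fixed
    number of clock cycles before handing them to a checker."""

    def __init__(self, delay_cycles):
        if delay_cycles < 1:
            raise ValueError("a copy is held for at least one clock cycle")
        # One slot per cycle of delay; None marks an empty slot (a bubble).
        self._slots = deque([None] * delay_cycles, maxlen=delay_cycles)

    def tick(self, incoming=None):
        """Advance one clock cycle.

        Accepts the copy made this cycle (or None) and returns the copy that
        has now been held for the full delay (or None).
        """
        outgoing = self._slots[0]
        self._slots.append(incoming)  # maxlen discards the oldest slot
        return outgoing


# Assumed execution-core latency of 3 cycles, purely for illustration.
unit = DelayUnit(delay_cycles=3)
emitted = [unit.tick("uop0"), unit.tick(), unit.tick(), unit.tick()]
print(emitted)
```

With the delay matched to the core latency, the copy reaches the checker in the same cycle in which the result of executing the original would become available for checking.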

There are cases in which other units in the processor 500 generate their own instructions to perform their functions. For example, a memory control unit or memory execution unit (not shown) in the processor 500 often needs to dispatch instructions for execution within its own pipeline, including full store operations or UOPs that handle page splits, TLB reloads, and the like. Instructions of this kind are called generated instructions because they are created by a unit within the processor 500 and are not part of the instruction flow from the instruction cache. In one embodiment, the multiplexer 521 is also coupled to receive generated instructions and send them to the execution core 531 for execution. Because the multiplexer 521 can receive instructions from several paths at the same time, a predetermined priority scheme is needed to set the execution priority among them. For example, in the same processing period or clock cycle the multiplexer may receive a subsequent instruction from the scheduler 511, a first copy of an input instruction to be replayed from the first checker 545, another input instruction to be replayed from the second checker 555, and a generated instruction from another unit (e.g., a memory control or execution unit). In one embodiment, the multiplexer 521 gives low priority to instructions coming from the instruction cache, medium priority to replay instructions coming from the first checker, high priority to replay instructions coming from the second checker, and the highest priority to generated instructions.
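The four-level priority scheme described for the multiplexer 521 can be sketched as a simple fixed-priority arbiter. The Python below is an illustrative model only: the names `Source` and `select` and the dictionary interface are assumptions made for this example, not elements of the specification.

```python
from enum import IntEnum


class Source(IntEnum):
    """Hypothetical labels for the four input paths; a larger value wins."""
    SCHEDULER = 0    # subsequent instruction from the instruction cache: low
    FAST_REPLAY = 1  # replay from the first checker 545: medium
    SLOW_REPLAY = 2  # replay from the second checker 555: high
    GENERATED = 3    # instruction generated by another unit: highest


def select(pending):
    """Pick the instruction to send to the execution core this cycle.

    `pending` maps each Source to a waiting instruction, or to None when that
    path has nothing to offer. Returns (source, instruction), or None when
    every path is idle.
    """
    candidates = [src for src, instr in pending.items() if instr is not None]
    if not candidates:
        return None
    winner = max(candidates)
    return winner, pending[winner]


cycle = {
    Source.SCHEDULER: "add r1, r2",
    Source.FAST_REPLAY: "load r3 (fast replay)",
    Source.SLOW_REPLAY: None,
    Source.GENERATED: None,
}
print(select(cycle))
```

In this cycle the fast-replay copy is chosen over the scheduler's instruction; had the second checker or another unit also offered an instruction, that one would have been chosen instead.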

As described above, in one embodiment the error conditions detected by the first checker 545 may be a subset of the error conditions detected by the second checker 555. In this case, the second checker 555 must provide definitive checking, because a UOP can no longer be replayed once it has passed the second checker. In another embodiment, the error conditions handled by the first checker 545 may be complementary to the error conditions handled by the second checker 555. In this case, the first checker 545 must provide definitive checking for its own set of errors, rather than the "high-confidence but not guaranteed" checking described above, because the later checker does not recheck the output of the early checker. The subset mode is therefore preferred.
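The difference between the two arrangements can be expressed with plain sets. The Python sketch below is illustrative only: the string labels for the error types and the function names are hypothetical. It checks whether any error handled by the early checker would go unchecked by the late checker, which is the situation in which the early checker has to be definitive.

```python
# Hypothetical labels for the error types named in the description.
FIRST_CHECKER_ERRORS = frozenset({
    "l0_predictor_miss",
    "l0_cam_extension_mismatch",
    "store_forwarding_data_unknown",
})
LATE_ONLY_ERRORS = frozenset({"tlb_miss", "page_fault"})


def second_checker_errors(mode):
    """Errors covered by the late checker in each arrangement."""
    if mode == "subset":
        # The late checker rechecks everything the early checker handles.
        return FIRST_CHECKER_ERRORS | LATE_ONLY_ERRORS
    if mode == "complementary":
        # The late checker handles only what the early checker does not.
        return LATE_ONLY_ERRORS
    raise ValueError(f"unknown mode: {mode}")


def first_checker_must_be_definitive(mode):
    """True when some early-checker error is never rechecked later."""
    uncovered = FIRST_CHECKER_ERRORS - second_checker_errors(mode)
    return bool(uncovered)


for mode in ("subset", "complementary"):
    print(mode, first_checker_must_be_definitive(mode))
```

In the subset arrangement nothing the early checker handles escapes the late checker, so an occasional miss by the early checker costs time but not correctness.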

As described above, the second checker 555 may provide additional and/or complementary checking for errors not handled by the first checker 545. Which errors are handled by which checker depends on several factors, including but not limited to processor performance, design complexity, and die area. In one embodiment, the second checker 555 is responsible for replaying instructions because of TLB misses and various other problems that may arise, for example, in a memory control unit (not shown) of the processor 500. These may include problems or error conditions that are difficult to detect in a short time, such as a cache miss determined by a full physical address check or incorrect forwarding from a store determined by a full physical address check.

In one embodiment, the first checker 545 and the second checker 555 cooperate to control the operation of the multiplexer 521. As shown in FIG. 5, the multiplexer 521 performs its function according to select signals received from the first checker 545 and the second checker 555 and, optionally, another select signal received from another unit, such as a memory control unit (not shown), that generates instructions which are not in the instruction flow from the instruction cache. The multiplexer 521 uses these select signals to determine which instruction to send to the execution core 531 in a given processing cycle when instructions from more than one path are waiting to be executed. In one embodiment, generated instructions are given a first execution priority, instructions coming from the second checker 555 are given a second priority lower than the first, instructions coming from the first checker are given a third priority lower than the second, and subsequent instructions coming from the instruction cache through the scheduler 511 are given a fourth priority lower than the third.

In one embodiment, once a particular UOP has been sent for fast replay by the first checker 545, the same instantiation of that UOP must not also be sent for slow replay by the second checker 555, because a copy already exists. To prevent this, in one embodiment each UOP may include several special fields that the checkers can use to coordinate replay activity between the first checker 545 and the second checker 555. For example, in one embodiment a UOP may include a field called NEEDS_FAST_REPLAY, which is set by the first checker 545 to indicate that the first checker 545 wants to send the UOP for fast replay. Each UOP may also include another field called GOT_FAST_REPLAY. In one embodiment, the GOT_FAST_REPLAY field is set through cooperation between the first checker 545 and the second checker 555. For example, assume that the first checker wants to send a first UOP for fast replay because an error of the first type has been detected. In this case, the first checker 545 sets the NEEDS_FAST_REPLAY field of that UOP to indicate that this particular UOP needs to be replayed on the fast replay path. If the second checker 555 wants to send a second UOP for slow replay in the same clock cycle, the GOT_FAST_REPLAY field of the first UOP is cleared and the multiplexer 521 is controlled to select the slow replay UOP rather than the UOP seeking fast replay. Later, when the first UOP reaches the second checker 555, it is sent for replay on the slow replay path because its NEEDS_FAST_REPLAY field is already set.
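The handshake between the two checkers can be sketched with two boolean fields per UOP. The Python below is a minimal model under stated assumptions: the class and function names are hypothetical, and two behaviors are assumptions of this sketch rather than statements in the text, namely that GOT_FAST_REPLAY is set when the fast replay wins the cycle, and that the second checker replays a UOP only when NEEDS_FAST_REPLAY is set and GOT_FAST_REPLAY is clear.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    """Hypothetical UOP record carrying the two coordination fields."""
    name: str
    needs_fast_replay: bool = False
    got_fast_replay: bool = False


def arbitrate(fast_candidate, slow_candidate):
    """Model one cycle in which the first checker wants a fast replay.

    `slow_candidate` is the UOP the second checker wants to replay in the same
    cycle, or None. Returns the UOP the multiplexer would select.
    """
    fast_candidate.needs_fast_replay = True
    if slow_candidate is not None:
        # The slow replay wins the cycle, so the fast replay is not granted.
        fast_candidate.got_fast_replay = False
        return slow_candidate
    # Assumption of this sketch: an uncontested fast replay marks the field.
    fast_candidate.got_fast_replay = True
    return fast_candidate


def second_checker_must_replay(uop):
    """Assumed rule: replay on the slow path only if the fast replay was
    requested but never granted, so no duplicate copy is in flight."""
    return uop.needs_fast_replay and not uop.got_fast_replay


first = Uop("load_a")
second = Uop("load_b")
winner = arbitrate(first, second)
print(winner.name, second_checker_must_replay(first))
```

Here the slow replay of `load_b` is selected, and `load_a` is later picked up by the second checker instead of being lost or replayed twice.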

FIG. 6 shows a flow diagram of one embodiment of a method 600 that uses fast and slow replay paths to facilitate data speculation. The method 600 begins at block 601 and proceeds to block 605. At block 605, the execution core or unit performs data speculation when executing an input instruction. The method 600 then proceeds from block 605 to block 609. At block 609, it is determined whether an error of the first type has been detected. As described above, in one embodiment an error of the first type occurs when the L0 data cache way predictor misses (in which case the data may not be present in the L0 data cache), when the L0 data cache CAM extension mismatches (i.e., the way predictor hits but the tags do not match), or when store forwarding buffer data is unknown (i.e., the data is expected to be forwarded from the store forwarding buffer but the store data is not available). At block 613, the input instruction is re-executed if an error of the first type has been detected. As described above, when an error of the first type is detected, the first checker unit (i.e., the fast or early checker) sends a copy of the input instruction for replay, that is, re-execution, on the fast replay path. The method 600 then proceeds to block 617. At block 617, it is determined whether an error of the second type has been detected. In this embodiment, the second checker (i.e., the slow or late checker) is responsible for determining whether an error of the second type has occurred. At block 621, the input instruction is re-executed if an error of the second type has occurred. As described above, the second checker is responsible for sending a copy of the input instruction for replay on the slow replay path when an error of the second type occurs.
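The order of the blocks in method 600 can be traced with a short Python function. This is a sketch for illustration: the function name `method_600` and its boolean arguments are hypothetical stand-ins for the two checkers' decisions, and the function only records which numbered blocks are visited.

```python
def method_600(first_type_error, second_type_error):
    """Return the sequence of flow-diagram blocks visited for one instruction.

    The two arguments stand in for the outcomes of the checks at blocks 609
    and 617; they are supplied by the caller rather than computed here.
    """
    visited = [601, 605]      # start, then execute with data speculation
    visited.append(609)       # early check for a first-type error
    if first_type_error:
        visited.append(613)   # re-execute via the fast replay path
    visited.append(617)       # late check for a second-type error
    if second_type_error:
        visited.append(621)   # re-execute via the slow replay path
    return visited


print(method_600(first_type_error=True, second_type_error=False))
print(method_600(first_type_error=False, second_type_error=True))
```

The trace makes explicit that the late check at block 617 is reached whether or not a fast replay was triggered at block 613.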

The present invention has been described above in connection with a preferred embodiment. It will be apparent to those skilled in the art that many alternatives, variations, modifications, and uses are possible in light of the foregoing description.

Claims

1. A microprocessor comprising:

an execution core that performs data speculation when executing a first instruction;

a first replay mechanism that replays the first instruction through a first replay path when a first type of error indicating an error in the data speculation is detected; and

a second replay mechanism that replays the first instruction through a second replay path when a second type of error indicating an error in the data speculation is detected.

2. The microprocessor of claim 1, wherein the first type of error is detectable within a first period and the second type of error is detectable within a second period longer than the first period.

3. The microprocessor of claim 1, wherein the first replay mechanism comprises:

a first delay unit that makes a first copy of the first instruction and holds the first copy for at least one clock cycle in a first clock domain; and

a first checker that determines whether the first type of error is detected and, when the first type of error is detected, sends the first copy of the first instruction back to the execution core for replay.

4. The microprocessor of claim 3, wherein the second replay mechanism comprises:

a second delay unit that makes a second copy of the first instruction and holds the second copy for at least one clock cycle in a second clock domain; and

a second checker that determines whether the second type of error is detected and, when the second type of error is detected, sends the second copy of the first instruction back to the execution core for replay.

5. The microprocessor of claim 4, further comprising an instruction cache that stores the first instruction and subsequent instructions and provides these instructions to the execution core.

6. The microprocessor of claim 5, further comprising a selector coupled to receive a subsequent instruction from the instruction cache, the first copy of the first instruction from the first checker, and another instruction from the second checker,

wherein the selector provides one of the subsequent instruction from the instruction cache, the first copy of the first instruction from the first checker, and the other instruction from the second checker to the execution core for execution according to a predetermined priority scheme.

7. The microprocessor of claim 6, wherein said selector comprises a multiplexer.

8. The microprocessor of claim 6, wherein the other instruction is given a first execution priority, the first copy of the first instruction is given a second execution priority lower than the first execution priority, and the subsequent instruction is given a third execution priority lower than the second execution priority.

9. The microprocessor of claim 1, wherein the first type of error is a subset of the second type of error.

10. The microprocessor of claim 1, wherein the first type of error is complementary to the second type of error.

11. The microprocessor of claim 1, wherein the first type of error is selected from the group consisting of an error indicating a level zero cache way predictor miss, an error indicating a level zero cache CAM extension mismatch, and an error indicating that store forwarding buffer data is unknown.

12. The microprocessor of claim 1, wherein the second type of error is selected from the group consisting of an error indicating a TLB miss and an error indicating incorrect forwarding from a store according to a full physical address check.

13. The microprocessor of claim 3, wherein the first delay unit provides the first copy of the first instruction after a predetermined number of clock cycles in the first clock domain, the predetermined number of clock cycles corresponding approximately to the time the first instruction takes to pass through the execution core.

14. The microprocessor of claim 6, further comprising means for generating instructions that are not in the instruction flow from the instruction cache.

15. The microprocessor of claim 14, wherein the selector is coupled to receive the generated instruction and send it to the execution core for execution.

16. The microprocessor of claim 15, wherein the selector gives low priority to instructions coming from the instruction cache, medium priority to replay instructions coming from the first checker, high priority to replay instructions coming from the second checker, and the highest priority to generated instructions.

17. A method comprising:

performing data speculation when executing a first instruction in an execution core;

re-executing the first instruction through a first replay path in response to a first type of error indicating an error in the data speculation; and

re-executing the first instruction through a second replay path in response to a second type of error indicating an error in the data speculation.

18. A microprocessor comprising:

means for performing data speculation when executing a first instruction;

means for re-executing the first instruction through a first replay path in response to a first type of error indicating an error in the data speculation; and

means for re-executing the first instruction through a second replay path in response to a second type of error indicating an error in the data speculation.

19. A processing system comprising:

an instruction cache that stores a first instruction and provides it for execution;

a scheduler coupled to the instruction cache, the scheduler scheduling the first instruction received from the instruction cache and dispatching it for execution;

an execution core coupled to execute the first instruction dispatched from the scheduler, the execution core performing data speculation when executing the first instruction;

a first replay mechanism that sends a first copy of the first instruction back to the execution core for re-execution when a first error indicating an error in the data speculation is detected; and

a second replay mechanism that sends a second copy of the first instruction back to the execution core for re-execution when a second error indicating an error in the data speculation is detected.