KR19990066806A

KR19990066806A - How to improve microprocessors using pipeline synchronization

Info

Publication number: KR19990066806A
Application number: KR1019980046782A
Authority: KR
Inventors: Thomas J. Heller Jr.; William Todd Boyd
Original assignee: Jeffrey L. Forman; International Business Machines Corporation
Priority date: 1998-01-20
Filing date: 1998-11-02
Publication date: 1999-08-16
Also published as: EP0930565A2; US6052771A; CN1224192A; TW417063B; CN1109968C; SG77216A1; KR100310900B1; MY114607A

Abstract

The present invention provides a system and method that improve out-of-order support in a microprocessor computer system through register management using synchronization of multiple pipelines, and that process a sequential instruction stream in a computer system having first and second processing elements, each with its own state determined by the settings of its general-purpose registers and control registers. At any point during processing of the sequential instruction stream by the first processing element, when it becomes beneficial for the second processing element to take over continued processing of the same stream, both processing elements process the stream and may execute the same instructions, but only one of them may change the overall architectural state of the computer system, which is determined by a combination of the states of the first and second processing elements. The second processor has more pipeline stages than the first, in-order processor and feeds the first processor, reducing the finite cache penalty and improving performance. Processing and storing the second processor's results does not change the architectural state of the computer system. The results are stored in the second processor's GPRs or in its personal storage buffer. Resynchronization of state with the coprocessor occurs upon an invalid op, a stall, or a computed specific benefit to processing with the coprocessor used as a speculative coprocessor. In an alternative embodiment, the processing is generalized to two processors that forward and store architectural state changes for future use by the other processor when the processor permitted to change the overall architectural state of the computer system is later switched.

Description

How to Improve Microprocessors Using Pipeline Synchronization

The present invention relates to computer systems, and more particularly to a method of using pipeline synchronization to improve the performance of a microprocessor computer system with a coupled coprocessor.

The performance of current microprocessors is severely limited by finite cache effects for most important workloads. Finite cache effects include every source of performance degradation that would be eliminated if the microprocessor's first-level cache were infinitely large. In most cases, the time a microprocessor spends waiting for operand data from off-chip storage equals the time it spends executing instructions. This is especially true of workloads involving database and transaction processing.
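To make the finite cache effect concrete, the following minimal sketch splits CPI into an infinite-L1 component and a finite-cache adder. The function name, the miss rate, and the miss penalty are illustrative assumptions, not figures from the patent; they are chosen only so that stall time equals execution time, the situation described above.

```python
def total_cpi(infinite_l1_cpi, misses_per_instruction, miss_penalty_cycles):
    """Return (total CPI, finite-cache adder) for a simple additive model."""
    # Cycles per instruction lost waiting on misses (the finite-cache adder).
    finite_cache_adder = misses_per_instruction * miss_penalty_cycles
    return infinite_l1_cpi + finite_cache_adder, finite_cache_adder


# Hypothetical numbers: 1.5 cycles per instruction with a perfect L1 cache,
# 0.03 misses per instruction, 50 cycles per miss.
cpi, adder = total_cpi(1.5, 0.03, 50)

# Fraction of all cycles spent waiting on the storage hierarchy.
stall_fraction = adder / cpi
```

With these assumed inputs the adder matches the infinite-L1 CPI, so about half of all cycles are spent waiting for data. Under this model, any technique that lowers the effective miss rate or miss penalty shrinks the adder without touching the core pipeline.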

Many current microprocessor designs aim to reduce the finite cache penalty. Large caches, multilevel caches, high-speed multichip modules, out-of-order execution, and instruction prefetch are widely used and are considered the most successful. Operand prefetch has also been used successfully for certain workloads, with or without conventional out-of-order processing. However, prefetching is not particularly effective for database and transaction workloads. Large caches reduce finite cache effects, but further improvement in this area is limited by the cost-performance implications of larger die sizes or chip counts. Current out-of-order execution techniques greatly reduce finite cache effects, but they carry a penalty in the form of lower processor clock frequency and greater design complexity. There is therefore a need for an improved microprocessor design that substantially reduces the cost of implementing the out-of-order execution designs previously considered desirable.

Glossary of Terms

CPI: machine cycles per instruction.

SFE: the speculative fetch engine provided by the present invention.

uPCore: a microprocessor design balanced for the tradeoff among cycle time, design complexity, and infinite L1 cache CPI.

The present invention provides a method of designing a microprocessor computer system and, more particularly, a method of improving system performance using pipeline synchronization in a computer system having a coupled coprocessor. The invention improves out-of-order support and provides the ability to use large caches and multilevel caches in computer systems, particularly those having a microprocessor and a coprocessor coupled to it, where the coprocessor provides a speculative engine that reduces the finite cache penalty and thereby improves system performance.

The preferred embodiment provides improved microprocessor support through register management using synchronization of multiple pipelines. These improvements are achieved by providing: a speculative fetch engine (SFE) with multiple execution elements that works together with a core microprocessor that processes instructions essentially in order (while still able to handle superscalar techniques such as fetching ahead and performing concurrent loads as needed); a method of out-of-order execution; a method of synchronizing with a plurality of microprocessors; and a register management process that enables the generation of speculative memory references to a storage hierarchy shared by both the SFE and the microprocessor core (uPCore).

Figures 1a and 1b provide the same description as U.S. Patent No. 4,901,233, granted to Liptay of IBM, and illustrate the limitations of the prior art, which was developed by IBM and has since been widely used in IBM's conventional mainframes as well as in Intel's Pentium microprocessors, Digital's Alpha microprocessors, and Sun Microsystems' UltraSPARC microprocessors.

Figure 2 shows a general overview of the preferred embodiment.

Figure 3 shows details of the SFE and its interfaces to the storage buffer, the cache, and the uPCore. It also shows the preferred hardware for routing the SFE's instruction and operand fetches through the cache shared by the SFE and the uPCore.

Figure 4 shows the synchronization unit between the uPCore and the SFE in more detail.

Figure 5 shows in detail the improvements to Liptay's register renaming invention that allow synchronization between the uPCore and the SFE.

Figure 6 shows the preferred hardware for the uPCore.

Figure 7 shows in detail, as a data flow diagram, the interaction between the speculative engine and the microprocessor uPCore, to illustrate the method the invention employs to improve performance.

<Explanation of reference numerals for the main parts of the drawings>

10: computer system

12: main memory

14: cache memory system

16: instruction cache

18: data cache

20: instruction buffer unit

22: instruction register unit

24, 26: general-purpose execution units (GPE)

28: storage buffer unit

30: general-purpose register array

31: instruction queue

32: register management system

36: interrupt control element

The uPCore and the SFE are both processing elements. The system processes a sequential instruction stream in a computer system having first and second processing elements, each with its own state determined by the settings of its general-purpose registers and control registers. When, at any point during processing, it becomes advantageous for the second processing element to take over continued processing of the sequential instruction stream, both processing elements process the stream and may execute the same instructions, but only one of them (the uPCore in this preferred embodiment) may change the overall architectural state of the computer system, which is determined by a combination of the states of the first and second processing elements.

In the preferred embodiment, the second processor has more pipeline stages than the first, in-order processing element and is capable of out-of-order execution, thereby reducing the finite cache penalty and improving performance. In the preferred embodiment, processing and storing the results of the second processing element does not change the architectural state of the computer system. Those results are stored in its general-purpose registers or in its personal storage buffer. Resynchronization of the states of the two processing elements occurs upon an invalid op, a stall, or a computed specific benefit to processing with the coprocessor used as an out-of-order coprocessor (SFE).
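The separation between speculative results and architectural state can be sketched as follows. This is a toy model under stated assumptions: the class `SpeculativeEngine`, the function `needs_resync`, and the benefit threshold are hypothetical names and policies introduced for illustration, and the real hardware is described in the figures rather than here.

```python
from dataclasses import dataclass, field


@dataclass
class SpeculativeEngine:
    """Toy SFE: results go to private registers and a private storage buffer."""
    gprs: dict = field(default_factory=dict)
    store_buffer: dict = field(default_factory=dict)

    def store(self, address, value):
        # Buffered privately; architected memory is never written from here.
        self.store_buffer[address] = value

    def load(self, address, memory):
        # Prefer the engine's own pending store, else read shared memory.
        return self.store_buffer.get(address, memory.get(address, 0))

    def resync(self, core_gprs):
        # Discard speculative state and copy the architected registers.
        self.gprs = dict(core_gprs)
        self.store_buffer.clear()


def needs_resync(invalid_op, stalled, estimated_benefit, threshold=0.0):
    """Assumed policy: resync on an invalid op, a stall, or a positive benefit."""
    return invalid_op or stalled or estimated_benefit > threshold


memory = {0x100: 7}                  # architected memory, owned by the core
sfe = SpeculativeEngine()
sfe.store(0x100, 99)                 # speculative store stays private
speculative_view = sfe.load(0x100, memory)
architected_view = memory[0x100]
if needs_resync(invalid_op=True, stalled=False, estimated_benefit=0.0):
    sfe.resync({"r1": 5})            # hypothetical architected register state
```

The point of the sketch is that the speculative store is visible only to the engine that issued it, and that a resynchronization simply replaces the engine's private state with the architected state.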

The SFE interfaces with the uPCore, so the invention is readily implemented with the SFE and the first processor, the uPCore, on the same silicon chip. Multichip implementations are also possible and are consistent with the present embodiment. The uPCore has a conventional structure and, in the preferred embodiment, maintains the architectural state of the combined system, whereas in the generalized mirror version the responsibility for maintaining the architectural state alternates or is shared by both. In the preferred embodiment, actions invoked by the SFE do not directly change the architectural state of the uPCore. The SFE is used to generate storage references that prime the caches of the combined system with instruction and operand data before the uPCore uses them. These improvements extend the system performance achievable with conventional register renaming schemes such as those disclosed in U.S. Patent Nos. 4,901,233 (hereinafter Liptay) and 4,574,349.
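The cache-priming effect described above can be illustrated with a small simulation. It is a sketch only: the 64-byte line size, the unbounded cache, the address trace, and the assumption that the SFE touches every line before the core needs it are all simplifications introduced here, not details from the patent.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def run_core(addresses, cache):
    """In-order core: count misses against the shared cache, filling on a miss."""
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        if line not in cache:
            misses += 1
            cache.add(line)
    return misses


def prime_cache(addresses, cache):
    """SFE stand-in: touch lines ahead of the core; no architected state changes."""
    for addr in addresses:
        cache.add(addr // LINE_SIZE)


trace = [0, 64, 128, 4096, 8, 72]    # hypothetical operand addresses

cold_misses = run_core(trace, set())  # core alone, cold cache

shared_cache = set()
prime_cache(trace, shared_cache)      # SFE runs ahead on the same stream
primed_misses = run_core(trace, shared_cache)
```

In this idealized run the core alone takes four misses while the primed run takes none. A real SFE would speculate imperfectly and compete for cache capacity, so the benefit would be smaller.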

These and other improvements are set forth in the following detailed description. The detailed description and the accompanying drawings give a better understanding of the advantages and features of the invention over the conventional designs first developed by IBM and now widely implemented.

Before discussing the preferred embodiment in detail, it is worth describing, by way of example, the conventional out-of-order microprocessor design technique first developed by IBM and described in the Liptay patent, U.S. Patent No. 4,901,233. Figure 1 shows this typical prior-art out-of-order microprocessor design, disclosed in U.S. Patent No. 4,901,233, which teaches the use of a register management system (RMS). The RMS is used for precise post-branch recovery and also allows more physical registers to be used for general purposes than are named in the architected system. The use of a register management system is essential to enabling out-of-order execution. It should be recognized that out-of-order execution significantly reduces the finite cache penalty, which is the focus of the present invention. The preferred embodiment disclosed in U.S. Patent No. 4,901,233 involves modifications to the basic pipeline of prior-art in-order processor designs. These modifications are needed to integrate the RMS into the overall system, and they result in an instruction pipeline that is either longer (that is, has more stages) or contains more logic per stage than the instruction pipeline of an in-order design. The preferred embodiment of U.S. Patent No. 4,901,233 improves both the infinite L1 cache CPI and the finite cache CPI relative to earlier in-order designs. The present invention does not preclude the use of out-of-order techniques to improve the infinite L1 CPI, but it allows their use to be limited in order to strike a better balance between design complexity and out-of-order support in the main instruction execution pipeline. The focus of the present invention is the use of out-of-order techniques to reduce the finite L1 CPI adder without any increase in the uPCore pipeline length or in the length of each pipeline stage. The overall result is system performance well above that of U.S. Patent No. 4,901,233, because for database and transaction workloads the improved cycle time yields a larger performance gain than the slight improvement in infinite L1 cache CPI achieved by U.S. Patent No. 4,901,233. In addition, when the uPCore is implemented as an in-order design, the present invention isolates the register management system from the main instruction processing pipeline, greatly reducing the design complexity of all issues associated with out-of-order instruction processing. With that background, Figures 1a and 1b, as implemented in U.S. Patent No. 4,901,233, are now described.

The Liptay invention is a register management system for a computer system whose architectural design requires a plurality of addressable (logical) registers of a specific kind, such as general-purpose registers (for example, n general-purpose registers). Many elements of the Liptay design are used in the system of the present invention, as described below. A register array (RA) having m registers, where m is greater than n, is provided to perform the function of the n general-purpose registers. As an illustrative embodiment, U.S. Patent No. 4,901,233 describes a system with 16 general-purpose registers conforming to the well-known IBM System/370 architecture, which is still used in current S/390 machines. The RA provides dynamic assignment of particular RA positions to perform the functions of the architected registers. When the function of a particular register assignment is complete, the position in the RA is released and may in due course be reassigned to the same architected GPR or to another.
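The dynamic assignment and release of register array positions can be sketched as a rename table with a free list. This is a minimal illustration, not the Liptay mechanism itself: the class and method names are hypothetical, the first-in first-out free list is an assumed allocation policy, and the sizes (16 architected registers, 32 physical positions) follow the example figures given in this description.

```python
class RegisterArray:
    """Toy rename table: n architected GPRs backed by m > n physical positions."""

    def __init__(self, n_arch=16, m_phys=32):
        assert m_phys > n_arch
        # Initially architected register i lives in physical position i.
        self.mapping = {i: i for i in range(n_arch)}
        self.free = list(range(n_arch, m_phys))

    def rename(self, arch_reg):
        """Assign a fresh physical position for a new value of arch_reg."""
        if not self.free:
            raise RuntimeError("no free physical register; issue must wait")
        old = self.mapping[arch_reg]
        new = self.free.pop(0)
        self.mapping[arch_reg] = new
        return old, new

    def release(self, phys_reg):
        """Return a position to the pool once its value is no longer needed."""
        self.free.append(phys_reg)


ra = RegisterArray()
old_slot, new_slot = ra.rename(3)    # GPR 3 gets a new physical home
ra.release(old_slot)                 # the old position may later serve any GPR
_, next_slot = ra.rename(5)
```

Because a new value of a register lands in a different physical position, an older value remains readable until it is released, which is what lets instructions complete out of order and lets a mispredicted branch be undone precisely.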

The register management system does not depend on the overall computer architecture and may be implemented in a variety of environments, as it is in current microprocessor designs. Accordingly, the computer system 10 shown in Figures 1a and 1b, whether a mainframe processor or a microprocessor, has a main memory 12 connected to a cache memory system 14. The cache memory system 14 may be organized in any suitable manner, but in this example it is shown as an instruction cache 16 and a data cache 18, both connected to the main memory 12, that separately handle instruction operations and data operations, respectively. A hierarchical memory design, which provides more than one level of cache memory in a cascaded arrangement to give the advantages of both memory speed and memory size, is not shown in Figures 1a and 1b but, as shown in Figure 2, is consistent with the present invention.

As shown in Figures 1a and 1b, instructions pass from the instruction cache 16 through the instruction buffer unit 20 to the instruction register unit 22. For purposes of illustration, the instruction register unit 22 has one or more separate instruction registers; the preferred number of such instruction registers is two, three, or four.

The general-purpose units that function as execution units may be designed according to the type of function performed: arithmetic or logical, scalar or vector, scalar or floating point, and so on. Because any such arrangement of general-purpose units makes use of general-purpose registers (GPRs), the invention is applicable to many variations in the number, functional arrangement, and design of the general-purpose execution units in a computer.

For purposes of illustration, the Liptay system is shown with general-purpose execution units (GPE) 1 and 2, labeled 24 and 26, respectively. General-purpose execution unit 24 has its output connected to the storage buffer unit 28, whose output is in turn connected to the data cache 18. General-purpose execution unit 24 may be a single execution unit or, as shown in this embodiment, a combination of units; it produces results that go to the storage buffer 28, where they are held until instruction completion and may then be stored to memory. General-purpose execution unit 26 has its output connected to the general-purpose register array (RA) 30 of the invention. It executes instructions whose results must be available in registers rather than stored immediately. An instruction stack or queue 31 receives instructions from the instruction register unit 22 and directs them to GPE 24 or GPE 26 as appropriate. Multiple execution units of various types may be used with a single RA and register management system. RA 30 contains 32 dynamically assignable real (physical) registers that perform the functions of the 16 GPRs recognized by the architecture.
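The role of the storage buffer, which holds store results until the producing instruction completes, can be sketched as follows. The class, its methods, and the use of an instruction identifier as the lookup key are assumptions made for illustration; the description above states only that results are held until instruction completion and may then be stored to memory.

```python
class StoreBuffer:
    """Toy storage buffer: hold store results until their instruction completes."""

    def __init__(self):
        self.pending = {}            # instruction id -> (address, value)

    def hold(self, instr_id, address, value):
        # Result produced by the execution unit, not yet visible in memory.
        self.pending[instr_id] = (address, value)

    def complete(self, instr_id, memory):
        # Instruction completed: the buffered result may now reach memory.
        address, value = self.pending.pop(instr_id)
        memory[address] = value

    def discard(self, instr_id):
        # Instruction cancelled (assumed case, e.g. a wrong branch path).
        self.pending.pop(instr_id, None)


memory = {}
buf = StoreBuffer()
buf.hold(1, 0x200, 42)
buf.hold(2, 0x208, 43)
visible_before = dict(memory)        # nothing has reached memory yet
buf.complete(1, memory)
buf.discard(2)
```

Holding stores this way keeps memory consistent with the completed instruction stream even when execution runs ahead of completion.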

RA 30 is controlled by, and provides status information to, the register management system (RMS) 32 over a control bus 34. The RMS 32 is connected to several other systems in order to receive and provide various kinds of status information. An interrupt control element 36 is connected to the instruction registers 22, the RMS 32, and the RA 30 to handle interrupts properly and preserve the required status information.

The RMS 32 is connected to the instruction register unit 22 and to GPEs 24 and 26 so that it can follow instructions from issuance through execution and assign registers for input and output operands.

The computer of Figures 1a and 1b has an instruction queue 50 connected to receive instructions from the instruction register unit 22, with an output to an instruction address computation element (I-ACE) 52. The I-ACE 52 is also connected to receive input directly from the RA 30 and has an output connected to the instruction cache 16. The instruction queue 50 is connected to the RMS 32 to provide status information.

The computer of Figures 1a and 1b has an address queue 60 connected to receive output from the instruction register unit 22. The output of the address queue 60 is connected as an input to a data address computation element (D-ACE) 62. The D-ACE 62 is connected to the RMS 32 to provide status information.

The output of the D-ACE 62 is connected to an address fetch queue 64, which has a first output connected as an input to the data cache 18 and a second output connected as an input to an address store queue 66. The address store queue 66 has an output connected to the data cache 18 and is connected to the RMS 32 to provide status information.

The computer has a floating point arithmetic unit 70, which is also connected to the RMS 32 to provide status information. As described below, it should be noted that the RMS 32 can work with units and registers that are not associated with the RA 30. For example, one RMS may work with more than one register array. More specifically, one RMS may control two RAs, which may in turn be connected to multiple execution elements of the same or different types.

Inputs to the floating point unit (FPU) 70 are provided by a floating point instruction queue 72 and floating point data registers 74. The floating point instruction queue 72 receives its input from the instruction registers 22. The floating point data registers 74 receive input from the data cache 18 and from the FPU 70. The output of the FPU 70 is connected to a storage buffer unit 76, whose output is connected as an input to the data cache 18.

Turning to the invention in more detail, the system described here can be used effectively where large caches and multiple levels of cache can be provided, as shown in Figure 2. The invention improves the performance of existing caches, and its speculative fetching improves the miss rate at each cache level. In many cases the overall performance gain should be measured against the gain that could be obtained by enlarging the on-chip caches by the silicon area of the SFE. One case where this comparison is not necessarily valid is the L1 cache, because for the L1 cache the cycle-time constraint, not area, is usually the decisive limit. Preliminary results show that an SFE roughly 1/4 to 1/2 the size of an on-chip secondary cache can deliver a performance improvement of 15 to 20%.

Figure 2 - Preferred Embodiment

도 2에 도시된 본 발명의 바람직한 실시예와 같이, 구성 요소들의 상호 연결은 uPCore(200)와 동기유닛(SU 201) 사이, SFE와 명령어 캐쉬 및 데이터 캐시(203) 사이의 인터페이스와 같은 다양한 인터페이스에 의해 제공된다. 캐시 메모리 시스템은 적절한 임의의 방법으로 구성될 수 있지만, 본 실시예에서는 캐시 메모리에 하나 이상의 레벨을 갖는 캐시 메모리(예를 들어 203', ⃛ ,203'')를 제공하여 종속 배열에 있어서 메모리 속도 및 메모리 사이즈의 장점을 모두 제공하는 상기 계층적 메모리의 주 메모리(204)에 연결되는 조합된 명령어 및 데이터 캐시(203)로 도시되어 있으며, 이러한 계층적 메모리 설계는 본 발명과 모순되지 않는다. 분할 명령어 캐쉬 및 데이터 캐시(split instruction and data cache) 또한 본 발명과 모순되지 않는다.As in the preferred embodiment of the present invention shown in FIG. 2, the interconnection of the components is a variety of interfaces, such as the interface between the uPCore 200 and the synchronization unit (SU 201), between the SFE and the instruction cache and data cache 203. Provided by The cache memory system may be configured in any suitable manner, but in the present embodiment, cache memory having one or more levels in the cache memory (for example, 203 ', ⃛ 203 '') is shown as a combined instruction and data cache 203 coupled to the main memory 204 of the hierarchical memory providing both memory speed and memory size advantages in a dependent arrangement, This hierarchical memory design does not contradict the present invention. Split instruction and data caches are also inconsistent with the present invention.

Any number of uPCores (200, ..., 200', ..., 200'') can be used together with any number of SFEs (202, ..., 202', ..., 202''). An SFE can be associated with a single uPCore at any given time, and can change its association to any other uPCore only after a synchronization function has been performed. Each SFE is associated with one store buffer and one SU. For example, 201', 202', and 205' are used together to provide the required SFE functions. Any number of SFEs can be associated with a single uPCore at the same time. The preferred embodiment has a single SFE and a plurality of uPCores.

Before the detailed hardware description of the preferred embodiment, note that FIG. 2 also admits an alternative, generalized embodiment in which the uPCores can operate differently. The generalized FIG. 2 performs the same functions as shown and described here; exactly the same functions as described in detail below are performed even when architectural control alternates between 200, 200', 200'' and 202, 202', 202''.

The preferred embodiment is thus a specific preferred example of the alternative generalized embodiment, in which the first, conventional processing elements (uPCores 200, 200', 200'') and the second processing elements (SFEs 202, 202', 202'') act in concert while alternately holding control of the architectural state of the machine. In the preferred embodiment shown in FIG. 2, the first processing element holds control of the architectural state and processes most instructions of the sequential instruction stream in order. In general, then, a method of processing a sequential instruction stream in a computer system having first and second processing elements, each with its own state determined by the settings of its general purpose registers and control registers, begins by directing an initial instruction of the stream to the first of the processing elements (for example 200). Processing of the stream continues on the first processing element, which forwards any changes in the architectural state of the computer system to the second processing element. However, if at any point during processing of the stream by the first processing element (for example uPCore 200) it becomes beneficial for the second processing element (for example SFE 202) to begin continued processing of the same sequential instruction stream, the second processing element restores the forwarded state and begins continued processing of the same stream.

The second processing element then forwards to the first processing element any changes in the architectural state of the computer system that the first processing element requires.

In both the alternating-control alternative embodiment and the preferred embodiment, the first and second processing elements may be executing the very same instruction, but only one of them is permitted to change the overall architectural state of the computer system, which is determined by a combination of the states of the first and second processing elements. In the preferred embodiment this combination is determined by the first processing element. Although in the alternative embodiment the architectural state of the system may be determined wholly or partly by the state of the second processing element, in the preferred embodiment the actions of the second processing element (the SFE) do not change the architectural state of the system. In the preferred embodiment the uPCore pipeline handles most or all of the sequential instructions in order, while the SFE preprocesses instructions to prime the caches shared by the uPCore and the SFE, resynchronizing as often as possible with the uPCore, which holds control of the architectural state. While the SFE preprocesses instructions, its results are stored in a separate private store buffer, and the finite cache penalty is reduced.

In the preferred embodiment, control of the architectural state is never switched; in the alternating-control embodiment, it is.

In the generalized method, each of the first and second processing elements has a state determined by the settings of its own general purpose registers and control registers, and both may execute the same instruction while processing the sequential instruction stream. Only one of them, however, is permitted to change the overall architectural state of the computer system, which is determined by some combination of the states of the two processing elements, and the element that controls the architectural state can change from the first to the second and from the second back to the first. The process starts with the first processing element processing the sequential instruction stream while forwarding to the second processing element any changes in the architectural state of the computer system that the second processing element requires; the second processing element accumulates these forwarded changes for use as its architectural state at some future time. If, at any point during processing of the stream by the first processing element, it is determined that it would be beneficial for the second processing element to take over continued processing of the same stream, the second processing element restores the accumulated architectural state previously forwarded from the first processing element and takes over continued processing of the stream. While the second processing element controls processing of the stream, it forwards to the first processing element any changes in the architectural state that the first processing element requires, to be accumulated and used as architectural state at some future time. Control can then change again: if, at any point during processing of the stream by the second processing element, it becomes beneficial for the first processing element to resume control and take over continued processing of the same stream, the first processing element restores the accumulated architectural state previously forwarded from the second processing element and takes over continued processing of the stream.
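The alternating-control method described above can be sketched in a few lines of Python. This is only an illustrative model, not the patented hardware: the class names `ProcessingElement` and `AlternatingPair`, the dictionary used to stand in for the general purpose registers, and the `hand_over` trigger are all assumptions made for the example. The sketch shows the two points the text makes: the controlling element forwards every architectural change to the other element, which accumulates it, and the other element restores the accumulated changes at the moment it takes over.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessingElement:
    """Hypothetical model of one processing element (uPCore or SFE)."""
    name: str
    gprs: dict = field(default_factory=dict)     # state this element executes with
    pending: dict = field(default_factory=dict)  # forwarded changes, not yet restored

    def restore(self):
        """Apply the changes accumulated while the other element had control."""
        self.gprs.update(self.pending)
        self.pending.clear()


class AlternatingPair:
    """Two elements; only the current owner changes architectural state."""

    def __init__(self, first, second):
        self.elements = [first, second]
        self.owner = 0  # index of the element controlling architectural state

    def execute(self, reg, value):
        owner = self.elements[self.owner]
        other = self.elements[1 - self.owner]
        owner.gprs[reg] = value     # owner changes the architectural state
        other.pending[reg] = value  # and forwards the change for future use

    def hand_over(self):
        """Called when it is judged beneficial for the other element to take over."""
        new_owner = self.elements[1 - self.owner]
        new_owner.restore()
        self.owner = 1 - self.owner
```

In this model a hand-over costs only the replay of the accumulated changes, which is why the forwarding has to run continuously while the other element is in control.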

The first and second processing elements can function as a multiprocessor. As illustrated by 200, 200', and 200'', the first processor can comprise a plurality of first processing elements that function as a multiprocessor with a single SFE or with multiple SFEs. Multiple SFEs, however, are not used with a single uPCore. That is, a multiprocessor can function with a combination of a set of one or more first processing elements and at least one second processing element. In the preferred embodiment, a synchronization capability in the form of one synchronization unit SU (201, 201', 201'') is provided for each second processing element. The SU determines when the second processing element SFE (202, 202', 202'') begins processing the same instruction that the first processing element uPCore is processing. One synchronization unit is therefore provided for each SFE, and the SU determines when the SFE begins processing the same instruction, or the next instruction, of the stream being processed by the uPCore. The SU also determines when instruction processing by the SFE should be stopped or ignored. This determination is made through a computed benefit determination for the entire computer system, with inputs provided to the synchronization unit from the first and second processing elements. The inputs can be provided to the synchronization unit immediately, or from information stored in the system, as in FIG. 4, where the counters (407, 408) provide the information.

When the first processing element is determined to have stalled while processing instructions, as at step 709 of FIG. 7, the synchronization unit determines when the second processing element begins processing the same instruction that is being processed. When the stream contains an operation that the second processing element is not designed to handle, that is, when no valid op is available (707), the synchronization unit determines when to resynchronize the state of the second processing element with the architectural state of the computer system, which it does by resynchronizing the SFE and uPCore states. When the second processing element is determined to be providing no benefit to the computer system in processing the instruction stream (specific benefit determination 708), the synchronization unit likewise determines when to resynchronize the state of the second processing element with the architectural state of the computer system. All of the determinations illustrated at 707, 708, and 709 of FIG. 7 decide not only when the synchronization unit performs a resynchronization but also with which processing element the state is resynchronized. The processor that preprocesses instructions (the SFE) stores its results in its own private general purpose registers or in the store buffer (205, 205', or 205'') coupled to it. Because these stores do not affect the architectural state of the other processing element, this separate synchronization lets the SFE improve the performance of the processor that handles most of the instructions of the sequential stream. The SFE can process the same or the next instruction of the stream being processed by the first processing element, and the SU can determine when instruction processing by the second processing element should be stopped or ignored. The first processing element fetches data from the data cache and instruction cache, which both the first and second processing elements share for fetching.

The method of the preferred embodiment allows the SFE to be used to preprocess the sequential instruction stream in order to prime the caches for the first processing element, handling this preprocessing as an out-of-order processor. During resynchronization, and whenever instruction processing by the second processing element is to be stopped or ignored, the second processing element purges all results and partial results of its preprocessing of the instruction stream for the first processing element before it resynchronizes.

In the preferred embodiment, therefore, the SFE, the synchronization unit, and two uPCores (standing for a plurality), together with the private store buffer 205 for the SFE 202, are used in the method described above and illustrated in FIG. 7. The synchronization unit 201 holds the state of the SFE 202, as shown in FIG. 7. The allowable states are running (A), purging (B), resynchronizing the SFE with uPCore 200 (C), and resynchronizing the SFE with uPCore 200' (D). The initial SFE state is C. In state C the SFE receives the most recently retired instruction address from uPCore 200 in preparation for beginning out-of-order execution at that address. The synchronization unit 201 continuously monitors its interface to each uPCore functioning with the SFE for an indication that the uPCore has stalled on a cache miss. The uPCore is running and continually references cache storage and main storage over interface 210. Instruction and operand data are returned to the uPCore over the interface from the instruction and data cache 203.

The state changes from resynchronizing to SFE running (state A) when the register management system of the SFE has loaded the contents of the SRAL associated with the uPCore into the DRAL of the SFE. On entering the running state, the SFE begins fetching and executing instructions at the instruction address most recently received from the uPCore over interface 206. The GPR state of the SFE then reflects the same state that the uPCore had when the instruction at that address retired. While the SFE is running, GPR results received over interface 206 continue to be written into the general register array, but the register management system associates them with the Synchronization Register Assignment List (SRAL). They are used by instructions executing in the SFE only after a synchronization event. In this way the SFE maintains a separate image of the GPR state of each uPCore associated with it, which it can access later. Meanwhile, the RMS of the SFE uses only SFE execution results to update the image of the GPRs used for SFE execution of the instruction stream.
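The relationship between the SRAL and the DRAL described above can be illustrated with a small Python model. It is a sketch under stated assumptions, not the register management hardware: the class name `SfeRegisterImage`, the method names, and the ever-growing slot counter (real hardware recycles a fixed set of register array positions) are all hypothetical. What it shows is that uPCore results and SFE results land in the same register array, that they are tracked by different lists, and that uPCore results become visible to SFE instructions only when the SRAL is copied into the DRAL at synchronization.

```python
class SfeRegisterImage:
    """Hypothetical model of one SRAL and the DRAL sharing a register array."""

    def __init__(self):
        self.register_array = {}  # RA position -> value
        self.sral = {}            # GPR number -> RA position (image of uPCore state)
        self.dral = {}            # GPR number -> RA position (used by SFE execution)
        self._next_slot = 0       # simplification: positions are never reused

    def _allocate(self, value):
        slot = self._next_slot
        self._next_slot += 1
        self.register_array[slot] = value
        return slot

    def upcore_update(self, gpr, value):
        """A GPR result arriving from the uPCore is tracked by the SRAL only."""
        self.sral[gpr] = self._allocate(value)

    def sfe_write(self, gpr, value):
        """A result of SFE execution is tracked by the DRAL only."""
        self.dral[gpr] = self._allocate(value)

    def sfe_read(self, gpr):
        """SFE instructions look up their operands through the DRAL."""
        return self.register_array[self.dral[gpr]]

    def synchronize(self):
        """Resynchronization: load the SRAL contents into the DRAL."""
        self.dral = dict(self.sral)
```

Keeping the two lists separate is what lets the SFE run ahead speculatively and still return to an exact copy of the uPCore register state whenever it is resynchronized.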

Immediately after the SFE enters the running state it begins executing instructions out of order, while the uPCore continues execution at its own pace, fetching its instructions from the cache 203. That cache holds the storage references of the speculative engine processing element SFE, which are supplied to the instruction and operand cache storage 203 ahead of their use by the conventional uPCore processing element. In the preferred embodiment the uPCore can be designed exclusively as an in-order processor, as a processor optimized for in-order processing, or as one able to handle instruction processing in which substantially less than 95% of all instructions will not benefit from prediction. It can therefore experience a pipeline stall on an L1 cache miss. Because the SFE can execute out of order, it can continue past the instruction that caused the stall. While it is running, the SFE generates fetch references that are sent to the instruction and data caches over interface 207 and to the store buffer over interface 208. A cache miss occurs if neither the caches nor the store buffer holds the desired data. Instructions and operands are returned to the SFE over interface 207 if the store buffer has no relevant entry, and over interface 208 if it does. In this way the results of store instructions executed by the SFE can be used by subsequent instructions executing on the SFE without changing the architectural state of the uPCore and the caches. All SFE stores are held in the store buffer.
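The fetch and store routing just described can be sketched as follows. This is a minimal illustration with assumed names (`SfeMemoryView`, `fetch`, `store`, and the source labels are not from the patent), and it models the caches and the store buffer as plain dictionaries keyed by address. It shows that SFE stores go only to the private store buffer, that a fetch is satisfied from the store buffer when it has a relevant entry and from the shared cache otherwise, and that a miss occurs only when neither holds the data.

```python
class SfeMemoryView:
    """Hypothetical model of how the SFE sees storage."""

    def __init__(self, shared_cache):
        self.shared_cache = shared_cache  # dict: address -> value, shared with the uPCore
        self.store_buffer = {}            # private to the SFE

    def store(self, addr, value):
        """SFE stores never reach the shared cache."""
        self.store_buffer[addr] = value

    def fetch(self, addr):
        """Return (value, source) for an SFE fetch reference."""
        if addr in self.store_buffer:
            return self.store_buffer[addr], "store_buffer"  # returned over interface 208
        if addr in self.shared_cache:
            return self.shared_cache[addr], "cache"         # returned over interface 207
        return None, "miss"                                 # neither holds the data
```

Because the shared dictionary is never written by `store`, anything the uPCore later reads from the cache is unaffected by speculative SFE results.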

The synchronization unit monitors the activity of the SFE over interface 209. If the SFE runs out of supported instructions to execute, or encounters an interruption or exception that it is not designed to handle or that is otherwise not valid, this is indicated on interface 209. The synchronization unit then puts the SFE into the purge state (B) of FIG. 7. The synchronization unit also monitors instruction decode in the uPCore and instruction retirement in the SFE. If there are no more valid ops (707), or if the SFE is determined to be providing no speculative prefetch benefit (708), the SFE is assumed to have fallen too far behind the uPCore execution, and in this case it also enters the purge state (B). If the uPCore currently associated with the SFE is still stalled at decision point 709, entry into the purge state is blocked and the SFE remains in the running state. Many other indications of SFE benefit could be used to determine when the SFE should enter the purge state, and they are consistent with the invention.
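One possible reading of the three decision points can be written as a single predicate. The sketch below is an assumption-laden illustration: the function name and its parameters are hypothetical, and the ordering of the tests is a guess, since the text does not say whether the stall check at 709 also gates the valid-op check at 707. Here it is assumed that an SFE with no valid op cannot keep running and is always purged, while an SFE that merely shows no prefetch benefit is kept running as long as its uPCore is still stalled.

```python
def should_enter_purge(has_valid_op, provides_prefetch_benefit, upcore_stalled):
    """Return True if the synchronization unit should move the SFE to purge (B).

    Assumed ordering of the FIG. 7 decision points:
      707: is there still a valid op for the SFE to execute?
      708: is the SFE providing speculative prefetch benefit?
      709: is the associated uPCore still stalled?
    """
    if not has_valid_op:
        return True            # 707: nothing the SFE can execute
    if provides_prefetch_benefit:
        return False           # 708: SFE is still useful, keep running
    return not upcore_stalled  # 709: a stalled uPCore blocks the purge
```

Other orderings of the same three tests are equally compatible with the description, and the text notes that many other benefit indications could be substituted.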

Once the SFE enters the purge state (B), it remains there until all instructions, partial instructions, and partial results have been cleared from the data paths and control structures of the SFE. No requests are sent to the instruction and data caches during this time. When this condition is met (706), the SFE can leave the purge state and move to either state C or state D; that is, it can resynchronize with either uPCore 200 or uPCore 200'. The choice between these two actions (704), made by the SFE, can be based on a variety of factors, all of which are consistent with the invention. The preferred embodiment uses a simple indication of which uPCore was most recently synchronized with the SFE, and the SFE then synchronizes with the other uPCore. With other algorithms, the same uPCore may be chosen several times in succession at decision point 704. Once resynchronization is complete, the state changes back to running and the cycle begins again.
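The cycle through states A, B, C, and D can be summarized as a small state machine. The following Python sketch is illustrative only: the names `SfeState` and `SyncUnit` and the three boolean event flags are assumptions, and the selection at 704 implements only the simple policy of the preferred embodiment (resynchronize with the uPCore that was not synchronized most recently).

```python
from enum import Enum


class SfeState(Enum):
    RUNNING = "A"
    PURGING = "B"
    RESYNC_UPCORE_200 = "C"
    RESYNC_UPCORE_200_PRIME = "D"


class SyncUnit:
    """Hypothetical model of the SFE state held by the synchronization unit."""

    def __init__(self):
        self.state = SfeState.RESYNC_UPCORE_200  # the initial SFE state is C
        self.last_synced = None

    def step(self, sral_loaded=False, purge_requested=False, purge_done=False):
        resync_states = (SfeState.RESYNC_UPCORE_200, SfeState.RESYNC_UPCORE_200_PRIME)
        if self.state in resync_states:
            if sral_loaded:  # SRAL contents loaded into the DRAL
                self.last_synced = self.state
                self.state = SfeState.RUNNING
        elif self.state is SfeState.RUNNING:
            if purge_requested:
                self.state = SfeState.PURGING
        elif self.state is SfeState.PURGING:
            if purge_done:  # 706: data paths and control structures are clear
                # 704: pick the uPCore that was not synchronized most recently
                if self.last_synced is SfeState.RESYNC_UPCORE_200:
                    self.state = SfeState.RESYNC_UPCORE_200_PRIME
                else:
                    self.state = SfeState.RESYNC_UPCORE_200
        return self.state
```

Driving the model with the event sequence load, purge request, purge done repeatedly makes the SFE alternate between the two uPCores, which is the round-robin behaviour the preferred embodiment describes.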

Speculative Fetch Engine (SFE)

The SFE uses conventional out-of-order processing, together with certain functions or techniques known as superscalar techniques, to generate speculative operand and instruction fetches. These techniques include register renaming, instruction reordering, completion scoreboarding, and the like. There is a wide variety of possible implementations of the SFE. The criteria for an optimal design include cycle time and area constraints that differ greatly from those of current-generation out-of-order designs. FIG. 3 shows in detail the SFE and its interfaces to the other elements of the system. This greatly simplified diagram is meant to highlight the interaction of the new register management system (RMS) with the general register array and the instruction processing pipeline. It is similar to FIGS. 1A and 1B, with some important differences. First, there is an additional interface 306, which forms part of the interface 206 between the GPRs and the uPCore and is used to carry updates of the uPCore GPRs to the SFE. Second, the RMS 301 of the invention has been modified to include the use of Synchronization Register Assignment Lists (SRALs). Third, stores to the memory hierarchy are sent to the store buffer 205 rather than to the instruction and data caches as in U.S. Pat. No. 4,901,233 to Liptay et al. In the SFE, data continues to flow in the manner shown in FIGS. 1A and 1B, taken from U.S. Pat. No. 4,901,233, and on to the store buffer 205, as illustrated in FIG. 3.

Interfaces 302, 303, 304, and 305 form part of interface 209 and carry, respectively, the synchronization address, the purge indicator, the resynchronization indication, and the instruction decoded indication. The SFE uses the synchronization address as the starting point for instruction fetch and execution immediately after it has been resynchronized with the architectural state of a uPCore. The purge indication causes the SFE to discard all instruction results and partial results and to purge the contents of its store buffer. The SFE uses the resynchronization indication to determine which uPCore it should synchronize with and when the resynchronization should take place. The SFE uses the instruction completed interface to indicate to the SU that an instruction has been successfully decoded. The SU uses this information to determine whether the SFE is providing speculative fetch benefit. The SFE sends instruction and operand fetch requests to the instruction and data caches over interface 307 and to the store buffer over interface 308. The speculative fetches sent over interface 307 are made by the SFE ahead of the point at which the uPCore, resuming execution after a stall, makes the same fetch requests. The uPCore thus sees improved latency on these fetch requests, because the desired lines have been recently accessed and installed in the closest level of cache.
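The cache-priming effect can be made concrete with a toy model. The sketch below is not a performance model of the invention: the function name `count_misses`, the unbounded cache (a Python set), and the assumption that the SFE installs exactly `lookahead` further lines during every uPCore stall are simplifications chosen for illustration. It counts how many misses the uPCore itself takes on a trace of cache line addresses, with and without an SFE running ahead.

```python
def count_misses(trace, cache, lookahead=0):
    """Replay a trace of cache line addresses as seen by the uPCore.

    trace:     list of line addresses in program order
    cache:     set of lines currently installed (assumed unbounded)
    lookahead: lines the SFE is assumed to fetch ahead during each stall
    """
    misses = 0
    for i, line in enumerate(trace):
        if line not in cache:
            misses += 1
            cache.add(line)
            # While the uPCore is stalled, the SFE runs past the stalled
            # instruction and its speculative fetches install later lines.
            cache.update(trace[i + 1 : i + 1 + lookahead])
    return misses
```

With the trace `[1, 2, 3, 4, 5, 6]`, the model gives six uPCore misses without an SFE and two misses with `lookahead=2`, because the lines fetched during each stall are already installed when the uPCore asks for them.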

Because the SFE is independent of the architectural state of the uPCore, its implementation of out-of-order instruction processing is free of many architectural concerns. This improves the schedule and reduces the impact on the cycle time of the overall design. The implementation risks associated with the SFE are now completely decoupled from the uPCore. The SFE can be optimized for generating speculative fetches in ways that are not possible for a uPCore, which must meet the needs of a large and varied instruction set. The SFE need not implement any infrequently used instructions, exception handling operations, or recovery algorithms. When it encounters any of these infrequent events, the SFE stops executing the instruction stream and indicates this to the synchronization unit. The uPCore eventually leaves its stalled state and, if the infrequent event persists, handles it with the much simpler approach of an in-order design.

The SFE design should be optimized to decode and issue a large number of instructions quickly, and not necessarily for infinite CPI alone. Compared with prior designs, the SFE can be designed with a longer instruction pipeline without much concern for the effect on infinite L1 cache performance. With both an SFE and a uPCore, the infinite L1 cache performance of the overall system depends only on the uPCore pipeline, not on the SFE.

With the design of the invention, operand prefetching need not be performed by the uPCore; where desired, use of the SFE system removes this feature, and the complexity associated with it, from the uPCore. There may be cases in which operand prefetching needs to be retained in the uPCore, and this is consistent with the invention.

Details of the inventive changes to the RMS are illustrated in FIG. 5, in which, in accordance with the preferred embodiment, the SFE maintains a Synchronization Register Assignment List (SRAL) for each uPCore associated with it. The RMS of the present invention, including the extension for use of the SRAL, does not depend on the overall computer architecture and may be implemented in a variety of environments. Thus, without limiting the scope of the invention, the SFE shown in FIG. 3 is described in accordance with the IBM System/390 architecture, which has 16 general purpose registers (GPRs). The GPR register array (RA), together with the RMS, dynamically assigns particular RA positions to fulfill the functions of the architected registers. When the function of a particular register is complete, its position in the RA is released and may in due course be reassigned as the same or another GPR.

In this embodiment the RA contains 48 dynamically assignable real (physical) registers to fulfill the functions of the 16 GPRs recognized by the architecture. A decode register assignment list (DRAL) is used when instructions are decoded to translate GPR assignments into RA assignments. As each instruction is decoded, the GPRs it references are looked up in the DRAL to determine which RA positions are assigned to them, and when a new RA position is assigned to receive a result, the DRAL is updated to reflect the assignment. In this way, each instruction that uses a GPR is directed by the DRAL to the RA position assigned to the most recent instruction that referenced that GPR.
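The decode-time lookup and update of the DRAL described above can be sketched as follows. This is a minimal illustration, not the patented circuit: the class name `RenameSketch`, the initial one-to-one mapping, and the simple free list are assumptions introduced here; only the sizes (16 GPRs, 48 RA positions) come from the text. Release of old RA positions, which the text assigns to the ACL, is omitted.

```python
NUM_GPRS = 16  # architected GPRs (from the text)
NUM_RA = 48    # physical registers in the RA (from the text)


class RenameSketch:
    """Hypothetical model of DRAL lookup and update at decode time."""

    def __init__(self):
        # Assumption: GPR i initially maps to RA position i; the rest are free.
        self.dral = list(range(NUM_GPRS))
        self.free = list(range(NUM_GPRS, NUM_RA))

    def decode(self, sources, dest):
        """Return (source RA positions, newly assigned destination RA position).

        Raises IndexError if no RA position is free; a real design would
        hold decode until a position is released.
        """
        # Look up sources first, so an instruction that reads and writes
        # the same GPR sees the previous assignment.
        src_ra = [self.dral[g] for g in sources]
        new_ra = self.free.pop(0)   # assign a fresh RA position for the result
        self.dral[dest] = new_ra    # later readers of dest are directed here
        return src_ra, new_ra
```

For example, decoding an instruction that reads GPR 1 and GPR 2 and writes GPR 1 returns the old positions for both sources and a fresh position for the result; a following instruction that reads GPR 1 is directed to that fresh position.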

The back-up register assignment lists (BRALs) allow one, two, or three conditional branches, respectively, to be processed without waiting. Each has the same structure as the DRAL and is connected to it so that the entire contents of the DRAL can be copied into the BRAL, or vice versa, in one cycle. These transfers are controlled by logic unit 505. A BRAL is used, for example, when a conditional branch is encountered, to save the contents of the DRAL in case the guess as to whether the branch is taken is later found to be wrong.
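The save-and-restore role of the BRAL can be illustrated with a short sketch. The class and method names are hypothetical, and the policy of resolving the oldest outstanding branch first and discarding younger snapshots on a wrong guess is an assumption made for the example; the text states only that up to three conditional branches can be handled without waiting and that the DRAL contents are copied in a single cycle.

```python
class BranchBackup:
    """Hypothetical model of DRAL snapshots kept in up to three BRALs."""

    def __init__(self, dral, max_branches=3):
        self.dral = list(dral)
        self.brals = []                 # saved DRAL copies, oldest first
        self.max_branches = max_branches

    def on_conditional_branch(self):
        """Snapshot the DRAL; return False if decode would have to wait."""
        if len(self.brals) == self.max_branches:
            return False
        self.brals.append(list(self.dral))  # whole-list copy, one "cycle"
        return True

    def on_branch_resolved(self, guess_correct):
        """Resolve the oldest outstanding branch (assumed ordering)."""
        saved = self.brals.pop(0)
        if not guess_correct:
            self.dral = saved       # restore the mapping from before the branch
            self.brals.clear()      # younger snapshots were on the wrong path
```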

The array control list (ACL) is connected to receive status information from, and send control information to, the RA and the rest of the SFE. Logic unit 505 controls the ACL and coordinates the operation of the ACL, DRAL, and BRAL. For each RA position that supports the GPRs there is an ACL register that holds status information related to that position; there is one entry for each register position in the array.

The addition of the SRAL to the RMS is critical to the function of the SFE and therefore to the present invention. The SRAL has the same structure as the DRAL and is connected to it so that the entire contents of the SRAL can be copied into the DRAL in one cycle.

One SRAL is provided for each uPCore associated with the SFE. As a uPCore generates GPR and CR updates, they are forwarded to the SFE over interface 206. The results may be delayed by one cycle to minimize the cycle time impact on the uPCore. The GPR updates are written into the RA, and the SRAL associated with the source uPCore is updated to point to the RA position. Because the uPCore in this embodiment normally functions as an in-order execution design, the GPR updates on interface 206 reflect GPR updates for retired instructions and can therefore always be written to the same RA position that the SRAL currently indicates. During a resynchronization operation, the SRAL must be supplied with 16 new RA entries to ensure that continued updates from the uPCore can be accepted. In the current embodiment this is not a problem, because a resynchronization operation is always preceded by an SFE purge that releases all RA entries other than those associated with the SRAL. The copy of the uPCore GPR state held by the SFE in the SRAL always lags by at least one cycle. When the SFE needs to be synchronized with the uPCore, a simple move of the SRAL contents into the DRAL accomplishes the task. This operation is similar to Liptay's use of the BRAL to restore the microprocessor state after a mispredicted branch.
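The SRAL bookkeeping in this paragraph can be sketched in a few lines. This is an illustrative model under stated assumptions: the class `SyncSketch`, its field names, and in particular the step that copies the current values into the 16 newly assigned RA positions are inventions of this example (the text says only that the SRAL must be given 16 new RA entries). The purge-before-resynchronize order and the SRAL-to-DRAL move follow the text.

```python
NUM_GPRS = 16  # architected GPRs (from the text)
NUM_RA = 48    # physical registers in the RA (from the text)


class SyncSketch:
    """Hypothetical model of SRAL updates and resynchronization."""

    def __init__(self):
        self.ra = [0] * NUM_RA
        self.sral = list(range(NUM_GPRS))   # RA positions holding uPCore state
        self.dral = list(range(NUM_GPRS))   # SFE's speculative mapping
        self.free = list(range(NUM_GPRS, NUM_RA))

    def upcore_gpr_update(self, gpr, value):
        # Retired, in-order update: overwrite the RA position the SRAL indicates.
        self.ra[self.sral[gpr]] = value

    def resynchronize(self):
        # Purge: release every RA position not referenced by the SRAL.
        held = set(self.sral)
        self.free = [p for p in range(NUM_RA) if p not in held]
        # Synchronize: move the SRAL contents into the DRAL.
        self.dral = list(self.sral)
        # Supply the SRAL with 16 new RA positions. Copying the values is an
        # assumption that keeps this model self-consistent, so later uPCore
        # updates do not disturb the state the SFE just adopted.
        new = [self.free.pop(0) for _ in range(NUM_GPRS)]
        for g in range(NUM_GPRS):
            self.ra[new[g]] = self.ra[self.sral[g]]
        self.sral = new
```

After `resynchronize()`, the DRAL points at the positions that held the uPCore state at that moment, while further uPCore updates land in the separate positions now named by the SRAL.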

The function of the SRAL in the present invention differs greatly from that of Liptay's BRAL. First, the SRAL is written with GPR updates that come from a different instruction processing pipeline, namely that of the uPCore.

Second, the trigger that causes the SRAL contents to be moved into the DRAL is very different from the trigger that causes Liptay's BRAL contents to be moved into the DRAL. In Liptay, a mispredicted branch is the trigger. In the present invention, an indication of no prefetch benefit is used as the trigger; it will therefore be understood that US Patent 4,901,233 and its commercial embodiments are entirely distinct from the present invention with respect to the function of the SRAL. The BRAL cannot be used for this purpose; in the present invention it is used for the same function for which Liptay introduced it, namely restoring the processor state after a branch guess direction has been determined to be wrong. A third important distinction is that when the SRAL contents are moved into the DRAL, every entry in the SRAL is immediately changed to point to 16 new RA positions. In Liptay, the BRAL is loaded directly from the DRAL when an unresolved branch is decoded.

More than one SRAL can be used to allow the SFE to synchronize to more than one uPCore. Two or more uPCores can use the same SFE to obtain a prefetch benefit, but they cannot use the SFE at the same time. Each additional SRAL must be accompanied by an associated uPCore GPR result bus for synchronization and an associated store buffer.

uPCore

The uPCore design of the preferred embodiment is a conventional microprocessor (a superscalar design such as the PowerPC 601 marketed by Motorola and IBM is preferred, but an older design such as the Intel 286 is also possible). It is well known in the computer design art for a system to have more than one general purpose execution unit. For example, the general purpose execution units may be designed along the lines of the type of function performed. Although only two general purpose execution units are shown within the uPCore, the use of any number of general purpose execution units is consistent with the present invention. The uPCore portion of the invention requires no particular changes to a conventional microprocessor design, with the exception of those shown in FIG. 6. FIG. 6 shows how the address of the most recently retired instruction is latched (604) and then sent over interface 604' to the SFE. The GPR results delivered by general purpose execution units 601 and 602 are likewise latched (603) and then sent over interface 603' to the SFE. The uPCore shown in FIG. 6 is an in-order design, but the use of out-of-order design elements, as in microprocessors currently in commercial use, is not inconsistent with this design.

Synchronization unit

The synchronization unit (SU, 201) contains all the logic functions required to control the interaction between the uPCore and the SFE. The SU comprises a state machine and associated registers 404, 405, and 406. The outputs of the state machine include interface 209 to the SFE, which controls the purge function and provides input to the RMS. The line to the RMS controls the loading of the SRAL into the DRAL during a synchronization operation.

The synchronization unit contains logic used to determine whether the SFE is providing a prefetch benefit to the overall system. This embodiment uses two instruction counters, 407 and 408, for that purpose. The first counter (408) is incremented each time the uPCore retires an instruction. The second counter (407) is incremented each time the SFE decodes an instruction. Both counters are reset to zero during a resynchronization operation. After a resynchronization, a comparison of the counters is used to determine whether the SFE has had any opportunity to generate speculative fetch references that help the uPCore. Unless the SFE is decoding instructions sufficiently far ahead of uPCore execution, there is no possibility of benefit. The comparison of the two counters gives an inexact but sufficient indication of the possibility of benefit, as input to the specific benefit decision point 708 of FIG. 7. This embodiment uses a threshold of 10: if the SFE decode count is not at least 10 greater than the uPCore retire count, the synchronization unit indicates no benefit.
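The two-counter comparison can be written out as a small sketch. The class and method names are hypothetical; the counter roles (408 for uPCore retirements, 407 for SFE decodes), the reset on resynchronization, and the threshold of 10 are taken from the text. The result is only the coarse input to the benefit decision, not the decision itself.

```python
BENEFIT_THRESHOLD = 10  # threshold used by the described embodiment


class BenefitCounters:
    """Hypothetical model of instruction counters 407 and 408."""

    def __init__(self):
        self.upcore_retired = 0  # counter 408: instructions retired by uPCore
        self.sfe_decoded = 0     # counter 407: instructions decoded by SFE

    def on_upcore_retire(self):
        self.upcore_retired += 1

    def on_sfe_decode(self):
        self.sfe_decoded += 1

    def on_resync(self):
        # Both counters are reset to zero during a resynchronization.
        self.upcore_retired = 0
        self.sfe_decoded = 0

    def benefit_possible(self):
        """True only if the SFE is at least 10 instructions ahead."""
        return self.sfe_decoded - self.upcore_retired >= BENEFIT_THRESHOLD
```

With 12 decodes and 3 retirements the lead is 9, so no benefit is indicated; one more decode brings the lead to 10 and the indication changes.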

The synchronization unit also holds an indication of which uPCore the SFE is currently associated with. Each SFE has a single synchronization unit, but each SFE can be associated with any number of uPCores. This embodiment has one SFE associated with two uPCores.

Alternative extensions to the interaction between CP and SE

Other extensions to the interaction between CP and SE are possible. One example would have the SE update a branch prediction table shared by both the SE and the CP. The SE could also provide the CP with hints about potential instruction exceptions or other conditions, allowing the CP to avoid pipeline disruptions. Instruction and operand data fetched in response to SFE fetch requests is forwarded directly to the uPCore. The data is then close to the uPCore general purpose execution units and instruction decode logic when the speculative requests are accurate. This may further reduce the finite cache penalty in some implementations.

Having described the preferred embodiment of the invention, it will be understood that those skilled in the art, both now and in the future, may make various improvements and enhancements within the scope of the disclosure.

Those who make such improvements will appreciate that performance analysis of the invention shows that out-of-order (or out-of-sequence) execution provides a greater benefit in reducing finite L1 cache CPI than in reducing infinite L1 cache CPI. Current technology trends show finite cache effects growing rapidly, which makes the finite L1 CPI benefit much larger than the infinite L1 CPI benefit.

The speculative fetch engine (SFE) described above supports a core microprocessor and interacts with it in concert: the core microprocessor maintains the architectural state during coherent operation, while speculative memory references are allowed to the storage hierarchy shared by the SFE and the microprocessor core. This will benefit those who wish to achieve a significant simplification over prior art designs that use out-of-order execution, or a significant performance improvement over prior art designs that do not. Ideally, the invention allows better optimization of the design tradeoffs associated with the use of out-of-order execution in the pursuit of increased system performance. The invention allows the microprocessor design to be optimized for high frequency, low complexity, and low infinite L1 cache CPI without adding very large out-of-order execution complexity to the main pipeline, in contrast to the increased number of stages used in some recent designs.

At the same time, the coprocessor can use out-of-order execution techniques to a much greater extent in pursuit of a reduction in finite cache effects for both the microprocessor and the coprocessor. The complexity of out-of-order execution in the coprocessor is eased by the fact that the coprocessor need not support the full set of architected instructions or the full set of exceptions and interrupts associated with instruction execution. The following claims should be construed to encompass further improvements and to maintain proper protection for the invention first disclosed.

With the configuration described above, the present invention has the effect of improving microprocessor performance.

Claims

1. A method for processing a sequential instruction stream in a computer system having first and second processing elements, each processing element having its own state determined by the setting of its general purpose registers and control registers, the method comprising:

a) sending a start instruction of the sequential instruction stream to a first one of the processing elements;

b) continuing processing of the sequential instruction stream using the first processing element and sending any change in the architectural state of the computer system to the second processing element; and

c) if, at any point during the processing of the sequential instruction stream by the first processing element, it becomes beneficial for the second processing element to take over continued processing of the same sequential instruction stream, restoring the transmitted state in the second processing element and taking over continued processing of the same sequential instruction stream by processing it with the second processing element,

wherein the second processing element sends any change in the architectural state of the computer system to the first processing element while it is processing the sequential instruction stream,

whereby the first and second processing elements may execute the same instructions while processing the sequential instruction stream, but only one of the processing elements can change the overall architectural state of the computer system, which is determined by the combination of the states of the first and second processing elements.

2. The method of claim 1, wherein the first processing element comprises a plurality of first processing elements functioning as a multiprocessor.

2. The method of claim 1, wherein the combination of states is determined by the first processing element.

The method of claim 1, wherein the combination of states of the first and second processing elements is determined by the first processing element and a set of at least one second processing element that functions as a multiprocessor with at least one first processing element. How the sequential instruction stream is determined.

5. The method of claim 4, wherein one sync unit is provided for each of the second processing elements.

5. The synchronizing apparatus according to claim 4, wherein one synchronizing unit is provided for each of the second processing elements, and wherein the synchronizing unit is configured to process the same instruction as the instruction being processed by the first processing element when the second processing element is processed. How to process a sequential instruction stream to determine if it starts.

5. A synchronization unit as claimed in claim 4, wherein one synchronization unit is provided for each of the second processing elements, wherein the synchronization unit is configured to execute the same instruction as the instruction of the processing stream being processed by the first processing element or the next instruction. 2 A method of processing a sequential instruction stream that determines when processing elements begin processing.

5. A synchronization unit as claimed in claim 4, wherein one synchronization unit is provided for each of the second processing elements, wherein the synchronization unit is configured to execute the same instruction as the instruction of the processing stream being processed by the first processing element or the next instruction. 2 A method of processing a sequential instruction stream that determines when a processing element begins to process and also when the instruction processing of the second processing element should be interrupted or ignored.

9. The method of claim 8, wherein the determination as to when the instruction processing of the second processing element should be interrupted or ignored is calculated for the entire computer system having an input provided from the first and second processing elements to the synchronization unit. A method of processing a sequential instruction stream based on a gained benefit determination.

10. The method of claim 9, wherein the input provided to the synchronization unit is information determined immediately or includes information stored in the system.

11. The method of claim 10, wherein the information is information stored in an instruction counter of a synchronization unit.

7. The method of claim 6, wherein when there is a delay in the instruction processing by the first processing element, the sync unit causes the second processing element to process the same instruction as the instruction being processed by the first processing element. How to process a sequential instruction stream to determine if it starts.

7. The method of claim 6, wherein, when the second processing element encounters an operation that it is not designed to handle while processing the instruction stream, the synchronization unit determines when to resynchronize the state of the second processing element with the architectural state.

7. The synchronization unit of claim 6, wherein when it is determined that the second processing element provides no benefit to the computer system when processing the instruction stream, the sync unit resynchronizes the state of the second processing element with the structural state. How to process sequential instruction streams.

7. The method of claim 6, wherein when there is a delay in processing an instruction by the first processing element, the sync unit is configured to send the same instruction as the instruction being processed by the first processing element and when the second processing element is used. 1 A method of processing a sequential instruction stream that determines which of the processing elements to begin processing with.

7. The synchronization unit of claim 6, wherein when there is an operation in which the second processing element is not designed to be handled during the processing of the instruction processing element, the synchronization unit is configured to change the state of the second processing element from the structural state and the first processing. A method of processing a sequential instruction stream that determines which of the elements to resynchronize.

7. The synchronization unit of claim 6, wherein when there is a determination that the second processing element provides no benefit to the computer system in processing the instruction stream, the sync unit determines the state of the second processing element and the structural state. A method of processing a sequential instruction stream that determines which of the first processing elements to resynchronize with.

7. The method of claim 6, wherein the result of the second processing element is stored in a private general purpose register or a memory buffer.

2. The synchronizing apparatus as claimed in claim 1, wherein one synchronizing unit is provided for the second processing element, and the synchronizing unit outputs the same instruction as the instruction of the processing stream being processed by the first processing element or the next instruction. Determining when a processing element begins processing, and also determining when instruction processing of the second processing element should be interrupted or ignored.

2. The method of claim 1, wherein the second processing element stores the results of its processing of the sequential instruction stream in private general purpose registers or a private store buffer coupled to the second processing element, and the first processing element, when performing a fetch operation, fetches data from the instruction and data caches shared by both the first and second processing elements.

21. The method of claim 20, wherein the second processing element is used to process some of the same instructions in the instruction stream processed by the first processing element.

22. The method of claim 21, wherein the second processing element is an out-of-order processor.

16. The system of claim 15, wherein the first and second processors are synchronized after a predetermined delay has occurred, and during resynchronization, the second processing element is a total result of preprocessing the instruction stream for the first processing element before resynchronization; A method of processing a sequential instruction stream that erases some results.

17. The method of claim 16, wherein during resynchronization the second processing element erases all and some results of preprocessing the instruction stream for the first processing element prior to resynchronization.

18. The method of claim 17, wherein during resynchronization the second processing element erases all and some results of preprocessing the instruction stream for the first processing element prior to resynchronization.

10. A method for processing a sequential instruction stream of a computer system having a first and a second processing element, each having its own state determined by the setting of its general purpose registers and control registers.

b) continue processing of the sequential instruction stream using the first processing element, sending any change in the state of the computer system structure required by the second processing element to a second processing element, and the transmitted change Accumulating to use for the structural state for the second processing element at a predetermined point in time; And

c) from the first processing element if it is advantageous for the second processing element to take over processing of the same sequential instruction stream at any point during the processing of the sequential instruction stream by the first processing element. Restoring successive processing of the same sequential instruction stream by recovering the cumulative structural state previously transmitted to the second processing element and processing the sequential instruction stream with the second processing element.

Including,

The second processing element accumulates the change to the structural state to be used at a given point in time in the future by using any change in the state of the computer system structure required by the first processing element while processing the sequential instruction stream. To the first processing element for

This allows the first and second processing elements to execute the same instruction during processing of the sequential instruction stream, but only one of the processing elements is determined by the combination of some states of the first and second processing elements. To change the overall structural state of the system.

How to process sequential instruction streams.

27. The method of claim 26, wherein if it is advantageous for the first processing element to take over processing of the same sequential instruction stream at any point during the processing of the sequential instruction stream by the second processing element, Recovering the accumulated structural state previously transmitted from the second processing element to the first processing element, and taking over the continuous processing of the same sequential instruction stream by processing the sequential instruction stream with the first processing element, 1 The processing element comprises a plurality of first processing elements that function as multiprocessors.