KR101667772B1

KR101667772B1 - Translation look-aside buffer with prefetching

Info

Publication number: KR101667772B1
Application number: KR1020157006634A
Authority: KR
Inventors: Laurent Moll; Jean-Jacques Lecler; Philippe Boucard
Original assignee: Qualcomm Technologies, Inc.
Priority date: 2012-08-18
Filing date: 2013-08-16
Publication date: 2016-10-19
Also published as: KR20160122278A; US9465749B2; US9852081B2; US9396130B2; US9141556B2; US20140052956A1; EP2885713A4; EP2885713A2; US20140052955A1; CN104583976B; KR20150046130A; US20140052954A1; US20140052919A1; WO2014031495A3; WO2014031495A2; CN104583976A

Abstract

A system TLB accepts translation prefetch requests from initiators. Misses generate external translation requests to a walker port. The attributes of the request, such as ID, address, and class, as well as the state of the TLB, affect the allocation policy of translations within multiple levels of translation tables. The translation tables are implemented with SRAM and organized in groups.

Description

TRANSLATION LOOK-ASIDE BUFFER WITH PREFETCHING

Related Applications

Furthermore, the present application is related to US non-provisional patent application ___ (ART-024US1), filed on ___ and entitled SYSTEM TRANSLATION LOOK-ASIDE BUFFER WITH REQUEST-BASED ALLOCATION AND PREFETCHING; US non-provisional patent application ___ (ART-024US2), filed on ___ and entitled SYSTEM TRANSLATION LOOK-ASIDE BUFFER INTEGRATED IN AN INTERCONNECT; US non-provisional patent application ___ (ART-024US3), filed on ___ and entitled DMA ENGINE WITH STLB PREFETCH CAPABILITIES AND TETHERED PREFETCHING; and US non-provisional patent application ___ (ART-024US4), filed on ___ and entitled STLB PREFETCHING FOR A MULTI-DIMENSION ENGINE, each of which is incorporated herein by reference.

Technical Field

The invention disclosed herein is in the field of computer system design, particularly for system-on-chip semiconductor devices.

Memory Management Units (MMUs) are commonly used in microprocessors to provide virtual memory capability. When virtual memory is enabled, software executing on the processor sees and uses only Virtual Addresses (VAs). The MMU is tasked with translating each VA into a Physical Address (PA) that can then be used inside and outside the processor. Using virtual memory has a number of advantages, including giving the illusion of more memory than is actually available, giving access to a physical memory system that has more address bits than are supported by the software, and protecting physical memory through varying access rights.

Some modern systems that support virtualization have two levels of address translation between VAs and PAs. The first level is similar to that found on a non-virtualized system, but the address it produces is not the final PA. It may be called an Intermediate Physical Address (IPA) or a Guest Physical Address (GPA). The second level maps that intermediate address to the final PA. In these systems, for any software running on the processor, the first level, the second level, or both may be enabled.
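The two-level scheme can be pictured with a minimal sketch. It assumes 4 KiB pages and models each level as a flat dictionary from page number to page number; the names `translate_stage` and `translate`, and the single-level tables, are illustrative assumptions rather than anything the specification defines.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


def translate_stage(table, addr):
    """Map one address through one translation level (page number -> page number)."""
    page, offset = divmod(addr, PAGE_SIZE)
    return table[page] * PAGE_SIZE + offset


def translate(va, stage1=None, stage2=None):
    """Apply whichever levels are enabled: VA -> IPA (stage 1), IPA -> PA (stage 2)."""
    ipa = translate_stage(stage1, va) if stage1 is not None else va
    return translate_stage(stage2, ipa) if stage2 is not None else ipa


stage1 = {0x10: 0x20}  # guest mapping: VA page 0x10 -> IPA page 0x20
stage2 = {0x20: 0x30}  # hypervisor mapping: IPA page 0x20 -> PA page 0x30
print(hex(translate(0x10123, stage1, stage2)))
```

Passing `None` for a level models the case where that level is disabled for the running software.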

In general, the virtual address space is divided into pages. Pages are commonly a few kilobytes, though other page sizes can be used. Systems often support multiple page sizes, from a few kilobytes to a few megabytes or even gigabytes, to increase translation efficiency. All addresses within a page are translated in the same fashion, and all access right information is the same. The translation between VAs and PAs is done through a (often multi-level) page table. The process of going through the page table to translate a VA into a PA is often called walking, because it comprises a sequence of page table lookups.

The MMU often comprises two parts. The first part is called the Translation Look-aside Buffer (TLB). It caches translations so that they are very quickly accessible to the processor; for translations that are cached, the processor can execute with little delay. The second part is the walker, which walks the page tables when the TLB does not contain a translation. In some systems, there is more caching between the TLB and the walker. For example, the TLB may have two levels of caching. The walker may itself contain a cache.

A System MMU (SMMU) mirrors the functionality of an MMU, but applies to initiators other than microprocessors. Some such initiators are GPUs, DSPs, and DMA engines. With an SMMU, these initiators can take advantage of the benefits of virtual memory and virtualization. Like an MMU, an SMMU operates on pages and uses page tables to calculate translations. In some cases, the SMMU uses the same page table formats as the MMU of a processor to which the SMMU's initiator is connected. In that case, the page tables can be shared between the MMU and the SMMU.

Just as an MMU comprises a TLB and a walker, an SMMU comprises a System TLB (STLB) and a walker. The STLB acts as a cache for translations to help maintain the peak performance of initiators. In some cases, multiple STLBs can share a single walker for efficiency reasons. In some cases, multiple initiators can share an STLB.

In most cases, TLBs inside processors are tightly integrated with the processor because physical addresses are needed inside the processor (e.g., for caches that are visible to cache coherency). In contrast, an STLB does not have to be integrated inside an initiator. It can be placed outside the initiator without any negative impact. In many cases, multiple initiators share a single STLB. An STLB only needs to sit between the source and the destination of a request to provide translation services. Most chip designs have an interconnect that provides initiators with access to targets. STLBs can be placed between the initiators and the interconnect, or within the interconnect, close to the initiators.

Adding virtual memory to an initiator can have a severe negative impact on the initiator's performance. Besides the latency increase due to the additional logic, whenever the STLB misses (i.e., the translation needed to process a request from the initiator is not cached), the request must be stalled until the translation has been resolved by another level of cache or by a walk. This affects the performance of the request and can also affect the requests that follow it. Since a page table walk typically takes between 1 and 20 sequential accesses, and each access typically takes between 10 and 100 clock cycles, an initiator request can be stalled for a large amount of time and the performance of the initiator can be drastically reduced.
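The cost of a single miss follows directly from the ranges quoted above. The short sketch below simply multiplies the two figures, on the assumption that the accesses of one walk are strictly sequential; the function name `walk_stall_cycles` is made up for illustration.

```python
def walk_stall_cycles(accesses, cycles_per_access):
    """Cycles a request stalls during one walk, assuming sequential accesses whose latencies add."""
    return accesses * cycles_per_access


best_case = walk_stall_cycles(1, 10)      # shortest walk, fastest access
worst_case = walk_stall_cycles(20, 100)   # longest walk, slowest access
print(best_case, worst_case)
```

Under these assumptions a miss stalls a request for anywhere between 10 and 2000 clock cycles, which is why a burst of misses can dominate an initiator's run time.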

STLB misses can be reduced by increasing the size of the STLB cache and by reducing the walk delay. However, this is insufficient. Some misses are compulsory misses (e.g., when a page is seen for the first time and the cache has no chance of already containing the translation), and these are not improved by cache size. In some extreme examples, large amounts of data with poor page locality are requested by the initiator, triggering a large number of compulsory misses.

Therefore, what is needed is a mechanism for prefetching translations into the STLB so that the probability of a request missing in the STLB is reduced or eliminated. This is particularly applicable to system TLBs because initiators tend to have predictable memory access patterns and, in many cases, can also be enhanced to generate advanced prefetch patterns that predict future memory access patterns.

Each STLB has a target-side interface that makes memory requests using a protocol. Different I/O devices require different protocols. This makes the designs of different STLBs inconsistent and therefore more complex. Address decoding is performed both in the STLB and in the interconnect, which involves unnecessarily redundant logic that uses silicon area and limits operating speed. The interface protocol used to transport requests from STLBs to their walkers is different from the protocol used to transport requests from initiators to targets within the interconnect. This increases the complexity of verification and system-level modeling. Furthermore, when subsystem interconnects are used to integrate separately designed logic blocks, there is no way to transfer translation information and translation prefetch requests from initiators through the interconnect to the TLBs. Moreover, multiple STLBs that access shared translations gain no benefit from the shared locality of their requests.

Some I/O devices, like some microprocessors, perform prefetch operations to cause data to be moved from a backing store into a cache so that a likely future request will hit rather than miss in the cache. Prefetching generally trades backing store bandwidth for the benefit of lower average memory access latency.

In a system with a TLB, any access can cause a miss in the TLB translation table. Although they are less frequent than cache misses under most operating conditions, TLB misses can cause significantly longer delays than a normal cache miss.

In many data processing fields it is common to access a data set in a way that does not follow its memory organization. In particular, two-dimensional arrays are typically laid out in memory so that accesses along one of the two dimensions are sequential in memory. However, accessing the same array along the other dimension requires non-sequential accesses to memory. Fields in which this type of access occurs include video and image capture and display, 2D processing, and other fields with matrix-based data processing. When an array with two or more dimensions (e.g., a 2D surface) is represented in a system with memory organized as a linear address space, certain severe performance problems arise if the address space is divided into translated pages and the array dimensions are not much smaller than the page size. Every data element or atomic unit (e.g., pixel) of the surface data will access a different page in either the read or the write step of a rotation. This causes a flurry of STLB misses, at least at the start of the surface. If the number of rows being accessed exceeds the number of mappings in the STLB cache, every pixel in the entire surface causes an STLB miss.
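A small calculation illustrates the problem. The sketch assumes a row-major surface of 1024 x 1024 pixels at 4 bytes per pixel and 4 KiB pages, so that each row occupies exactly one page; these dimensions and the helper names are illustrative assumptions, not values from the specification.

```python
PAGE_SIZE = 4096               # assumed page size
WIDTH = HEIGHT = 1024          # assumed surface dimensions in pixels
BYTES_PER_PIXEL = 4            # assumed pixel size


def pixel_address(row, col, base=0):
    """Linear address of a pixel in a row-major surface."""
    return base + (row * WIDTH + col) * BYTES_PER_PIXEL


def pages_touched(addresses):
    """Number of distinct pages, and hence distinct translations, a walk needs."""
    return len({addr // PAGE_SIZE for addr in addresses})


row_walk = [pixel_address(0, col) for col in range(WIDTH)]    # along the memory layout
col_walk = [pixel_address(row, 0) for row in range(HEIGHT)]   # across the memory layout
print(pages_touched(row_walk), pages_touched(col_walk))
```

With these numbers, reading one row needs a single translation, while reading one column needs 1024, one per pixel. A translation cache with fewer than 1024 entries would then miss on every pixel of a column-order traversal.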

The present invention is a mechanism for prefetching translations into an STLB so that the probability of a request missing in the STLB is reduced or eliminated. Requests are input to the STLB. A request can be a normal request (a data read or write) or a prefetch. A prefetch is a request that does not cause data to be transferred between an initiator and a target. A prefetch causes the STLB to request a read of a translation from the page table in advance of when the translation is likely to be requested.
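The difference between a normal request and a prefetch can be sketched with a toy model. It assumes a single-level page table held in a dictionary, 4 KiB pages, and an unbounded translation cache; the class `ToyStlb` and its methods are hypothetical names used only for this illustration.

```python
PAGE_SIZE = 4096  # assumed page size


class ToyStlb:
    """Toy STLB: a dictionary cache in front of a dictionary page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # stands in for the walker and the in-memory page table
        self.cache = {}               # cached translations: VA page -> PA page
        self.walks = 0                # number of external translation requests issued

    def _walk(self, page):
        self.walks += 1
        return self.page_table[page]

    def prefetch(self, va):
        """Load the translation for va; no data moves between initiator and target."""
        page = va // PAGE_SIZE
        if page not in self.cache:
            self.cache[page] = self._walk(page)

    def access(self, va):
        """Normal request: return (hit, translated address), walking on a miss."""
        page, offset = divmod(va, PAGE_SIZE)
        hit = page in self.cache
        if not hit:
            self.cache[page] = self._walk(page)
        return hit, self.cache[page] * PAGE_SIZE + offset


stlb = ToyStlb({5: 9})
stlb.prefetch(5 * PAGE_SIZE)              # issued ahead of the data request
hit, pa = stlb.access(5 * PAGE_SIZE + 8)  # the later data request finds the translation cached
print(hit, hex(pa), stlb.walks)
```

In this model the walk latency is paid by the prefetch, so the data request that follows does not stall.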

Note that an STLB prefetch is different from a CPU prefetch. A CPU prefetch is generally initiated by a particular instruction. A CPU prefetch moves data from memory into the CPU data or instruction cache to improve the probability of a CPU read hitting in the cache. An STLB prefetch moves a mapping from a page table in memory into the STLB translation table, which acts as a cache, to improve the probability of the address translation of an initiator transaction request hitting in the STLB translation cache. The remainder of this disclosure is concerned with STLB prefetches, and the term prefetch should be understood to mean an STLB prefetch.

The disclosed invention is an improved STLB and a system comprising it. The STLB is placed in the interconnect, close to the I/O devices. It uses a generic interface on the initiator side and on the target side so that it is reusable between I/O devices with different interface protocols. It is connected to a shared walker through the data path transport topology using a universal transport protocol.

The disclosed invention allows translation allocation information and translation prefetch commands to be passed from initiators through an interconnect to STLBs located on the target side of the interconnect. Furthermore, multiple STLBs can use a shared intermediate-level translation cache to take advantage of the locality of requests between different I/O devices.

The invention disclosed herein is an efficient system for initiating TLB translation prefetches for I/O devices. It can provide a significant reduction in average memory access latency. The invention is employed in a DMA engine connected to a TLB. A DMA engine is any device, other than a central processing unit, that moves blocks of data from memory, to memory, or both. DMA engines can be connected to I/O interfaces, integrated within processing engines, stand-alone, or otherwise incorporated within chips.

Prefetching occurs as a result of requests being sent with particular VAs. The key to the effectiveness of prefetching is the function of the prefetch virtual address generator within the DMA engine.

The disclosed invention is a multi-dimensional engine that can be connected to memory through an SMMU with an STLB. The multi-dimensional engine accesses data for surfaces or other types of data arrays. It can access data in a non-linear fashion. The multi-dimensional engine sends requests for translations of addresses within pages in advance of later requests to access data elements within those pages, thereby minimizing or avoiding stalling for translation fetching when the data element access request is sent. Such data-less requests for translations are known as STLB prefetch requests. The multi-dimensional engine accesses groups of data elements that use only a number of mappings that can be contained within the capacity of the STLB, thereby reducing the total number of translation cache misses.

Figure 1 illustrates a simple STLB in accordance with various aspects of the present invention.
Figure 2 illustrates an STLB with allocation logic in accordance with various aspects of the present invention.
Figure 3 illustrates an STLB that performs explicit prefetching in accordance with various aspects of the present invention.
Figure 4 illustrates an SMMU comprising an STLB and a walker in accordance with various aspects of the present invention.
Figure 5 illustrates a conventional system of system TLBs and an interconnect in accordance with the present invention.
Figure 6 illustrates an interconnect in which system TLBs are integrated within initiator network interface units in accordance with the present invention.
Figure 7 illustrates a system of two interconnects in which translation requests of one interconnect are supported by system TLBs integrated in the other interconnect in accordance with the present invention.
Figure 8 illustrates system TLBs sharing an intermediate-level translation cache in accordance with the present invention.
Figure 9 illustrates a simulation environment for an interconnect in accordance with the present invention.
Figure 10 illustrates a system of a DMA engine with a virtual address generator, an STLB, and a target in accordance with the present invention.
Figure 11 illustrates a system of a DMA engine with normal and prefetch virtual address generators, an STLB, and a target in accordance with the present invention.
Figure 12 illustrates a process for generating a prefetch virtual address stream and a normal virtual address stream.
Figure 13 illustrates the rotation of a surface.
Figure 14 shows the correspondence of the addresses of data within a rotated line.
Figure 15 shows reading groups of pixels from a source and writing them to a destination.
Figure 16 shows the mapping of the rows of a surface to memory pages.
Figure 17 illustrates an arrangement of a rotation engine, an SMMU, and a memory.
Figure 18 illustrates another such arrangement with separate channels for data requests and prefetch requests.
Figure 19 illustrates two possible arrangements of address generators within a rotation engine.
Figure 20 shows the application of a prefetch window to an access region with lines spread across different pages.

Referring now to FIG. 1, STLB 100 receives input requests through input port 110, translates the address, and transmits one or more corresponding output requests through output port 120. The requests are sent to context logic 150, which extracts the portions of the request that are relevant to the translation, including, but not limited to, the address, operation (read/write), size, security indication, privilege level, and generic side-band information. Using these portions, the context logic creates a combination of at least a context identifier and an address, which is looked up in translation table 160.

Translation table 160 includes one or more entries, each of which corresponds to a range of addresses and a set of contexts. When the input request address and context match an entry, the translation table returns a number of parameters, including the translated address and permissions. Translation table 160 can be implemented in many ways, for example in the form of a Content-Addressable Memory (CAM), either fully associative or set-associative. According to some aspects, STLB 100 includes multiple levels of translation table arranged in the manner of a multi-level cache, with a first translation table that has fewer entries but lower latency, followed by a second translation table that has more entries but higher latency.
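The matching rule described above can be sketched as a fully associative search. The entry layout below (a set of contexts, a VA base, a page size, a PA base, and a write permission) is an assumed, simplified format chosen for illustration; the specification does not fix the fields of an entry, and real hardware would compare all entries in parallel rather than in a loop.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Entry:
    contexts: FrozenSet[int]  # set of context identifiers the entry applies to
    va_base: int              # first virtual address covered by the entry
    page_size: int            # size of the covered address range in bytes
    pa_base: int              # translated base address
    writable: bool            # simplified permission field (assumption)


def lookup(table: List[Entry], context: int, va: int) -> Optional[Entry]:
    """Return the entry matching both context and address, or None on a miss."""
    for entry in table:
        in_range = entry.va_base <= va < entry.va_base + entry.page_size
        if context in entry.contexts and in_range:
            return entry
    return None


table = [
    Entry(frozenset({1, 2}), 0x0000_4000, 0x1000, 0x8000_0000, True),
    Entry(frozenset({3}), 0x0020_0000, 0x20_0000, 0x9000_0000, False),  # larger page
]
print(lookup(table, 2, 0x4abc))
```

A request hits only when both the address range and the context match, so the same virtual address can miss for one context and hit for another.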

Based on the context calculation performed in context logic 150 and the result of the translation table lookup in translation table 160, control logic 170 takes one of the following actions:

A) If context logic 150, using information from the request and some internal configuration state, determines that the request does not need translation, the request is allowed to pass through with no or little modification, and translation table 160 is not interrogated.

B) If the lookup hits in translation table 160 and the permissions are valid, the request is translated. At least the address field is modified to use the translated address. Typically, the most significant bits are replaced by the translated address obtained from translation table 160, where the number of bits is derived from the page size corresponding to the entry in translation table 160. In some cases, other attributes of the request are also modified.

C) If the permission check fails, the request is not allowed to proceed or is marked with an error.

D) If the lookup misses in translation table 160 (i.e., there is no corresponding entry), the request does not proceed until STLB 100 has placed an external translation request through walker port 130 and received a response from the same walker port 130. While the external translation request is pending, the original request is stalled in STLB 100. This stalls all traffic going through STLB 100. Some STLBs have a buffer 180 used to temporarily store requests that have missed in the TLB. It is commonly referred to as a Hit-Under-Miss (HUM) and/or Miss-Under-Miss (MUM) buffer, since it allows later requests to keep going through STLB 100 (until buffer 180 is full). After the external translation request through walker port 130 has received a response, the stalled request is translated and handled in accordance with A), B), or C).
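Cases A) through D) can be summarized in a short decision function. The sketch below assumes a fixed 4 KiB page size, a table keyed by (context, virtual page number), and a single write permission bit; the names `Request`, `translate`, and `handle`, and the returned action labels, are hypothetical and chosen only to mirror the four cases.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed fixed page size for this sketch


@dataclass
class Request:
    context: int
    va: int
    is_write: bool
    needs_translation: bool = True  # stands in for the context logic's decision in case A


def translate(va, pa_page_base, page_size=PAGE_SIZE):
    """Case B: replace the most significant bits, keep the offset within the page."""
    offset_mask = page_size - 1
    return (pa_page_base & ~offset_mask) | (va & offset_mask)


def handle(req, table):
    """table maps (context, VA page number) -> (PA page base, writable)."""
    if not req.needs_translation:
        return ("pass_through", req.va)                # case A: table is not interrogated
    key = (req.context, req.va // PAGE_SIZE)
    if key not in table:
        return ("miss", None)                          # case D: stall, ask the walker
    pa_page_base, writable = table[key]
    if req.is_write and not writable:
        return ("error", None)                         # case C: permission failure
    return ("translated", translate(req.va, pa_page_base))  # case B


table = {(1, 0x12): (0xABC000, False)}
print(handle(Request(1, 0x12345, is_write=False), table))
```

In this model a request that returns `"miss"` would be parked (for example in a hit-under-miss buffer) and passed to `handle` again once the walker's response has been added to the table.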

When an external translation request is sent through walker port 130 and the response comes back, the new translation is added to translation table 160. Many types of algorithm can be used to choose the entry in which the new translation is placed (and from which the previously stored translation is discarded). These include common cache allocation and replacement policies such as Least Recently Used (LRU), Pseudo-LRU, Not-Recently-Used (NRU), and First-In-First-Out (FIFO), among others.
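As one concrete instance of the policies listed above, LRU replacement can be sketched with an ordered dictionary. The class name `LruTranslationTable` and its `lookup` and `fill` methods are illustrative assumptions; this is a software model of the policy, not a description of how the hardware implements it.

```python
from collections import OrderedDict


class LruTranslationTable:
    """Translation table model that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest use first, most recent use last

    def lookup(self, key):
        """Return the cached translation (marking it recently used) or None on a miss."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

    def fill(self, key, translation):
        """Add a translation returned by the walker; return the evicted key, if any."""
        evicted = None
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            evicted, _ = self.entries.popitem(last=False)
        self.entries[key] = translation
        return evicted


lru = LruTranslationTable(capacity=2)
lru.fill("page_a", 0x1000)
lru.fill("page_b", 0x2000)
lru.lookup("page_a")                    # page_a becomes the most recently used entry
victim = lru.fill("page_c", 0x3000)     # table is full, so the oldest entry is discarded
print(victim)
```

A FIFO policy would differ only in that `lookup` would not reorder the entries.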

According to some aspects of the invention, external translation requests include a type. The type indicates whether the external translation request was initiated in response to a normal input request or in response to a prefetch request. According to some aspects of the invention, the type is encoded in-band within the protocol of walker port 130. According to other aspects of the invention, the type is indicated with a sideband signal.

According to some aspects of the invention, translation table 160 is fully associative. According to some aspects of the invention, translation table 160 is set-associative. According to some aspects of the invention, translation table 160 is built from multiple levels of translation table. According to some aspects of the invention, translation table 160 is built from a first-level fully associative translation table and a second-level fully associative or set-associative translation table. According to some aspects of the invention, part of translation table 160 is built from one or more static random access memory (SRAM) arrays.

Referring now to FIG. 2, in accordance with the invention, allocation logic 261 uses some characteristics of the request coming from the input port 110 to decide, based on a replacement policy, whether an entry should be allocated for the request in the translation table 160 and, if so, which entry should be used. According to some aspects of the invention, the replacement policy can be changed on the fly. According to another aspect of the invention, the replacement policy is chosen on a per-transaction basis, as indicated by the transaction request.

According to some aspects of the invention, the translation table 160 is set-associative and is split into multiple groups, each group containing one or more ways. According to some aspects of the invention, a group is just one entry. According to other aspects of the invention, groups are relatively larger and are stored in SRAMs.

According to some aspects of the invention, attributes of the request coming from the input port 110 are used to determine the groups into which the request allocates an entry. Such attributes include the source of the request, the identifier (ID) of the request (sometimes referred to as the tag), special sidebands dedicated to allocation, or any combination of these.

According to some aspects of the invention, allocation logic 261 comprises a configurable or programmable register array that is used to determine the allowable allocation groups based on the attributes of the request used for the determination. A configurable array is set up at system hardware design time. A programmable array is set up by software and can be changed while the chip is running.
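The group selection described above can be sketched as a small register array indexed by a request class, where each register holds a bit mask of the groups that class may allocate into. This is a minimal sketch under stated assumptions: the class `AllocationRegisters`, the one-mask-per-class layout, and the default of allowing all groups are hypothetical choices for illustration, not details given in the description.

```python
class AllocationRegisters:
    """Toy programmable register array: one group bit mask per request class.

    Hypothetical layout for illustration only.
    """

    def __init__(self, num_groups, num_classes):
        self.num_groups = num_groups
        all_groups = (1 << num_groups) - 1
        # Assumed reset state: every class may allocate into every group.
        self.masks = [all_groups] * num_classes

    def program(self, request_class, group_mask):
        """Software write of one register (the programmable case)."""
        self.masks[request_class] = group_mask & ((1 << self.num_groups) - 1)

    def allowed_groups(self, request_class):
        """Return the list of group indices this class may allocate into."""
        mask = self.masks[request_class]
        return [g for g in range(self.num_groups) if (mask >> g) & 1]
```

A design-time configurable array would correspond to fixing `masks` at construction and omitting `program`.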

According to some aspects of the invention, classes of requests are determined using attributes of the requests. According to some aspects of the invention, one class allocates a maximum number of entries in the translation table 160. According to some aspects of the invention, one class allocates a minimum number of entries in the translation table 160.

According to some aspects of the invention, if the number of pending requests to the walker (through the walker port 130) is limited, the pending request entries are allocated in groups, or in a number of entries, regulated by class. An external translation request made while the limit on the number of pending requests is reached is temporarily blocked until a response brings the number of pending external translation requests back below the limit. According to other aspects of the invention, the limit applies to all pending external translation requests of a type that results from a prefetch input request. According to other aspects of the invention, the limit applies to all pending external translation requests of a particular group. According to other aspects of the invention, the limit applies to all pending external translation requests of a particular class.

According to an aspect of the invention, when the number of pending requests to the walker reaches the limit, requests resulting from prefetches are temporarily blocked until the number of pending requests falls below the limit. According to another aspect of the invention, when the number of pending requests to the walker reaches the limit, requests resulting from prefetches are discarded.
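The two behaviors at the pending-request limit (temporarily block, or discard prefetch-originated requests) can be sketched as follows. This is a simplified model, assuming a single global limit and a FIFO of blocked requests; the class name `WalkerPendingLimiter` and the string return values are hypothetical.

```python
from collections import deque


class WalkerPendingLimiter:
    """Toy model of a limit on pending external translation requests.

    Hypothetical sketch: one global limit, blocked requests kept in FIFO order.
    """

    def __init__(self, limit, drop_prefetch_at_limit=False):
        self.limit = limit
        self.drop_prefetch_at_limit = drop_prefetch_at_limit
        self.pending = 0
        self.blocked = deque()

    def request(self, tag, is_prefetch):
        """Try to issue a request; return 'sent', 'blocked', or 'dropped'."""
        if self.pending < self.limit:
            self.pending += 1
            return "sent"
        if is_prefetch and self.drop_prefetch_at_limit:
            return "dropped"  # discard prefetch-originated requests at the limit
        self.blocked.append(tag)
        return "blocked"

    def response(self):
        """A walker response arrived; return the blocked tag now sent, if any."""
        self.pending -= 1
        if self.blocked:
            self.pending += 1
            return self.blocked.popleft()
        return None
```

Per-class or per-group limits, as the description also allows, would keep one such counter per class or group instead of a single one.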

Referring now to FIG. 3, additional functionality supports explicit prefetching. According to an aspect of the invention, prefetch identification logic 355 identifies prefetch requests. Prefetch requests are identified using a variety of attributes. According to some aspects of the invention, prefetch requests are identified using a sideband signal that is sent with the requests through the input port 310. According to other aspects of the invention, prefetch requests are identified using the ID of the request. According to other aspects of the invention, prefetch requests are identified using address bits of the requests. Other fields or combinations of fields are used to identify prefetch requests. According to other aspects of the invention, prefetch requests have a dedicated prefetch port that is separate from the input port 310.

According to some aspects of the invention, prefetch requests are legally formed according to the standard transaction protocol in use. Some standard transaction protocols are the Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface (AXI) and the Open Core Protocol (OCP). According to some aspects of the invention, whereas normal requests are for a full cache line, prefetch requests are for a small amount of data, such as zero bytes or one byte. A zero-byte request is illegal according to some protocols but can be used as an indication of a prefetch. A one-byte request is legal according to most protocols and can be used as a prefetch size. A one-byte request, a byte being the minimum atomic unit of a data request, has the benefit of ensuring that the requested data does not cross any boundary between defined address ranges or cadences of access restrictions, such as page-size-aligned ranges. Furthermore, a one-byte request requires exactly one cycle of data transfer on the interface. Alternatively, input requests can be forced to be page aligned, such as by discarding the lower address bits. According to some aspects of the invention, prefetch requests must contain no data. According to some aspects of the invention, a prefetch request must be aligned to a certain boundary, such as 4 KB. According to other aspects of the invention, prefetch requests take special forms not anticipated by the protocol. According to some aspects of the invention, the prefetch request ID must differ from the IDs of normal requests. According to some aspects of the invention, pending prefetch requests must all have different IDs. According to some aspects of the invention, the prefetch identification logic 355 is integrated in the context logic 350.
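The size and alignment conventions above can be illustrated with a few helper functions. This is a minimal sketch, assuming a 4 KB page and a rule that any request of at most one byte is a prefetch; the names `page_align`, `is_prefetch_by_size`, and `crosses_page` are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size (4 KB); must be a power of two


def page_align(addr, page_size=PAGE_SIZE):
    """Force an address to be page aligned by discarding the lower address bits."""
    return addr & ~(page_size - 1)


def is_prefetch_by_size(size_bytes):
    """Assumed convention: zero-byte or one-byte requests indicate a prefetch."""
    return size_bytes <= 1


def crosses_page(addr, size_bytes, page_size=PAGE_SIZE):
    """Return True if the request touches more than one page."""
    if size_bytes == 0:
        return False
    last = addr + size_bytes - 1
    return page_align(addr, page_size) != page_align(last, page_size)
```

Because a one-byte request starts and ends at the same address, `crosses_page` is false for it at any address, which is the boundary property the paragraph relies on.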

According to some aspects of the invention, prefetch requests are also sent to the walker through port 330. According to some aspects of the invention, the request protocol used on port 330 includes a prefetch indication that makes the walker behave differently in certain cases. According to some aspects of the invention, when a request is identified as a prefetch request, it is considered speculative by the walker, and no side effects, such as interrupts, are triggered. According to some aspects of the invention, the response protocol on port 330 includes an indication that the request would normally have triggered a side effect in the walker had it not been a prefetch. According to some aspects of the invention, this response indication is used in STLB 300 to avoid adding a corresponding entry to the translation table 360 or the prefetch table 390. According to other aspects of the invention, the response indication is used to specially mark an entry in the translation table 360 or the prefetch table 390 so that, when a non-prefetch request matches this entry, a new request is sent to the walker through port 330 with the prefetch indication disabled.
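The special-marking variant described above can be sketched as a table whose entries remember whether the walker reported a suppressed side effect. This is a simplified model: the class `SpeculativeTable` is hypothetical, and the walker is assumed to be a callable taking `(vpage, prefetch)` and returning `(ppage, side_effect_suppressed)`, which is an interface invented for this example.

```python
class SpeculativeTable:
    """Toy table whose entries carry a walker side-effect indication.

    Hypothetical sketch; the walker interface is an assumption.
    """

    def __init__(self, walker):
        self.walker = walker  # callable(vpage, prefetch) -> (ppage, side_effect)
        self.entries = {}     # vpage -> (ppage, needs_replay)

    def prefetch(self, vpage):
        """Speculative walk: record the result and any side-effect indication."""
        ppage, side_effect = self.walker(vpage, prefetch=True)
        # Specially mark the entry rather than treating it as a plain hit.
        self.entries[vpage] = (ppage, side_effect)

    def translate(self, vpage):
        """Normal (non-prefetch) request."""
        entry = self.entries.get(vpage)
        if entry is not None and not entry[1]:
            return entry[0]  # ordinary hit
        # Miss or marked entry: walk again with the prefetch indication disabled
        # so the walker can trigger its side effects.
        ppage, _ = self.walker(vpage, prefetch=False)
        self.entries[vpage] = (ppage, False)
        return ppage
```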

According to some aspects of the invention, a prefetch table 390 is used to store pending prefetch requests. According to some aspects of the invention, the prefetch table 390 can contain prefetch requests that have not been sent to the walker. According to other aspects of the invention, the prefetch table 390 can contain prefetch requests that have been sent to the walker through port 330. According to other aspects of the invention, the prefetch table 390 can contain prefetch requests that have been sent to the walker through port 330 and have received a response through port 330. According to some aspects of the invention, when a response to a prefetch request is received by STLB 300, the response translation is allocated in the translation table 360.

In addition to the four possible actions (A, B, C, D) described for control logic 170 in FIG. 1, according to various aspects of the invention, control logic 370 is also enabled to send a request identified as a prefetch request to the prefetch request termination logic 395. According to some aspects of the invention, the prefetch request termination logic 395 automatically responds to prefetch requests without sending them through port 320. According to some aspects of the invention, the prefetch request termination logic 395 responds to prefetch requests that miss in the translation table 360 only after they have received a response from the walker (communicating through port 330). Prefetch requests that hit in the translation table 360 are responded to promptly by the prefetch request termination logic 395. According to some aspects of the invention, the prefetch request termination logic 395 responds right away to prefetch requests that match the page of a pending prefetch request. This way, an agent using STLB 300, connected through the input port 310, keeps track of the number of pending prefetch requests and ensures that it does not send more prefetch requests than STLB 300 can track. According to some aspects of the invention, when a prefetch request is received but the prefetch table 390 has no available entry, the prefetch request does not trigger a request to the walker through port 330 but is instead responded to right away by the prefetch request termination logic 395. According to some aspects of the invention, when a prefetch request is received but the prefetch table 390 has no available entry, the prefetch request is stalled until space becomes available in the prefetch table 390. According to some aspects of the invention, the choice between these two behaviors (ignore, stall) is made by an indication in the prefetch request. According to other aspects of the invention, the choice is made by a configurable bit.
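The decision made for an incoming prefetch request can be summarized in a single function. This is a minimal sketch, assuming the translation table and prefetch table are modeled as sets of page numbers and the ignore-or-stall choice is a boolean; the function name `handle_prefetch` and the returned strings are hypothetical labels.

```python
def handle_prefetch(page, translation_table, prefetch_table, capacity,
                    stall_when_full):
    """Decide how a prefetch request for `page` is handled.

    translation_table: set of pages with a cached translation.
    prefetch_table:    set of pages with a pending prefetch (mutated on allocate).
    capacity:          number of entries in the prefetch table.
    stall_when_full:   models the indication or configurable bit (ignore vs. stall).
    """
    if page in translation_table:
        return "respond_now"          # hit: terminated promptly
    if page in prefetch_table:
        return "respond_now"          # matches the page of a pending prefetch
    if len(prefetch_table) >= capacity:
        # No available entry: either stall, or respond without walking.
        return "stall" if stall_when_full else "ignore_and_respond"
    prefetch_table.add(page)
    return "walk_then_respond"        # respond once the walker has answered
```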

According to some aspects of the invention, STLB 300 includes allocation logic as described in FIG. 3. Allocation in the prefetch table 390 includes groups, or a number of entries, regulated by class. In addition, the prefetch indication is also used to determine the allowable allocation groups in the translation table 360 and the prefetch table 390.

According to some aspects of the invention, if the number of pending requests to the walker (through port 330) is limited, the pending request entries are allocated in groups, or in a number of entries, regulated by class, with the prefetch indication of the request on the input port 310 being used in the group or class determination.

According to some aspects of the invention, a separate port is provided for prefetch requests. In this case, prefetch requests need not be distinguished from normal requests. However, according to some aspects, they are still required to have a certain ID, size, and alignment.

According to some aspects of the invention, STLB 300 includes logic that automatically generates prefetches based on patterns observed on the input port 310. According to some aspects of the invention, the automatic prefetching automatically prefetches the following page (based on a page size, including the current page size, a programmed page size, or the minimum page size). According to some aspects of the invention, the automatic prefetching logic detects strides or two-dimensional accesses and prefetches accordingly. Other common prefetching techniques can be used, but they are adapted to page granularity and avoid making multiple automatic prefetches to the same page. According to some aspects of the invention, prefetches coming from the automatic prefetching logic and sent to the walker through port 330 use the speculative indication to prevent the walker from enabling side effects of the walk, such as interrupts.
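The simplest automatic scheme mentioned above, prefetching the following page while avoiding repeated prefetches of the same page, can be sketched as follows. This is a minimal sketch, assuming a fixed 4 KB page size and remembering only the most recently prefetched page; the class name `NextPagePrefetcher` is hypothetical, and stride or two-dimensional detection is not modeled.

```python
class NextPagePrefetcher:
    """Toy next-page prefetcher working at page granularity.

    Hypothetical sketch: remembers only the last page it prefetched.
    """

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.last_prefetched = None

    def observe(self, addr):
        """Observe an input request address.

        Returns the base address of the page to prefetch, or None if that
        page was already the most recent automatic prefetch.
        """
        next_page = (addr // self.page_size + 1) * self.page_size
        if next_page == self.last_prefetched:
            return None  # avoid multiple automatic prefetches to the same page
        self.last_prefetched = next_page
        return next_page
```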

Referring now to FIG. 4, according to some aspects of the invention, STLB 100 is contained within SMMU 400. STLB 100 is coupled to walker 410. Walker 410 receives external translation requests from STLB 100 and provides responses through the walker port 130. STLB 100 and walker 410 share the output port 120.

An interconnect 5100 with STLBs is shown in FIG. 5. Interconnect 5100 comprises initiator network interface unit ports 5110 and 5120, a central interconnection network 5140, and target ports 5150. Initiator network interface port 5110 uses the AXI protocol, and initiator network interface port 5120 uses the AHB interface protocol. Initiator IP interfaces are connected to initiator network interface ports 5110 and 5120 through STLBs 5112 and 5122, respectively. STLBs 5112 and 5122 are connected to a walker through the walker interface port 5160.

FIG. 6 shows an interconnect 6200 according to the invention, comprising initiator network interface units 6210 and 6220. Network interface unit 6210 comprises a specific-to-generic unit 6211 that adapts an initiator AXI transaction interface to an internal generic interface protocol. Network interface unit 6220 comprises a specific-to-generic unit 6221 that adapts an initiator AHB transaction interface to the internal generic interface protocol.

Each of the initiator network interface units 6210 and 6220 further comprises a generic-to-transport (G2T) unit 6212. The G2T unit converts each transaction into one or more transport packets and sends the transport packets onto a datapath transport network 6240, which carries the transactions to target network interface unit ports 6250.

According to an aspect of the invention, each initiator network interface unit further comprises an STLB 6213 arranged between the specific-to-generic unit and the G2T unit. STLBs 6213 have a generic protocol interface on both their initiator-side data request interface and their target-side data request interface. Whereas STLBs 5112 and 5122 are each adapted to a different protocol (AXI for STLB 5112 and AHB for STLB 5122), STLBs 6213 are identical and are designed to the generic protocol specification. The complexities of protocol adaptation are handled in the specific-to-generic units 6211 and 6221, so the generic protocol is designed for simplicity. According to some aspects of the invention, the generic protocol does not support the complexities of unaligned accesses or complex ordering requirements. The design of STLB 6213 is therefore greatly simplified. Furthermore, because of the simplification, logic paths in STLB 6213 are shorter and its latency is lower.

G2T 6212 decodes transaction addresses to determine the set of one or more target interfaces 6250 to which the transaction is directed. STLB 6213 must also decode the address to look up the translation. According to another aspect of the invention, the address decoding otherwise performed in G2T 6212 is instead performed in STLB 6213. This provides the benefit of reduced transaction latency.

Naturally, each STLB has a walker interface for sending walker requests to walker 6230. According to another aspect of the invention, the walker interfaces of STLBs 6213 are connected to walker 6230 through a transport network 6260. Transport network 6260 uses the same protocol and library of transport units as transport network 6240. This reduces the amount of unit-level logic design verification required and also reduces the complexity of building a performance estimation simulation model. The library of transport units includes:

serialization adapters, to allow trade-offs between bandwidth and wires within the chip floor plan;

clock domain adapters, for separate clock trees and frequency scaling;

power adapters, to allow power domain management;

observation probes;

security filters; and

other typical on-chip-interconnect units.

In contrast, the interface port 5160 to the walker does not use a standard protocol and therefore necessarily has a different set of interconnect logic.

FIG. 7 shows the interconnect 6200, initiator network interface unit 6210, and STLB 6213 of FIG. 6, in accordance with the teachings of the invention. A subsystem interconnect 7300 is connected to initiator network interface unit 6210 through its target network interface unit 7310. Subsystem interconnect 7300 comprises a number of initiator ports 7320 and an internal network 7330.

According to an aspect of the invention, subsystem interconnect 7300 comprises units from the same library as interconnect 6200. According to some aspects of the invention, the interface protocol between target network interface unit 7310 and initiator network interface unit 6210 is a standard protocol. Some standard protocols are AXI, ACE, and OCP. According to other aspects of the invention, the protocol between target network interface unit 7310 and initiator network interface unit 6210 is a special protocol with particularly low latency, such as the network-on-chip socket protocol described in U.S. Non-Provisional Patent Application Ser. No. 13/626,766, filed September 25, 2012, titled NETWORK ON A CHIP SOCKET PROTOCOL, and incorporated herein by reference. One feature that makes some protocols low-latency protocols is having a transaction identifier signal that eliminates the need for masters to perform an indirect lookup to associate responses with requests.

According to an aspect of the invention, TLB allocation information is sent by initiators connected to initiator network interface units 7320 and is transported through the subsystem internal network 7330, through target network interface unit 7310, to initiator network interface unit 6210, where it is provided to STLB 6213. STLB 6213 uses the allocation information to carry out an allocation policy.

According to some aspects of the invention, the TLB allocation information is encoded at initiator network interface units 7320 using ordering ID fields of the transaction protocol. According to other aspects of the invention, the TLB allocation information is encoded in protocol sideband signals transported from initiator network interface units 7320 to target network interface unit 7310. According to other aspects of the invention, the TLB allocation information is encoded in network interface unit identifier fields of the transport protocol.

According to some aspects of the invention, STLB prefetch requests are sent from initiator network interface units 7320 to STLB 6213. The prefetch requests can be of the type described in U.S. Non-Provisional Patent Application Ser. No. ___, filed ___, titled SYSTEM TRANSLATION LOOK-ASIDE BUFFER WITH REQUEST-BASED ALLOCATION AND PREFETCHING (ART-024US1), and incorporated herein by reference. Subsystem interconnect 7300 is configured so that prefetch requests are sent or re-created in such a way that STLB 6213 can identify them. According to other aspects of the invention, initiator network interface units 7320 use ordering ID bits to distinguish normal requests from prefetch requests. According to other aspects of the invention, prefetch requests are indicated by sideband signals.
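One way to picture the ID-based encodings described in the last two paragraphs is to reserve a few low-order bits of the transaction ID for an allocation class and one bit for a prefetch flag. The field widths and layout below are purely hypothetical; the description does not specify how the ID bits are arranged, and `encode_id` and `decode_id` are illustrative names.

```python
CLASS_BITS = 2                  # hypothetical width of the allocation-class field
PREFETCH_BIT = 1 << CLASS_BITS  # hypothetical position of the prefetch flag
FIELD_BITS = CLASS_BITS + 1     # total bits reserved below the initiator ID


def encode_id(initiator_id, alloc_class, is_prefetch):
    """Pack initiator ID, allocation class, and prefetch flag into one ID."""
    if not 0 <= alloc_class < (1 << CLASS_BITS):
        raise ValueError("allocation class does not fit in the class field")
    low = alloc_class | (PREFETCH_BIT if is_prefetch else 0)
    return (initiator_id << FIELD_BITS) | low


def decode_id(txn_id):
    """Recover (initiator_id, alloc_class, is_prefetch) at the STLB side."""
    alloc_class = txn_id & ((1 << CLASS_BITS) - 1)
    is_prefetch = bool(txn_id & PREFETCH_BIT)
    initiator_id = txn_id >> FIELD_BITS
    return initiator_id, alloc_class, is_prefetch
```

Because the fields travel inside an existing ID, cascaded interconnects can forward them unaltered as long as they preserve those ID bits.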

According to an aspect of the invention, initiator network interface units 7320 are programmable to differentiate between normal requests and prefetch requests.

According to an aspect of the invention, TLB allocation information and prefetch identification information can be sent unaltered from initiator network interface units 7320 to target network interface units 7310, so that any number of subsystem interconnects 7300 can be cascaded and still provide allocation information to STLB 6213.

As shown in FIG. 8, according to an aspect of the invention, STLBs 8400 share an intermediate-level translation cache 8410. FIG. 8 shows an initiator 8420 connected to two STLBs 8400. Each of the STLBs 8400 is connected to the intermediate-level translation cache 8410, which is connected to a walker through walker interface 8430. Translation requests that miss in both an STLB 8400 and the intermediate-level translation cache 8410 are sent to the walker through port 8430.

According to an aspect of the invention, the intermediate-level translation cache 8410 is larger than the caches in the STLBs 8400, and the STLBs 8400 share the extra capacity of the intermediate-level translation cache 8410.

According to an aspect of the invention, requests received by the STLBs 8400 have cross-locality; that is, different STLBs 8400 need some of the same translations. The intermediate-level cache holds translations as they are returned by the walker, so that the second requesting STLB 8400 can find the translation it needs in the intermediate-level cache 8410 without incurring the delay of a walker request.
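The benefit of cross-locality can be seen in a small model of two STLBs sharing one intermediate-level cache. This is a minimal sketch with unbounded dictionaries standing in for the caches; the class names `MidLevelCache` and `Stlb` and the single-argument walker callable are hypothetical.

```python
class MidLevelCache:
    """Toy intermediate-level translation cache in front of the walker."""

    def __init__(self, walker):
        self.walker = walker  # callable(vpage) -> ppage (assumed interface)
        self.entries = {}

    def translate(self, vpage):
        if vpage not in self.entries:
            # Miss in the shared cache: only now is a walker request made.
            self.entries[vpage] = self.walker(vpage)
        return self.entries[vpage]


class Stlb:
    """Toy STLB with a private table backed by a shared mid-level cache."""

    def __init__(self, mid_level):
        self.mid_level = mid_level
        self.entries = {}

    def translate(self, vpage):
        if vpage not in self.entries:
            self.entries[vpage] = self.mid_level.translate(vpage)
        return self.entries[vpage]
```

When two `Stlb` instances built on the same `MidLevelCache` translate the same page, the walker callable is invoked once; the second STLB is served from the shared cache.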

According to an aspect of the invention, initiator 8420 is an initiator with multiple interfaces. Initiator 8420 distributes traffic among its ports. This distribution increases request bandwidth without increasing the width of the links. According to some aspects of the invention, the distribution is determined by interleaving an address range based on some address bits, so that specific address bits, or a hash of address bits, determine which port a request uses. According to other aspects of the invention, each port is driven by a cache dedicated to a portion of the address space. According to an aspect of the invention, the multiported initiator is a multimedia engine such as a 3D (GPU) engine, a 2D engine, a video engine, an image processing engine, or a signal processing engine.
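The address-based port selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_port`, the 256-byte interleave granule, and the XOR-fold hash are hypothetical choices, not details taken from the specification.

```python
def select_port(addr, num_ports=2, interleave_bytes=256, use_hash=False):
    """Choose an initiator port from request address bits (illustrative only).

    num_ports and interleave_bytes are assumed to be powers of two.
    """
    if num_ports & (num_ports - 1) or interleave_bytes & (interleave_bytes - 1):
        raise ValueError("num_ports and interleave_bytes must be powers of two")
    if num_ports == 1:
        return 0
    shift = interleave_bytes.bit_length() - 1   # drop the offset inside a granule
    port_bits = num_ports.bit_length() - 1      # number of bits that name a port
    mask = num_ports - 1
    chunk = addr >> shift
    if not use_hash:
        # Plain interleaving: the low bits above the granule offset pick the port.
        return chunk & mask
    # Hypothetical hash: XOR-fold all remaining address bits into port_bits bits.
    folded = 0
    while chunk:
        folded ^= chunk & mask
        chunk >>= port_bits
    return folded
```

With plain interleaving, consecutive 256-byte granules alternate between the two ports, which is why a long contiguous burst ends up split across ports that all touch the same page.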

Traffic coming from multiple ports of the same engine tends to have good page locality, especially if the distribution of requests among the ports is based, at least in part, on interleaving on low address bits. In that case, long contiguous bursts are split among the ports, and STLB latency is significantly reduced by the use of a shared intermediate-level translation cache.

A simulation environment is presented in FIG. 9 in accordance with various aspects of the invention. The simulation environment is implemented with computer-executable instructions run by a computer. Many types of computers can be used, such as a local computer or a cloud computer. The simulation begins with the invocation of the execution of the instructions.

According to an aspect of the invention, interconnect 9510 is simulated within simulation environment 9520. Interconnect 9510 includes STLB 9530. The same simulation environment can be used for an interconnect without an STLB or for an interconnect, such as interconnect 9510, that includes a TLB. This avoids the great complexity and difficult work required to integrate separate simulation environments for an interconnect and a separate STLB.

According to some aspects of the invention, simulation environment 9520 includes transactors, monitors, various other verification intellectual properties, and a scoreboard. The scoreboard is designed to support an interconnect. The simulation environment, including the scoreboard, can be reused for an interconnect with or without an internal STLB. The simulation environment is implemented in a register transfer level language such as Verilog or SystemVerilog.

According to other aspects of the invention, the simulation is a performance simulation. The simulation environment is implemented in a system-level modeling language such as SystemC. A common transaction socket modeling protocol is the Open SystemC Initiative (OSCI) Transaction Level Modeling (TLM) 2.0 standard.

FIG. 10 shows a system comprising a DMA engine 10110 that sends requests to STLB 10120, in which the virtual addresses of the requests are translated. Examples of DMA engines include I/O-based engines such as Ethernet, wireless, USB, and SATA controllers, as well as camera image signal processing engines, video engines, 2D and 3D engines, and accelerators such as TCP/IP and cryptography accelerators. DMA engine 10110 and STLB 10120 are in communication. DMA engine 10110 includes a virtual address generator 10140 that generates the sequences of virtual addresses to be sent in requests. According to some aspects of the invention, DMA engine 10110 and STLB 10120 are directly connected. According to other aspects of the invention, DMA engine 10110 and STLB 10120 are connected through another module. Backing store requests are made to interconnect 10130. Alternatively, a target such as a next-level cache or a backing store DRAM controller may take the place of interconnect 10130.

STLB 10120 caches address translations and may send translation requests, through walker interface 10160, for translations that are not in its cache. Alternatively, the walker may be integrated with the STLB, making it a complete SMMU. In that case, a translation table memory interface would be connected in place of walker interface 10160. As a further alternative in that case, walker interface 10160 may be absent, and page table memory requests would be sent directly to interconnect 10130.

STLB 10120 adds a modest latency to all requests sent by DMA engine 10110 as a result of the logic that performs translation look-ups and address modification. However, when a request misses in STLB 10120 (i.e., the page translation corresponding to the virtual address is not present in the translation cache), STLB 10120 goes through a lengthy process to retrieve the translation from the page table memory. This adds a significant delay to requests that incur TLB misses.

Depending on the ordering constraints of the request stream and the buffering capabilities of STLB 10120, the STLB may be able to handle multiple misses simultaneously. However, when its buffering capabilities are exhausted, STLB 10120 must process misses one at a time, which adds dramatically more to the request latency.

Simple block-based DMA engines can have good page locality. For example, with a common page size of 4 KB, a simple block-based DMA engine may place 64 consecutive 64 B requests within one page. The first request to the page misses, and its transmission is delayed. However, once the translation has been received by the STLB, it can be reused efficiently for the following 63 requests. While the first request incurs a significant latency penalty, the following requests can proceed quickly once the translation has been received. The overall performance penalty due to the presence of the STLB is therefore significant, but usually not crippling. The performance penalty depends on the latency of translation retrieval and on the rate of requests.

However, many DMA engines have request patterns with much poorer page locality. For example, a 2D DMA engine that reads an array vertically may send every request to a different page from the previous one. If the number of pages accessed is greater than the number of translations that can be stored in the TLB, misses can severely cripple performance. If the cache is large enough, the pages may still be in the cache when the 2D engine fetches the next column, which reduces the penalty.
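The contrast between the block-based and the vertical access patterns can be made concrete with a small Python sketch. It is not part of the described hardware: the function name, the simple "did the page change since the previous request" counting model, and the 12,800-byte vertical stride are assumptions chosen for illustration.

```python
def count_page_changes(start, n_requests, stride_bytes, page_bytes=4096):
    """Count requests that fall in a different page than the previous request.

    Each such request is a potential STLB miss when its translation is not cached.
    """
    changes = 0
    last_page = None
    for i in range(n_requests):
        page = (start + i * stride_bytes) // page_bytes
        if page != last_page:
            changes += 1
            last_page = page
    return changes


# 64 consecutive 64 B requests stay inside one 4 KB page.
block_dma = count_page_changes(0, 64, stride_bytes=64)

# 64 requests stepping down a column with an assumed 12,800 B row pitch.
vertical_2d = count_page_changes(0, 64, stride_bytes=12800)
```

Under these assumptions the block-based stream changes page once for 64 requests, while the vertical stream changes page on every request.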

In microprocessors, the addresses of requests are not easily computed in advance. However, many DMA engines have little dependency between the data they fetch and the addresses they request. Such DMA controllers can easily generate an accurate stream of anticipated requests.

This is naturally exploited by decoupling the virtual address generator from data buffer management, so that the virtual address generator is limited only when data management runs out of space or when the supported number of outstanding requests is reached. This property can be used to reduce the performance penalty of STLB misses.

FIG. 11 shows a system comprising a DMA engine 11210 connected to an STLB 11220, which is connected to a target 10130. DMA engine 11210 generates two virtual address streams. One is the normal virtual address stream, generated by normal virtual address generator 11240. The other is a stream of prefetch virtual addresses, generated by prefetch virtual address generator 11250, that is similar to the stream generated by normal virtual address generator 11240 but runs ahead of it.

According to some aspects of the invention, prefetch virtual address generator 11250 is identical to normal virtual address generator 11240. According to some aspects of the invention, prefetch virtual address generator 11250 shares some logic with normal virtual address generator 11240. Some types of logic that may be shared are configuration registers and portions of state machines.

According to some aspects of the invention, prefetch virtual address generator 11250 sends a prefetch request corresponding to each normal request that will later be sent by normal virtual address generator 11240. According to other aspects of the invention, prefetch virtual address generator 11250 avoids requesting prefetches for redundant consecutive virtual addresses that fall within pages accessed by a previous or recent request. One way to do so is to restrict the derived stream of prefetches to request addresses that correspond to a particular address within a page. Typically, that would be the first address in the page.
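One way to picture the filtered prefetch stream is the following Python sketch. It assumes a 4 KB page and filters only against the immediately preceding page; the generator name `filtered_prefetches` is hypothetical and the code is an illustration, not the claimed logic.

```python
PAGE_BYTES = 4096  # assumed page size


def filtered_prefetches(normal_addrs, page_bytes=PAGE_BYTES):
    """Yield one page-aligned prefetch address each time the stream enters a new page."""
    last_page = None
    for addr in normal_addrs:
        page = addr // page_bytes
        if page != last_page:
            last_page = page
            # Use the first address of the page as the prefetch address.
            yield page * page_bytes


# 128 sequential 64 B requests covering two pages produce two prefetches.
prefetches = list(filtered_prefetches(range(0, 2 * PAGE_BYTES, 64)))
```

A real design might compare against several recent pages instead of only the last one; that choice is left open here.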

According to some aspects of the invention, prefetch requests are legally formed according to the standard transaction protocol in use. Some standard transaction protocols are the Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface (AXI) and the Open Core Protocol (OCP). According to some aspects of the invention, normal requests are for a full cache line, whereas prefetch requests are for a small amount of data, such as zero bytes or one byte. A zero-byte request is illegal according to some protocols, but may be used as an indication of a prefetch. A one-byte request is legal according to most protocols and can be used as the prefetch size. A one-byte request (a byte being the minimum atomic unit of a data request) has the advantage of ensuring that the requested data does not span any boundary between defined address ranges or cadences of access restrictions, such as page-size-aligned ranges. Furthermore, a one-byte request requires exactly one cycle of data transfer on the interface. Alternatively, input requests can be forced to be page-aligned, such as by discarding the lower address bits. According to some aspects of the invention, prefetch requests must contain no data. According to some aspects of the invention, prefetch requests must be aligned to a particular boundary, such as 4 KB. According to other aspects of the invention, prefetch requests take special forms not anticipated by the protocol. According to some aspects of the invention, the prefetch request ID must differ from the IDs of normal requests. According to some aspects of the invention, pending prefetch requests must all have different IDs. According to some aspects of the invention, the STLB interface limits the number of pending prefetch requests to a certain number. The number of pending prefetch requests is dynamic, meaning that it can be changed in a manufactured chip, such as by software programming the number or by software selecting from a set of possible numbers.

According to some aspects of the invention, STLB 11220 receives both the normal and the prefetch request streams from DMA engine 11210. According to some aspects of the invention, normal requests and prefetch requests are sent on different physical or virtual request channels. According to other aspects of the invention, normal requests and prefetch requests are sent on the same channels, and prefetch requests are identifiable by STLB 11220.

Prefetch requests may be made identifiable in some, though not necessarily all, of the following ways:

by using a sideband signal sent with each request;

by using the ID of the request;

by using address bits of the request;

by using other fields, or a combination of fields, that are part of the request; and

by using a dedicated prefetch port that is separate from the port used for normal requests.

According to some aspects of the invention, prefetch virtual address generator 11250 includes control logic to send requests based on the current state of normal virtual address generator 11240, buffer availability in DMA engine 11210, or the interaction of prefetch virtual address generator 11250 with STLB 11220.

According to some aspects of the invention, prefetch virtual address generator 11250 is tethered to normal virtual address generator 11240, so that it only generates virtual addresses that are within a distance of N virtual addresses ahead of the virtual addresses generated by normal virtual address generator 11240. Typically, the normal address used to calculate this distance is the most recently generated normal address. According to some aspects of the invention, the tether is based on a filtered prefetch virtual address stream (i.e., one in which a prefetch virtual address is not generated if the request meets certain criteria, such as having an address in the same page as another recent request), so that the prefetch virtual address stream only generates virtual addresses that are M filtered virtual addresses ahead of the normal stream. This is useful for controlling the number of new pages requested by the prefetch stream, so as to avoid thrashing the TLB cache with too many prefetches. According to some aspects of the invention, N and M may be hard-coded, dynamically programmable, or part of a table of possible values.
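The tether can be illustrated with a short Python model that interleaves prefetch and normal addresses while keeping the prefetch generator at most N addresses ahead of the most recently generated normal address. The function name, the event tuples, and the unfiltered (distance N, not M) variant shown are assumptions made for the sketch.

```python
def tethered_schedule(addrs, n_ahead):
    """Return the order in which prefetch and normal addresses are generated.

    The prefetch generator may run at most n_ahead addresses ahead of the most
    recently generated normal address (illustrative model only).
    """
    events = []
    prefetch_idx = 0
    for normal_idx, addr in enumerate(addrs):
        # Before normal address normal_idx is generated, the most recent normal
        # index is normal_idx - 1, so prefetches may reach index normal_idx - 1 + n_ahead.
        limit = min(len(addrs), normal_idx + n_ahead)
        while prefetch_idx < limit:
            events.append(("prefetch", addrs[prefetch_idx]))
            prefetch_idx += 1
        events.append(("normal", addr))
    return events


schedule = tethered_schedule([0, 64, 128, 192], n_ahead=2)
```

With N = 2 the model issues two prefetches up front and then one prefetch per normal address, so the prefetch stream never gets more than two addresses ahead.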

According to some aspects of the invention, STLB 11220 supports allocation policies in its caches and structures. By limiting the tether to N or M so as to ensure that the allocated resources available in STLB 11220 are not exceeded, DMA engine 11210 ensures that the prefetch request stream takes advantage of the available resources and does not disrupt the behavior of the caches or of the request stream.

According to some aspects of the invention, STLB 11220 uses responses to prefetch requests as a way to indicate that a cache miss has occurred and that a slot in the tracking structure in STLB 11220 has been taken. In particular, if a prefetch request is received for a virtual address whose translation is already in the translation cache or is already pending as a miss, STLB 11220 returns a response immediately. If a prefetch request misses and a walker request for the page is not already pending, STLB 11220 returns a response only when the translation has been returned by the walker. According to some aspects of the invention, prefetch virtual address generator 11250 can be programmed to have only P prefetch requests pending. Together with STLB 11220 returning responses for misses only after the translation has been obtained, prefetch virtual address generator 11250 thereby ensures that it does not send more requests to STLB 11220 than the STLB can handle. This avoids backpressure from STLB 11220 and the dropping of prefetch requests by STLB 11220, neither of which is desirable. According to some aspects of the invention, P may be hard-coded, dynamically programmable, or part of a table of possible values.
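The limit of P pending prefetch requests behaves like a simple credit counter. The sketch below shows that bookkeeping only; the class and method names are hypothetical, and the STLB side (which decides when a response is returned) is not modeled.

```python
class PrefetchWindow:
    """Track pending prefetch requests against a programmable limit P."""

    def __init__(self, max_pending):
        self.max_pending = max_pending  # the value P
        self.pending = 0

    def can_send(self):
        return self.pending < self.max_pending

    def send(self):
        # Called when a prefetch request is issued to the STLB.
        if not self.can_send():
            raise RuntimeError("P prefetch requests are already pending")
        self.pending += 1

    def on_response(self):
        # Called when the STLB responds: immediately on a hit or an already
        # pending miss, or after the walker returns the translation on a new miss.
        if self.pending == 0:
            raise RuntimeError("response received with no prefetch pending")
        self.pending -= 1
```

Because responses to new misses arrive only after the translation is obtained, the number of pending prefetches tracked this way bounds the number of miss slots the generator can occupy.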

According to some aspects of the invention, STLB 11220 also supports allocation policies. Some allocation policies applicable to STLB prefetches are described in U.S. non-provisional patent application ___ (ART-024US1), filed on ___ and titled SYSTEM TRANSLATION LOOK-ASIDE BUFFER WITH REQUEST-BASED ALLOCATION AND PREFETCHING, and in U.S. provisional patent application Serial No. 61/684705 (Attorney Docket No. ART-024PRV), filed on August 18, 2012 and titled SYSTEM TRANSLATION LOOK-ASIDE BUFFER WITH REQUEST-BASED ALLOCATION AND PREFETCHING, both of which are incorporated herein by reference in their entirety. Using an allocation policy, DMA engine 11210 can reserve a specific number of entries in STLB 11220. By limiting the number of its outstanding prefetches to that number, DMA engine 11210 ensures that it never causes prefetch overflows in STLB 11220.

According to some aspects of the invention, STLB 11220 supports selectable or programmable allocation policies. Prefetch requests from DMA engine 11210 include an indication of which allocation policy STLB 11220 should use.

According to some aspects of the invention, DMA engine 11210 contains multiple normal virtual address generators. A prefetch virtual address generator is coupled to the multiple normal virtual address generators to enable them to take advantage of STLB prefetching.

A TLB, like any cache, has an eviction policy. The eviction policy determines which stored translation to replace when a new translation is fetched from the walker. Prefetches can be used to bias this eviction policy. According to an aspect of the invention, the TLB uses a least recently used (LRU) replacement policy, and a prefetch has the effect of "using" the entry's translation, thereby lowering its probability of eviction.

Refer now to the flowchart of FIG. 12. According to an aspect of the invention, the prefetch request stream is tethered. The prefetch request stream is identical to the normal request stream, but runs ahead of it. After the start 12300 of the process of generating virtual address streams, the DMA engine generates a normal virtual address 12310. Next, and based on the normal virtual address 12310, the DMA engine generates a prefetch virtual address 12320. The prefetch virtual address is sent, directly or through an intermediate module, to the STLB 12330. After some time, ideally at least as long as the STLB takes to fetch a translation, the DMA engine sends the normal virtual address, directly or through an intermediate module, to the STLB 12340.

"사용 카운터(use counter)"는 각각의 변환 엔트리와 연관된다. "사용 카운터"는, 각각의 프리페치 요청에 대해 증분되고, 각각의 노멀 요청에 대해 감분된다. 논-널(non-null) 사용 카운터는 엔트리가 퇴거되는 것을 방지한다. 미스하는 프리페치 요청은 퇴거를 요구한다. 그 퇴거가 방지되는 경우, 프리페치 요청은 정지된다. 이 메커니즘은, 요청 플로우의 국부성 및 STLB에서 이용가능한 엔트리들의 수의 함수로서, 테더의 길이를 자동으로 조절한다.A "use counter" is associated with each conversion entry. The "usage counter" is incremented for each prefetch request and decremented for each normal request. A non-null usage counter prevents an entry from being evicted. Miss prefetch requests require eviction. If the eviction is prevented, the prefetch request is stopped. This mechanism automatically adjusts the length of the tether as a function of the localization of the request flow and the number of entries available in the STLB.

A multi-dimensional engine, such as a rotation engine, takes a 2D surface and writes it with its x-y coordinates inverted. FIG. 13 shows arrays of surface data in memory. Source surface 13110 has its coordinates inverted to create destination surface 13120.

According to an aspect of the invention, the memory address of each pixel of a surface, based on its coordinates, is given by the following formula:

Addr = BASE + y * WIDTH + x * PIX_SIZE

where:

x and y are the coordinates of the pixel within the surface;

BASE is the base address of the surface;

WIDTH is the distance (in bytes) between the start of a row and the start of the next row; and

PIX_SIZE is the size of one pixel in bytes (typically 2 or 4 bytes).
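The formula translates directly into code. The following Python function is a minimal sketch of it; the function and parameter names are illustrative, and WIDTH is named `width_bytes` to stress that it is a row pitch in bytes, not a pixel count.

```python
def pixel_address(x, y, base, width_bytes, pix_size):
    """Addr = BASE + y * WIDTH + x * PIX_SIZE, with WIDTH given in bytes."""
    return base + y * width_bytes + x * pix_size


# Example with assumed values: 4-byte pixels and a 32-byte row pitch.
addr = pixel_address(x=1, y=2, base=0x1000, width_bytes=32, pix_size=4)
```

With these example values, moving one pixel in x advances the address by 4 bytes, while moving one pixel in y advances it by 32 bytes.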

According to other aspects of the invention, other formulas describe the arrangement of pixels at memory addresses.

Source surface 13110 and destination surface 13120 need not have the same parameters (BASE, WIDTH, PIX_SIZE).

A problem with conventional multi-dimensional engines is that, while one surface can be stepped through at incremental addresses of contiguous data (except, potentially, at the end of a row), the other surface must be stepped through with relatively large steps in address. This is shown in FIG. 14, where the mapping of pixel addresses is:

source surface (0) => destination surface (0)

source surface (32) => destination surface (4)

source surface (64) => destination surface (8)

source surface (96) => destination surface (12)

source surface (128) => destination surface (16)

The destination surface is written at incremental addresses of contiguous data (with PIX_SIZE = 4 bytes), while the source (SRC) surface is read with large jumps between pixels.
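The address pairs listed for FIG. 14 can be reproduced with a short Python sketch that applies the pixel address formula to a coordinate inversion. The 32-byte row pitch for both surfaces and a BASE of 0 are assumptions inferred from the listed addresses, and the function name is hypothetical.

```python
def inverted_mapping(n_pixels, src_width=32, dst_width=32, pix_size=4, x=0):
    """List (source address, destination address) pairs for one source column.

    Source pixel (x, y) is written to destination pixel (y, x); both bases are 0.
    """
    pairs = []
    for y in range(n_pixels):
        src_addr = y * src_width + x * pix_size  # large steps: one row per pixel
        dst_addr = x * dst_width + y * pix_size  # contiguous: one pixel per step
        pairs.append((src_addr, dst_addr))
    return pairs


pairs = inverted_mapping(5)
```

The source addresses advance by a full row pitch per pixel while the destination addresses advance by PIX_SIZE, which is the asymmetry the text describes.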

Memories such as dynamic random access memories (DRAMs), in which surfaces may be shared between writing and reading agents, are not efficient when accessing small units of data. In the example of FIG. 14, the write of the destination surface can be done efficiently, but the read of the source surface cannot.

This is traditionally solved in two steps:

(1) fetching from the source surface in larger blocks; and

(2) adding some intermediate storage to the multi-dimensional engine, so that the unneeded data from a large block fetch is kept long enough to still be in the intermediate storage when the multi-dimensional engine needs it.

In FIG. 15, the multi-dimensional engine reads from SRC in groups of contiguous pixels (two groups in this example). The engine then uses part of that data to write directly to DST, while the rest is stored temporarily. The width and height of the blocks can be chosen to maximize the use of DRAM while minimizing the buffering needed.

DRAMs typically behave near-optimally for bursts of 64-256 bytes, so a rectangular access region may be 16-128 pixels on a side. To reduce buffering, one dimension of the rectangle can be reduced.

Another problem arises when the addresses accessed by the multi-dimensional engine are virtual addresses.

가상 어드레싱 시스템에서, 메모리는 페이지들(통상적인 크기는 4KB임)로 구성된다. 물리적 어드레스들(PA)로의 가상 어드레스들(VA)의 맵핑은 불규칙한 경향이 있어서, 페이지 바운더리에 걸쳐 인접하는 VA들에 있는 픽셀들은 물리적으로-어드레싱된 메모리에서는 멀리 떨어져 있을 수도 있다.In a virtual addressing system, the memory consists of pages (the typical size is 4 KB). The mapping of virtual addresses VA to physical addresses PA tends to be irregular so that pixels in adjacent VAs across the page boundary may be far from physically-addressed memory.

Surfaces to be rotated within chips can exceed a WIDTH of 4 KB with a PIX_SIZE of 4 B. With virtually addressed page sizes of 4 KB, this means that a single row of pixels in a surface spans more than one page. As a consequence, the pixels in a column are not on the same page. Even with a WIDTH smaller than the page size, the page locality of the pixels in a column can be low enough to cause substantial performance problems due to STLB misses.

FIG. 16 shows a surface 16410 that is 3200 pixels wide with a PIX_SIZE of 4 B. Each row 16420 of the surface uses WIDTH = 3200 * 4 B = 12.8 kB. With 4 KB pages, each row uses 3.125 pages. The pixels of the first column are in pages 0, 3, 6, and so forth.
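The page arithmetic of FIG. 16 can be checked with a short sketch. The function name `first_column_pages` and the assumption that the surface starts at virtual address 0 are illustrative only and do not come from the specification.

```python
PAGE_SIZE = 4096       # 4 KB pages, as in the text
PIX_SIZE = 4           # bytes per pixel
WIDTH_PIXELS = 3200    # pixels per row


def first_column_pages(num_rows, base_va=0):
    """Return the virtual page index that holds the first pixel of each row.

    base_va is an assumed surface start address (0 by default).
    """
    row_bytes = WIDTH_PIXELS * PIX_SIZE  # 12,800 bytes, i.e. 3.125 pages
    return [(base_va + row * row_bytes) // PAGE_SIZE for row in range(num_rows)]


pages = first_column_pages(8)
print(pages)
```

Because each row is longer than one page, every pixel of a column falls in a different page, so a vertical traversal touches one new translation per row.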

In a virtual memory system, the multidimensional engine is connected to memory through a system memory management unit (SMMU). The SMMU takes VAs and converts them to PAs suitable for memory.

According to an aspect of the invention, as shown in FIG. 17, multidimensional engine 17510 is connected to memory 17530 through SMMU 17520. SMMU 17520 comprises a system translation look-aside buffer (STLB) 17522 and a walker 17524. STLB 17522 keeps track of recent VA-to-PA translations. Walker 17524 computes or looks up a translation from a translation table in memory when the translation for a requested VA is not present in STLB 17522.
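The division of work between the STLB and the walker can be sketched as a small software model. This is a minimal illustration under stated assumptions, not the hardware described here: the class name `Stlb`, the dictionary-based page table, the fixed 4 KB page size, and the absence of any capacity limit or replacement policy are all hypothetical.

```python
PAGE_SHIFT = 12  # assumed 4 KB pages


class Stlb:
    """Toy model: caches VA-to-PA page translations and asks a walker on a miss."""

    def __init__(self, walker):
        self.entries = {}     # virtual page number -> physical page number
        self.walker = walker  # callable standing in for the table walker
        self.misses = 0

    def translate(self, va):
        vpn = va >> PAGE_SHIFT
        if vpn not in self.entries:
            # Miss: the walker looks the translation up in the translation table.
            self.misses += 1
            self.entries[vpn] = self.walker(vpn)
        ppn = self.entries[vpn]
        # Replace the page-number bits, keep the offset within the page.
        return (ppn << PAGE_SHIFT) | (va & ((1 << PAGE_SHIFT) - 1))


page_table = {0: 7, 3: 2}  # made-up mapping for illustration
stlb = Stlb(page_table.__getitem__)
print(hex(stlb.translate(0x0010)), stlb.misses)
```

Only the first access to a page calls the walker in this model; later accesses to the same page hit in the cached entries.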

Walker 17524 takes from 2 to more than 20 memory accesses to resolve a translation. Two memory accesses are enough for a small VA space. Twenty or more memory accesses are required for large VA spaces, such as those represented with 64 bits, and for "nested paging", because of the extra layer of virtualization.

Because of this, the memory access traffic generated by walker 17524 during a traversal of surface 16410 in the vertical direction far exceeds the traffic needed to access the pixels themselves, and the duration of the stalls due to STLB misses can dramatically decrease throughput. It is therefore important to cache translations in STLB 17522.

An appropriate number of entries to cache in the STLB is the number of pages touched by a vertical traversal of the surface. When the memory used by rows of pixels exceeds the VA page size, one entry should be cached for each row in the surface.

Sizing the STLB to a number of entries equal to the height of the access region still presents the following problems:

(A) The flow of rotation reads and writes is interrupted (sometimes for a long period of time) when a row access reaches a new page, causing an STLB miss.

(B) For well-aligned surfaces, such as ones for which WIDTH is an integer number of pages, STLB misses occur back-to-back for every row each time the row access reaches a new page. This creates a large burst of traffic from the SMMU walker, delaying pixel traffic for a long time.
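Problem (B) can be illustrated by computing where each row crosses a page boundary. The helper `page_crossing_offsets` is a hypothetical name, and the surface is assumed to start at virtual address 0.

```python
def page_crossing_offsets(row, row_bytes, page_size=4096):
    """Byte offsets within a row at which a left-to-right access enters a new page.

    Assumes the surface starts at virtual address 0 and rows are contiguous.
    """
    start = row * row_bytes
    first = (-start) % page_size  # distance from the row start to the next boundary
    if first == 0:
        first = page_size         # row starts on a boundary; next crossing is a page later
    return list(range(first, row_bytes, page_size))


# WIDTH of exactly two pages: every row crosses at the same offset.
print([page_crossing_offsets(r, 8192) for r in range(3)])
# WIDTH of 3.125 pages: crossings are staggered from row to row.
print([page_crossing_offsets(r, 12800) for r in range(2)])
```

With an integer number of pages per row, all rows of the access region reach a new page at the same column, so their misses arrive together; with a non-integer row length the crossings are spread over different columns.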

According to an aspect of the invention, a translation prefetching mechanism is used in conjunction with an STLB to reduce or eliminate the delay due to STLB misses. The STLB receives prefetch commands from the multidimensional engine (or another coordinated agent) to trigger the walker to fetch a translation in anticipation of its near-future use. The walker places the new translation in the STLB, so that the translation is available in advance, or in a reduced amount of time after it is requested by the multidimensional engine.

FIG. 18 shows a multidimensional engine 18610, according to some aspects of the invention, connected to memory 17530 through SMMU 18620. Multidimensional engine 18610 makes data requests through physical channel 18640 and prefetch requests through physical channel 18650. According to other aspects of the invention, pixel and prefetch requests are sent on a shared physical channel and are distinguished by an attribute of the request (e.g., a bit, a command type, or a reserved request size).

According to some aspects of the invention, as shown in FIG. 19A, multidimensional engine 18610 comprises an address generator 19720 that generates the addresses needed to access the pixels in memory. Address generator 19720 is duplicated so that each generator is enabled to generate the same stream of addresses, but generates them earlier as prefetch requests on channel 18650 and later as data requests on channel 18640.

According to other aspects of the invention, as shown in FIG. 19B, multidimensional engine 19710 comprises an address generator 19730 that generates an advance stream of prefetch requests to addresses, ahead of the data requests to the same addresses.

According to another aspect of the invention, the prefetch generator is constrained to stay within a certain range of addresses of the regular stream.

According to another aspect of the invention, the distance is one page, so that for any row being accessed, the translation for the next page to be encountered may be prefetched, but not the one for the page after it.

According to other aspects of the invention, the distance may be set to less than one page or more than one page, depending on the buffering and the latency required to cover the walking time of prefetch requests.

Referring now to FIG. 20, according to an aspect of the invention, surface 20800 has rows of data, each row comprising 3.125 pages of data. Surface 20800 is accessed eight rows at a time, with sequential accesses moving toward data on the right. The raw prefetch address stream comprises all the addresses that will be accessed by the regular stream one page-worth of addresses later. At a particular time, data requests are sent for the data in column 20810. The raw prefetch stream is sent for the data in prefetch request column 20820.
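The relationship between the regular stream and the raw prefetch stream of FIG. 20 can be sketched as two generators over the same addresses, one running a page ahead. The generator names, the surface base address of 0, and the choice to stop the prefetch stream at the end of the row are assumptions made for this illustration.

```python
from itertools import islice

PAGE_SIZE = 4096
PIX_SIZE = 4
ROW_BYTES = 3200 * PIX_SIZE  # 12,800 bytes = 3.125 pages per row
GROUP_ROWS = 8               # rows accessed at a time


def data_stream(first_row, base_va=0):
    """Yield data addresses column by column across a group of eight rows."""
    for col_byte in range(0, ROW_BYTES, PIX_SIZE):
        for row in range(first_row, first_row + GROUP_ROWS):
            yield base_va + row * ROW_BYTES + col_byte


def raw_prefetch_stream(first_row, base_va=0, distance=PAGE_SIZE):
    """Yield the same addresses shifted ahead by `distance` bytes within each row."""
    for col_byte in range(0, ROW_BYTES - distance, PIX_SIZE):
        for row in range(first_row, first_row + GROUP_ROWS):
            yield base_va + row * ROW_BYTES + col_byte + distance


data = list(islice(data_stream(0), GROUP_ROWS))
prefetch = list(islice(raw_prefetch_stream(0), GROUP_ROWS))
print(data[:2], prefetch[:2])
```

At any moment the prefetch column sits one page-worth of bytes to the right of the data column, which corresponds to columns 20820 and 20810 in the figure.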

The raw stream is filtered to send just one prefetch per page. In particular, addresses are filtered out if they are not perfectly aligned on a page boundary. As a result, the prefetch of the next page is sent immediately after the last data element of the previous page is accessed, and exactly two translations per row need to be available.
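The filtering step can be sketched as a predicate on the raw prefetch addresses. The function name `filter_prefetches` is hypothetical; the example uses a single row that starts at virtual address 0.

```python
PAGE_SIZE = 4096


def filter_prefetches(raw_addresses, page_size=PAGE_SIZE):
    """Keep only the raw prefetch addresses that sit exactly on a page boundary."""
    return [addr for addr in raw_addresses if addr % page_size == 0]


# Raw prefetch addresses for one 12,800-byte row, running one page ahead of the data.
raw = range(4096, 12800, 4)
print(filter_prefetches(raw))
```

Out of more than two thousand raw addresses for the row, only one per page survives, so the walker sees one prefetch request for each page the row is about to enter.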

At the right edge of surface 20800, the data access column wraps to the start of the next group of eight rows, starting from the left edge 20830 of the surface. Upon wrapping, each access will cause a translation miss. According to another aspect of the invention, prefetch requests for the addresses corresponding to left edge 20830 are sent, despite the fact that some (most) of them are not perfectly aligned to a page boundary. This corresponds to the start condition for a new access region to be transferred, when the data on the starting edge of the access region is not aligned to a page boundary.

According to some aspects of the invention, prefetch traffic is limited so that it does not overwhelm the walker or the memory system. That is, the multidimensional engine discards or delays the issuance of prefetch requests based on its state. In particular, limits based on the bandwidth of prefetch requests, the current number of outstanding prefetch requests, and the maximum latency are possible.
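One of the limits mentioned above, a cap on the number of prefetch requests in flight, can be sketched as follows. The class `PrefetchThrottle`, its method names, and the choice to discard (rather than delay) requests over the cap are assumptions for illustration only.

```python
class PrefetchThrottle:
    """Toy limiter: discards prefetch requests once too many are in flight."""

    def __init__(self, max_outstanding):
        self.max_outstanding = max_outstanding
        self.outstanding = 0
        self.dropped = 0

    def try_issue(self):
        """Return True if the prefetch may be sent, False if it is discarded."""
        if self.outstanding >= self.max_outstanding:
            self.dropped += 1
            return False
        self.outstanding += 1
        return True

    def complete(self):
        """Called when the walker finishes one prefetch."""
        if self.outstanding > 0:
            self.outstanding -= 1


throttle = PrefetchThrottle(max_outstanding=2)
print([throttle.try_issue() for _ in range(3)])
```

A discarded prefetch costs only a later STLB miss on the corresponding data request, so dropping requests under pressure trades some latency for bounded walker traffic.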

According to some aspects of the invention, the STLB is sized to a number of translations equal to twice the height of the fetch access region when the prefetch window is limited to one page of width. This is because the whole prefetch window can contain only two pages (current and next) per row.

According to other aspects of the invention, the STLB is sized to a number of translations equal to 1 + (prefetch window width / page size) for the largest page size supported by the system.

These settings are optimal in the steady state (i.e., when the prefetch window is not touching the edges of the surface). However, when the prefetch window is at the starting edge or straddles access regions, there is a discontinuity in the pages to prefetch, since the new access region typically uses entirely different pages.

According to some aspects of the invention, the STLB is sized to 3 times the fetch height (for a page-wide prefetch window), or to 2 + (prefetch window size / page size) times the fetch height for other sizes. This allows the prefetch window to cover two different access regions with no interruption in prefetching.
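The sizing rules can be collected into one small calculation. The function name `stlb_entries` and its parameters are hypothetical, and multiplying the steady-state factor 1 + (window / page size) by the fetch height is an assumption inferred from the page-wide case (twice the fetch height).

```python
import math


def stlb_entries(fetch_height, window_bytes, page_size=4096, cover_two_regions=False):
    """Estimate the number of STLB translations for a given prefetch window.

    Steady state (assumed): (1 + window / page) * fetch height.
    Covering two access regions: (2 + window / page) * fetch height.
    """
    pages_per_row = (2 if cover_two_regions else 1) + window_bytes / page_size
    return math.ceil(pages_per_row * fetch_height)


print(stlb_entries(8, 4096))                          # page-wide window, steady state
print(stlb_entries(8, 4096, cover_two_regions=True))  # page-wide window, two regions
```

For an access region eight rows high and a one-page window, this gives 16 entries in the steady state and 24 entries when the window must span two access regions.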

In unaligned cases, the partially used pages at the left of the fetch access region are also used on the previous row at the right of the access region. On a wide enough surface, the page will have been replaced in the TLB by the time the prefetch window reaches the right side of the access region, so the page will have to be prefetched again. Increasing the raw prefetch stream filter size or adding special logic can make the repeated fetching unnecessary.

According to some aspects of the invention, the STLB is sized to 3 times the fetch height (for a page-wide prefetch window), or to 2 + (prefetch window size / page size) times the fetch height for other sizes, and the TLB is filtered to fix the page entries at the start of the access region until the end of the access region is reached.

As will be apparent to those of skill in the art upon reading this disclosure, each of the aspects described and illustrated herein has discrete components and features that may be readily separated from, or combined with, other features and aspects to form embodiments without departing from the scope or spirit of the present invention. Any recited method can be carried out in the order of events recited or in any other order that is logically possible.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present invention, representative illustrative methods and materials are now described.

All publications and patents cited in this specification are incorporated herein by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference, and are incorporated herein by reference to disclose and describe the methods and/or systems in connection with which the publications are cited. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present invention is not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided may differ from the actual publication dates, which may need to be independently confirmed.

Additionally, such equivalents are intended to include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein.

In accordance with the teachings of the present invention, a computer and a computing device are articles of manufacture. Other examples of an article of manufacture include: an electronic component residing on a motherboard, a server, a mainframe computer, or another special purpose computer, each having one or more processors (e.g., a central processing unit, a graphical processing unit, or a microprocessor) configured to execute computer readable program code (e.g., an algorithm, hardware, firmware, and/or software) to receive data, transmit data, store data, or perform methods.

The article of manufacture (e.g., a computer or computing device) includes a non-transitory computer readable medium or storage that includes a series of instructions, such as computer readable program steps or code encoded therein. In certain aspects of the present invention, the non-transitory computer readable medium includes one or more data repositories. Thus, in certain embodiments in accordance with any aspect of the present invention, computer readable program code (or code) is encoded in a non-transitory computer readable medium of the computing device. The processor, in turn, executes the computer readable program code to create or amend an existing computer-aided design using a tool. In other aspects of the embodiments, the creation or amendment of the computer-aided design is implemented as a web-based software application in which portions of the data related to the computer-aided design, the tool, or the computer readable program code are received by or transmitted to a computing device of a host.

An article of manufacture or system, in accordance with various aspects of the present invention, is implemented in a variety of ways: with one or more distinct processors or microprocessors, volatile and/or non-volatile memory, and peripherals or peripheral controllers; with an integrated microcontroller that has a processor, local volatile and non-volatile memory, peripherals, and input/output pins; with discrete logic that implements a fixed version of the article of manufacture or system; and with programmable logic that implements a version of the article of manufacture or system that can be reprogrammed through a local or remote interface. Such logic could implement a control system either in logic or via a set of commands executed by a soft-processor.

Accordingly, the preceding merely illustrates the various aspects and principles of the present invention. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, such equivalents are intended to include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the various aspects discussed and described herein. Rather, the scope and spirit of the present invention are embodied by the appended claims.

Claims

A system translation look-aside buffer comprising:
a translation table that stores address translations;
an input port enabled to receive an input request from an initiator; and
an output port enabled to send an output request corresponding to the input request,
wherein, if the input request is a prefetch, the output port does not send the output request,
wherein, if the input request matches an entry in the translation table, the translation table returns a translated address,
an address field is modified to use the translated address,
at least some bits of the address field are replaced by the translated address obtained from the translation table, and
the number of bits of the address field replaced by the translated address is derived from a page size corresponding to the entry in the translation table.

The system translation look-aside buffer of claim 1,
further comprising a prefetch port dedicated to receiving prefetch requests.

The system translation look-aside buffer of claim 1,
further comprising a sideband signal indicating whether the input request is a prefetch.

The system translation look-aside buffer of claim 1,
wherein the input request comprises an ID having a value, the value indicating whether the input request is a prefetch.

The system translation look-aside buffer of claim 1,
wherein the input request comprises an address, the address indicating whether the input request is a prefetch.

The system translation look-aside buffer of claim 1,
wherein the input request conforms to a standard transaction protocol regardless of whether the input request is a prefetch.

The system translation look-aside buffer of claim 1,
wherein the input request comprises a size indicating an amount of data requested by the input request,
the size indicating an amount of data that is less than the size of a full cache line.

The system translation look-aside buffer of claim 7,
wherein the size is zero.

The system translation look-aside buffer of claim 7,
wherein the size is one byte.

The system translation look-aside buffer of claim 1,
wherein the input request is page aligned.

The system translation look-aside buffer of claim 1,
wherein the translation table is set associative.

The system translation look-aside buffer of claim 11,
wherein the translation table comprises a static random access memory array.

The system translation look-aside buffer of claim 1,
wherein the translation table comprises a first translation table and a second translation table, the first translation table having a smaller number of entries than the second translation table.

The system translation look-aside buffer of claim 1,
further comprising allocation logic arranged to use a replacement policy.

The system translation look-aside buffer of claim 14,
wherein the replacement policy used can be changed on the fly.

The system translation look-aside buffer of claim 14,
wherein the replacement policy for a given request is based on an attribute of the input request.

The system translation look-aside buffer of claim 1,
wherein the translation table comprises a first group and a second group.

The system translation look-aside buffer of claim 17,
further comprising a configurable register array for determining whether allocation to the first group is allowed.

The system translation look-aside buffer of claim 17,
further comprising a programmable register array for determining whether allocation to the first group is allowed.

The system translation look-aside buffer of claim 17,
wherein an attribute of the input request is a value indicating whether the input request is to allocate an entry in the first group.

The system translation look-aside buffer of claim 20,
wherein the attribute is an indication of a source of the input request.

The system translation look-aside buffer of claim 20,
wherein the attribute is an identifier.

The system translation look-aside buffer of claim 20,
wherein the attribute is a sideband.

The system translation look-aside buffer of claim 1,
wherein the input request comprises an input request class, and
the translation table comprises a plurality of translation table entries, each comprising an entry class.

The system translation look-aside buffer of claim 24,
wherein, if the number of translation table entries having an entry class that matches the input request class exceeds a maximum number, the system translation look-aside buffer replaces a translation table entry having an entry class that matches the input request class.

The system translation look-aside buffer of claim 24,
wherein, if the number of translation table entries with a particular entry class is less than a minimum number, the translation table entries with that particular entry class are invalid.

The system translation look-aside buffer of claim 1,
further comprising a walker port for sending external translation requests,
wherein the input request comprises an address, and
if the translation table indicates a miss, the system translation look-aside buffer sends an external translation request.

The system translation look-aside buffer of claim 27,
further comprising an external translation request type,
wherein the external translation request type indicates whether the input request is a prefetch.

The system translation look-aside buffer of claim 27,
wherein sending the external translation request is temporarily blocked when the number of pending external translation requests reaches a threshold.

The system translation look-aside buffer of claim 27,
wherein sending the external translation request is temporarily blocked when the number of pending external translation requests due to prefetches reaches a threshold.

The system translation look-aside buffer of claim 27,
wherein the sending of the external translation request is discarded when the number of pending external translation requests due to prefetches reaches a threshold.

The system translation look-aside buffer of claim 27,
wherein the translation table comprises a first group and a second group, and
sending the external translation request is temporarily blocked when the number of pending external translation requests for the first group reaches a threshold.

The system translation look-aside buffer of claim 27,
wherein the input request comprises a class,
the class is associated with the external translation request, and
sending the external translation request is temporarily blocked when the number of pending external translation requests of the class reaches a threshold.

A system memory management unit comprising:
a walker; and
a system translation look-aside buffer,
the system translation look-aside buffer comprising:
a translation table that stores address translations;
an input port enabled to receive an input request from an initiator, the input request comprising an address, wherein, if the translation table indicates a miss, the system translation look-aside buffer sends an external translation request; and
an output port enabled to send an output request corresponding to the input request, wherein the output port does not send the output request if the input request is a prefetch,
wherein, if the input request matches an entry in the translation table, the translation table returns a translated address,
an address field is modified to use the translated address,
at least some bits of the address field are replaced by the translated address obtained from the translation table,
the number of bits of the address field replaced by the translated address is derived from a page size corresponding to the entry in the translation table, and
the system translation look-aside buffer is coupled to the walker.

The system memory management unit of claim 34,
further comprising an external translation request type indicating whether the input request is a prefetch.

The system memory management unit of claim 35,
wherein, when the external translation request type indicates that the input request is a prefetch, the system memory management unit is prohibited from generating side effects.

An interconnect comprising:
an initiator;
a target;
a walker;
a system translation look-aside buffer; and
a walker port for sending external translation requests, wherein the sending of an external translation request is discarded when the number of pending external translation requests due to prefetches reaches a limit,
the system translation look-aside buffer comprising:
a translation table that stores address translations;
an input port enabled to receive an input request from the initiator, the input request comprising an address, wherein, if the translation table indicates a miss, the system translation look-aside buffer sends an external translation request; and
an output port enabled to send an output request corresponding to the input request, wherein the output port does not send the output request if the input request is a prefetch,
wherein, if the input request matches an entry in the translation table, the translation table returns a translated address,
an address field is modified to use the translated address,
at least some bits of the address field are replaced by the translated address obtained from the translation table,
the number of bits of the address field replaced by the translated address is derived from a page size corresponding to the entry in the translation table, and
the input port of the system translation look-aside buffer is accessible by the initiator, the target is accessible by the output port of the system translation look-aside buffer, and the walker is accessible by the walker port of the system translation look-aside buffer.

(Deleted)