KR101946476B1

KR101946476B1 - Early miss prediction based periodic cache bypassing technique, streaming multiprocessor and embedded system performed by the technique

Info

Publication number: KR101946476B1
Application number: KR1020170151765A
Authority: KR
Inventors: 김철홍; 콩 튜안 두; 김광복
Original assignee: 전남대학교산학협력단
Priority date: 2017-11-14
Filing date: 2017-11-14
Publication date: 2019-05-20

Abstract

The present invention relates to a periodic cache bypassing technique based on cache hit ratio prediction, and to a streaming multiprocessor and an embedded system using the technique. More specifically, it relates to a periodic cache bypassing technique that reduces the number of accesses to the L1 data cache, thereby improving the performance of an embedded system while also improving the cache hit ratio, and to a streaming multiprocessor and an embedded system using the technique.

Description

The present invention relates to a periodic cache bypassing technique based on cache hit ratio prediction, and to a streaming multiprocessor and an embedded system to which the technique is applied. More specifically, it relates to a periodic cache bypassing technique that can improve the performance of an embedded system by reducing the number of accesses to the L1 data cache while also improving the cache hit ratio, and to a streaming multiprocessor and an embedded system to which the technique is applied.

The cache bypassing technique is a technique that can improve the cache hit ratio and reduce memory stalls.

Recently, general-purpose computing graphics processing units (GPGPU: General-Purpose computing on Graphics Processing Units), which can perform general computing operations as well as graphics operations, have been developed.

FIG. 1 shows the structure of a typical general-purpose computing graphics processing unit. The general-purpose computing graphics processing unit (10) includes a plurality of streaming multiprocessors (11, SM: Streaming Multiprocessor) for processing work in parallel, together with an L2 cache (13), a memory controller (14), and a lower-level memory (15), which are connected through an internal network (12).

FIG. 2 illustrates the pipeline of a typical general-purpose computing graphics processing unit. First, the warp scheduler determines the execution order of warps (S1000), the fetch unit fetches an instruction from physical memory (S200), and the instruction is stored in the instruction cache (I-C, Instruction cache) (S300).

Next, the decode unit decodes the instruction to determine which instruction is to be executed (S400), the issue unit indicates the instruction to be processed (S500), and the instruction is temporarily stored in the register file (S600).

Next, depending on the type of instruction stored in the register file, either an operation is performed in the arithmetic logic unit (ALU) (S800) or the load/store unit accesses memory and fetches data to the core (S700), and the result of the operation is written to the register file (S900).

In the process of fetching data to the core (S700), the address generator first generates the address at which the data is stored (S710), and the generated address is temporarily stored in the load/store queue and then output (S720).

Then the data cache (D-cache) of the L1 cache is checked for data at that address. If the data is present, it is fetched to the core (S730); otherwise, the data is fetched to the core from the L2 cache, which is the lower-level memory (S750), and the fetched data is stored in the data cache. The miss state of the data is recorded in the miss-status holding registers (MSHRs, Miss-Status Holding Registers) (S740).
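The conventional fetch path described above can be sketched as follows. This is a simplified illustration only: the dictionaries standing in for the L1 data cache and the L2 cache, the list standing in for the MSHRs, and the function name `baseline_load` are all hypothetical, and cache capacity, sets, and replacement are not modeled.

```python
# Simplified model of the conventional fetch path (S730 to S750).
# l1_cache, l2_memory, mshr and baseline_load are hypothetical names.

def baseline_load(address, l1_cache, l2_memory, mshr):
    """Return (data, hit) for a load that always probes the L1 data cache first."""
    if address in l1_cache:            # S730: data present, fetch it to the core
        return l1_cache[address], True
    mshr.append(address)               # S740: record the miss in the MSHRs
    data = l2_memory[address]          # S750: fetch from the lower-level memory
    l1_cache[address] = data           # the fetched data is stored in the data cache
    return data, False


l1 = {0x10: "a"}
l2 = {0x10: "a", 0x20: "b"}
mshr = []
first = baseline_load(0x20, l1, l2, mshr)    # miss: goes to L2 and fills L1
second = baseline_load(0x20, l1, l2, mshr)   # hit: now served from L1
```

Note that every load probes the L1 data cache before anything else, which is the behavior the invention sets out to change.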

Each block of the data cache (D-cache) consists of a tag and data, and the tag holds the tag information corresponding to the address.

Such a conventional general-purpose computing graphics processing unit processes work with extensive parallelism using streaming multiprocessors and can therefore improve processing performance. However, performance challenges such as cache contention and resource congestion lower cache efficiency. In particular, because the L1 data cache is always accessed regardless of the block state of the data cache, a pipeline stall cannot be avoided when a data miss occurs.

[Prior Art Literature]

[Patent Literature]

1. Korean Patent No. 10-1662363, Host device and method for accessing a virtual file in a storage device by bypassing a cache in the host device

2. Korean Patent No. 10-1612155, Method and structure for using region locks to divert I/O requests of a storage controller having multiple processing stacks

The present invention has been devised to solve the problems described above. An object of the present invention is to provide a cache bypassing technique that can improve the performance of an embedded system by improving the cache hit ratio and reducing memory stalls, and a streaming multiprocessor and an embedded system to which the cache bypassing technique is applied.

To achieve the above object, the present invention provides a streaming multiprocessor (SM: streaming multiprocessor) in a general-purpose computing graphics processing unit (GPGPU: General-Purpose computing on Graphics Processing Units), wherein the streaming multiprocessor fetches data stored in an internal L1 data cache to a core or, when the desired data is not present in the L1 data cache, accesses a lower-level memory to fetch the desired data and stores the data in the L1 data cache, the streaming multiprocessor comprising: a miss prediction unit (EMP unit: Early Miss Prediction unit) in which the tag information of the L1 data cache is copied and stored, and which receives an address generated by an address generator and outputs a hit state when the address is present in the tag information and a miss state otherwise; and a periodic cache bypass unit (P unit: Periodic Cache Bypassing unit) which receives the address generated by the address generator and the hit state or miss state from the miss prediction unit, in which a cache bypass period (bypassing period), during which the lower-level memory is accessed, and a refresh period (refreshing period), during which the L1 data cache is accessed, are set, and which causes data to be fetched by bypassing to the lower-level memory when the current time is in the cache bypass period and the output of the miss prediction unit is a miss.

In a preferred embodiment, even when the current time is in the cache bypass period, the periodic cache bypass unit accesses the L1 data cache to fetch data if the output of the miss prediction unit is a hit.

In a preferred embodiment, when the current time is in the refresh period, the periodic cache bypass unit accesses the L1 data cache to fetch data regardless of the output of the miss prediction unit.

In a preferred embodiment, the tag information of the L1 data cache is copied to and stored in the miss prediction unit during the refresh period.

In a preferred embodiment, the streaming multiprocessor includes: an address generator that generates the address at which the desired data is stored; and a load/store queue that receives and temporarily stores the address from the address generator and sequentially outputs it to the periodic cache bypass unit. When an address is output from the load/store queue to the periodic cache bypass unit, the miss prediction unit outputs a hit state to the periodic cache bypass unit if a tag corresponding to the output address is present in the miss prediction unit, and a miss state otherwise.

The present invention further provides a general-purpose computing graphics processing unit including the streaming multiprocessors.

The present invention further provides an embedded system in which the general-purpose computing graphics processing unit is embedded.

The present invention further provides a cache bypassing technique for fetching data from the L1 data cache, or fetching data from the lower-level memory by bypassing the L1 data cache, using the streaming multiprocessor, the technique comprising: determining the cache bypass period and the refresh period and storing them in the periodic cache bypass unit; generating an address in the address generator; loading the address generated by the address generator into the load/store queue while, at the same time, the miss prediction unit checks whether a tag corresponding to the address is stored and outputs a hit state if it is and a miss state if it is not; outputting the address from the load/store queue to the periodic cache bypass unit while, at the same time, the hit state or miss state from the miss prediction unit is input to the periodic cache bypass unit; and, when the current time is in the cache bypass period and the output of the miss prediction unit is a miss state, causing, by the periodic cache bypass unit, data to be fetched by bypassing to the lower-level memory without accessing the L1 data cache.

In a preferred embodiment, even when the current time is in the cache bypass period, the periodic cache bypass unit accesses the L1 data cache to fetch data if the output of the miss prediction unit is a hit state, and when the current time is in the refresh period, it accesses the L1 data cache to fetch data regardless of the output of the miss prediction unit.

The present invention has the following excellent effects.

First, according to the cache bypassing technique of the present invention and the streaming multiprocessor and embedded system to which it is applied, the L1 data cache is not accessed on every cycle; when the current time is in the cache bypass period and the block state information from the miss prediction unit is a miss, the access bypasses the L1 data cache and goes to the lower-level memory. This improves the cache hit ratio and, by reducing accesses to the L1 data cache, improves the performance of the embedded system.

FIG. 1 is a diagram showing the structure of a typical general-purpose computing graphics processing unit;
FIG. 2 is a diagram for explaining the pipeline of a typical general-purpose computing graphics processing unit;
FIG. 3 is a diagram for explaining a cache bypassing technique according to an embodiment of the present invention;
FIG. 4 is a graph showing the number of instructions processed per clock when the cache bypassing technique according to an embodiment of the present invention is applied; and
FIG. 5 is a graph showing the miss rate of the L1 data cache when the cache bypassing technique according to an embodiment of the present invention is applied.

The terms used in the present invention have been selected, as far as possible, from general terms in wide current use. In certain cases, however, terms have been chosen arbitrarily by the applicant, and in such cases their meaning should be understood from the meaning described or used in the detailed description of the invention rather than from the name of the term alone.

Hereinafter, the technical configuration of the present invention will be described in detail with reference to the preferred embodiments shown in the accompanying drawings.

However, the present invention is not limited to the embodiments described herein and may be embodied in other forms. Like reference numerals denote like elements throughout the specification.

FIG. 3 illustrates a cache bypassing technique according to an embodiment of the present invention. The cache bypassing technique of the present invention accesses the data cache selectively when a streaming multiprocessor fetches data from the data cache to the core, thereby improving the cache hit ratio and reducing memory stalls.

The cache bypassing technique of the present invention is performed in a streaming multiprocessor (SM: Streaming Multiprocessor).

That is, the present invention may be provided in the form of a streaming multiprocessor, which is a device that fetches data using the cache bypassing technique of the present invention.

The present invention may also be provided as a general-purpose computing graphics processing unit (GPGPU: General-Purpose computing on Graphics Processing Units) including a plurality of the streaming multiprocessors, or in the form of an embedded system in which the general-purpose computing graphics processing unit is embedded.

Referring to FIG. 3, the cache bypassing technique of the present invention differs from the pipeline shown in FIG. 2 only in the data fetch process ('S700' in FIG. 2); the remaining steps are substantially the same.

To perform the cache bypassing technique of the present invention, new hardware must be added to an existing streaming multiprocessor: a miss prediction unit (EMP unit: Early Miss Prediction unit, a kind of 'memory') and a periodic cache bypass unit (P unit: Periodic Cache Bypassing unit, for example a 'combination of a multiplier and a multiplexer'). This hardware may be physically added to an existing streaming multiprocessor, or it may be added by programmatically reconfiguring hardware that is already present.

The tags of the L1 data cache are copied to and stored in the miss prediction unit, and the tags contain the tag information corresponding to the block addresses of the L1 data cache.

The periodic cache bypass unit is placed between the load/store queue and the L1 data cache and decides whether to access the L1 data cache to fetch data or to access the lower-level memory (L2 cache) directly to fetch data.

Hereinafter, the cache bypassing technique (S1000) according to the present invention is described in detail.

First, the address generator of the streaming multiprocessor generates and outputs the address from which data is to be fetched (S1100).

Before that, a cache bypass period (bypassing period, 'a'), during which accesses are given the opportunity to bypass to the lower-level memory, and a refresh period (refreshing period, 'b'), during which the L1 data cache must be accessed, are set in the periodic cache bypass unit (S1010).

The cache bypass period and the refresh period repeat and together form one period (T).

The ratio between the duration of the cache bypass period (0 to t1) and the duration of the refresh period (t1 to t2) can be determined experimentally by the designer; in the present invention, the ratio of the period 'T' to the cache bypass period 'a' was set to 6:5.
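With T:a = 6:5, the refresh period b = T - a occupies one sixth of each period T. The following minimal sketch shows how a cycle count could be mapped to the two phases. It assumes that time is counted in clock cycles and uses the smallest integer values T = 6 and a = 5; the specification gives only the ratio, so the absolute values and the function name `in_bypass_period` are hypothetical.

```python
# Phase selection within one repeating period T (bypass part first, then refresh).
# period_t=6 and bypass_a=5 are assumed values that only reproduce the 6:5 ratio.

def in_bypass_period(cycle, period_t=6, bypass_a=5):
    """True while the cycle falls in the bypass part (0 to t1) of period T."""
    return (cycle % period_t) < bypass_a


# Two consecutive periods: five bypass cycles followed by one refresh cycle each.
phases = ["bypass" if in_bypass_period(c) else "refresh" for c in range(12)]
```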

Next, the address output from the address generator is temporarily stored in the load/store queue and then output (S1200).

At the same time, the miss prediction unit is searched for tag information corresponding to the address output from the address generator, and block state information (M-hint: Miss hint) indicating whether the address is present in the tag information is output (S1300).

The block state information is output as a hit state when the address is present in the tags, and as a miss state otherwise.
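The lookup that produces the M-hint can be illustrated with a short sketch. It assumes a block size of 128 bytes and treats the whole block address as the tag (no set index); both are simplifications chosen for illustration, and the names `BLOCK_SIZE`, `block_tag` and `m_hint` are hypothetical.

```python
# Illustrative M-hint lookup against the tag copy held in the miss prediction unit.
BLOCK_SIZE = 128  # bytes per cache block; an assumed value, not from the specification


def block_tag(address):
    """Map a byte address to the tag of the block that contains it (simplified)."""
    return address // BLOCK_SIZE


def m_hint(address, emp_tags):
    """Return 'hit' if the tag copy contains the block for this address, else 'miss'."""
    return "hit" if block_tag(address) in emp_tags else "miss"


emp_tags = {block_tag(0x1000)}        # tag copy currently holds one block
same_block = m_hint(0x1004, emp_tags)  # address inside the same 128-byte block
other_block = m_hint(0x2000, emp_tags) # address in a block that is not tracked
```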

The address output of the load/store queue and the block state information output of the miss prediction unit occur in the same cycle.

Next, the address output from the load/store queue and the block state information output from the miss prediction unit are input to the periodic cache bypass unit. Based on the block state information, the periodic cache bypass unit decides whether to access the L1 data cache to fetch data or to bypass the L1 data cache and fetch data directly from the lower-level memory (S1400).

When the current time (or current clock cycle) is in the cache bypass period and the block state information is a miss, the periodic cache bypass unit bypasses the L1 data cache and fetches data directly from the lower-level memory (S1700).

The fetched data is loaded into the register file and provided to the core, and the L1 data cache is accessed to write the data only when the data is modified.

However, even when the current time is in the cache bypass period, if the block state information is a hit, the desired data is in the L1 data cache, so the data is fetched from the L1 data cache to the core (S1500).

When the current time is in the refresh period, the L1 data cache is accessed to fetch data regardless of the block state information.
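The three cases above (S1700, S1500, and the refresh period) reduce to a single condition, which may be sketched as follows. The function name `select_target` and the string return values are hypothetical and serve only to make the decision explicit.

```python
# Decision made by the periodic cache bypass unit for one memory access.

def select_target(in_bypass_period, m_hint):
    """Return 'L2' only for a predicted miss during the bypass period, else 'L1'."""
    if in_bypass_period and m_hint == "miss":
        return "L2"   # bypass the L1 data cache (S1700)
    return "L1"       # predicted hit (S1500) or refresh period: access L1


# Full truth table over (phase, M-hint).
table = {
    (phase, hint): select_target(phase, hint)
    for phase in (True, False)
    for hint in ("hit", "miss")
}
```

Only one of the four combinations bypasses the L1 data cache; in the refresh period the M-hint is ignored.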

The miss state of a data fetch is recorded in the miss-status holding registers (MSHRs, Miss-Status Holding Registers) (S1600).

When data is fetched from the L1 data cache, or data is modified, during the refresh period, the tags of the L1 data cache are copied to and stored in the miss prediction unit.

In other words, when data is fetched from the L1 data cache during the cache bypass period, the tags are not copied to the miss prediction unit.
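The tag-copy rule can be sketched as below. The sketch assumes that the copy is a full snapshot of the L1 tags taken on an L1 access during the refresh period; the specification does not state the granularity of the copy, so this is only one possible reading, and the name `sync_emp_tags` is hypothetical.

```python
# Tag copy from the L1 data cache to the miss prediction unit after an L1 access.
# Assumption: the copy is modeled as a full snapshot of the L1 tag set.

def sync_emp_tags(in_bypass_period, l1_tags, emp_tags):
    """Return the tag set held by the miss prediction unit after this access."""
    if in_bypass_period:
        return set(emp_tags)   # bypass period: the tag copy is left unchanged
    return set(l1_tags)        # refresh period: the L1 tags are copied over


stale = sync_emp_tags(True, {1, 2, 3}, {1, 2})    # no copy during bypass period
fresh = sync_emp_tags(False, {1, 2, 3}, {1, 2})   # copy during refresh period
```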

That is, the cache bypassing technique according to an embodiment of the present invention does not access the L1 data cache on every cycle. It accesses the L1 data cache only when the current time is in the refresh period or, during the cache bypass period, when the block state information is a hit. This reduces accesses to the L1 data cache, which improves the performance of the embedded system while also improving the cache hit ratio.
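As a rough end-to-end illustration of this behavior, the following toy model counts L1 accesses and bypasses for a trace of block tags. It is a sketch under strong assumptions that are not part of the specification: one access per cycle, T = 6 and a = 5 cycles, an L1 data cache of unlimited capacity, and a full tag snapshot on each L1 access in the refresh period. The function name `simulate` is hypothetical, and the model says nothing about real performance.

```python
# Toy model of periodic cache bypassing with early miss prediction.

def simulate(trace, period_t=6, bypass_a=5):
    """Return (l1_accesses, bypasses) for a sequence of block tags."""
    l1_tags, emp_tags = set(), set()
    l1_accesses = bypasses = 0
    for cycle, tag in enumerate(trace):
        in_bypass = (cycle % period_t) < bypass_a
        predicted_hit = tag in emp_tags          # M-hint from the tag copy
        if in_bypass and not predicted_hit:
            bypasses += 1                        # go straight to the lower level
            continue
        l1_accesses += 1                         # access the L1 data cache
        l1_tags.add(tag)                         # fill on a miss, no change on a hit
        if not in_bypass:
            emp_tags = set(l1_tags)              # tag copy only in the refresh period
    return l1_accesses, bypasses


# Twelve accesses to the same block: the first five bypass, the rest reach L1.
result = simulate([7] * 12)
```

In this trace the block is unknown to the miss prediction unit until the first refresh cycle, so the first five accesses bypass the L1 data cache and the remaining seven access it.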

FIG. 4 is a graph showing the number of instructions processed per clock (IPC: Instruction per Clock) when the cache bypassing technique according to an embodiment of the present invention is applied (PEMP-16KB denotes a miss prediction unit capacity of 16 KB, and PEMP-32KB a capacity of 32 KB). Compared with the conventional pipeline (Baseline), the number of instructions processed per clock is the same or higher in all benchmark programs when the cache bypassing technique of the present invention is applied.

FIG. 5 is a graph showing the miss rate and the number of accesses of the L1 data cache when the cache bypassing technique according to an embodiment of the present invention is applied.

Referring to FIG. 5, taking the L1 data cache miss rate (Normalized L1 Miss Rate) of the conventional pipeline (Baseline) as '1', the miss rate is the same or lower in all benchmark programs except the LUD program when the cache bypassing technique of the present invention is applied.

Although the present invention has been shown and described with reference to preferred embodiments, it is not limited to the embodiments described above, and various changes and modifications may be made by those of ordinary skill in the art to which the invention pertains without departing from the spirit of the invention.

10: General-purpose computing graphics processing unit 11: Streaming multiprocessor
12: Internal network 13: L2 cache
14: Memory controller 15: Lower-level memory

Claims

1. A streaming multiprocessor (SM) in a general-purpose computing graphics processing unit (GPGPU),
wherein the streaming multiprocessor fetches data stored in an internal L1 data cache to a core or, when the desired data is not present in the L1 data cache, accesses a lower-level memory to fetch the desired data and stores the data in the L1 data cache,
the streaming multiprocessor comprising:
a miss prediction unit (EMP unit: Early Miss Prediction unit) in which the tag information of the L1 data cache is copied and stored, and which receives an address generated by an address generator and outputs a hit state when the address is present in the tag information and a miss state otherwise; and
a periodic cache bypass unit (P unit: Periodic Cache Bypassing unit) which receives the address generated by the address generator and the hit state or miss state from the miss prediction unit, in which a cache bypass period, during which the lower-level memory is accessed, and a refresh period, during which the L1 data cache is accessed, are set, and which causes data to be fetched by bypassing to the lower-level memory when the current time is in the cache bypass period and the output of the miss prediction unit is a miss.

2. The streaming multiprocessor of claim 1,
wherein, even when the current time is in the cache bypass period, the periodic cache bypass unit accesses the L1 data cache to fetch data if the output of the miss prediction unit is a hit.

3. The streaming multiprocessor of claim 2,
wherein, when the current time is in the refresh period, the periodic cache bypass unit accesses the L1 data cache to fetch data regardless of the output of the miss prediction unit.

4. The streaming multiprocessor of claim 3,
wherein the tag information of the L1 data cache is copied to and stored in the miss prediction unit during the refresh period.

5. The streaming multiprocessor of any one of claims 1 to 4,
wherein the streaming multiprocessor includes:
an address generator that generates the address at which the desired data is stored; and
a load/store queue that receives and temporarily stores the address from the address generator and sequentially outputs it to the periodic cache bypass unit,
and wherein, when an address is output from the load/store queue to the periodic cache bypass unit, the miss prediction unit outputs a hit state to the periodic cache bypass unit if a tag corresponding to the output address is present in the miss prediction unit, and a miss state otherwise.

6. A general-purpose computing graphics processing unit comprising the streaming multiprocessors of any one of claims 1 to 4.

7. An embedded system in which the general-purpose computing graphics processing unit of claim 6 is embedded.

8. A cache bypassing technique for fetching data from the L1 data cache using the streaming multiprocessor of claim 5, or fetching data from the lower-level memory by bypassing the L1 data cache, the technique comprising:
determining the cache bypass period and the refresh period and storing them in the periodic cache bypass unit;
generating an address in the address generator;
loading the address generated by the address generator into the load/store queue while, at the same time, the miss prediction unit checks whether a tag corresponding to the address is stored and outputs a hit state if it is and a miss state if it is not;
outputting the address from the load/store queue to the periodic cache bypass unit while, at the same time, the hit state or miss state from the miss prediction unit is input to the periodic cache bypass unit; and
when the current time is in the cache bypass period and the output of the miss prediction unit is a miss state, causing, by the periodic cache bypass unit, data to be fetched by bypassing to the lower-level memory without accessing the L1 data cache.

9. The cache bypassing technique of claim 8,
wherein, even when the current time is in the cache bypass period, the periodic cache bypass unit accesses the L1 data cache to fetch data if the output of the miss prediction unit is a hit state, and, when the current time is in the refresh period, accesses the L1 data cache to fetch data regardless of the output of the miss prediction unit.