KR102340444B1

KR102340444B1 - A GPU cache bypassing method and apparatus with the adoption of monolithic 3D based network-on-chip

Info

Publication number: KR102340444B1
Application number: KR1020190173814A
Authority: KR
Inventors: 정성우; 콩 뚜안 두; 이영서
Original assignee: 고려대학교 산학협력단
Priority date: 2019-12-24
Filing date: 2019-12-24
Publication date: 2021-12-16
Also published as: KR20210081644A

Abstract

The present invention discloses a GPU cache bypass method and apparatus using a NoC structure based on monolithic 3D integration technology. According to the present embodiment, there is provided a GPU cache bypass apparatus using a NoC structure based on monolithic 3D integration technology, comprising: a type classification structure that tracks the L1 cache utilization of an individual warp, classifies the type of the individual warp, and determines a bypass threshold for the individual warp according to the classified warp type; a request bypass structure that adaptively determines whether to bypass a memory request of the individual warp by comparing a saturating counter (SC), which records the bypass history of the individual warp, with the bypass threshold of the individual warp; and a monolithic 3D-based NoC (Network on Chip) that receives the memory requests bypassed by the request bypass structure.

Description

A GPU cache bypassing method and apparatus with the adoption of monolithic 3D based network-on-chip

The present invention relates to a GPU cache bypass method and apparatus using a NoC structure based on monolithic 3D integration technology.

Over the past decade, graphics processing units (GPUs) have been used to improve the performance of general-purpose applications. A large number of threads (warps) can be launched simultaneously to utilize the computational resources of a GPU. However, massive multithreading does not always lead to a performance benefit.

In particular, because too many threads share the limited capacity of the L1 data cache (L1 cache), the cache hit rate becomes very low and cache contention occurs frequently. In addition, miss handling resources, such as miss status holding registers (MSHRs) and the miss buffer, often become congested. As a result, the L1 cache bottleneck reduces the efficiency of the cache hierarchy in the GPU.

Because the L1 cache capacity is limited, conventional cache replacement policies are not always effective at resolving cache contention. For applications with poor data locality, using the L1 cache congests the miss handling resources and degrades performance. In this context, allowing memory requests to bypass the cache is regarded as a solution for improving GPU cache management.

In a GPU, the L1 cache is connected to lower-level memory (e.g., the L2 cache) by a Network on Chip (NoC). When an L1 cache miss occurs, the memory request must traverse the NoC to retrieve the data. The NoC becomes congested when the network latency exceeds a threshold or when no network buffer is available for incoming packets. When cache bypassing is enabled, the L1 cache loses its role of filtering memory requests to lower-level memory. Consequently, more memory requests are delivered to the NoC within a short period, which aggravates NoC congestion. To resolve the NoC congestion caused by cache bypassing, the NoC must provide better network latency and throughput.

3D integration can reduce interconnect delay by stacking multiple layers.

The most commonly used form of 3D integration relies on through-silicon vias (TSVs) to implement vertical connections. However, TSV-based 3D integration still has major drawbacks, such as the large diameter and pitch of TSVs, which increase the delay and area of other logic.

Patent Registration Publication No. 10-1761301

To solve the problems of the prior art, the present invention proposes a GPU cache bypass method and apparatus, using a NoC structure based on monolithic 3D integration technology, that can reduce latency on a cache miss.

To achieve the above object, according to an embodiment of the present invention, there is provided a GPU cache bypass apparatus using a NoC structure based on monolithic 3D integration technology, comprising: a type classification structure that tracks the L1 cache utilization of an individual warp, classifies the type of the individual warp, and determines a bypass threshold for the individual warp according to the classified warp type; a request bypass structure that adaptively determines whether to bypass a memory request of the individual warp by comparing a saturating counter (SC), which records the bypass history of the individual warp, with the bypass threshold of the individual warp; and a monolithic 3D-based NoC (Network on Chip) that receives the memory requests bypassed by the request bypass structure.

The type classification structure may include: a tracking unit that tracks the L1 cache utilization of the individual warp; a warp type determining unit that determines at least three types according to the L1 cache utilization of the individual warp; and a bypass threshold allocator that assigns the bypass threshold of the individual warp according to the determined type.

The L1 cache utilization consists of the number of cache accesses (a), the number of cache misses (m), and the number of reservation failures (r) of the individual warp, and the type determining unit may determine the type of the individual warp by comparing the m/a and r/a of the individual warp with the averages of m/a and r/a over all warps.

The type determining unit may determine a warp whose m/a and r/a are both greater than or equal to the averages over all warps to be a T1 warp, a warp for which one of m/a or r/a is greater than the average over all warps to be a T2 warp, and a warp whose m/a and r/a are both smaller than the averages over all warps to be a T3 warp.

The bypass threshold allocator may assign different bypass thresholds to the T1, T2, and T3 warps.

When the probe result of the memory request is a tag miss, the value of the saturating counter may be incremented.

The saturating counter is indexed by the ID of the individual warp, and the request bypass structure may decide to bypass the L1 cache only when the value of the saturating counter is greater than or equal to the bypass threshold of the corresponding warp.

The monolithic 3D via-based NoC may include a plurality of routers arranged in a grid-type mesh structure, and each router may have a unique address defining its location in the network and six physical ports: one for each of the four cardinal directions (North, East, South, and West) and one each for the Up and Down directions.

According to another aspect of the present invention, there is provided a GPU cache bypass method using a NoC structure based on monolithic 3D integration technology, comprising: in a type classification structure, tracking the L1 cache utilization of an individual warp, classifying the type of the individual warp, and determining a bypass threshold for the individual warp according to the classified warp type; and, in a request bypass structure, adaptively determining whether to bypass a memory request of the individual warp by comparing a saturating counter (SC), which records the bypass history of the individual warp, with the bypass threshold of the individual warp, wherein the memory request bypassed by the request bypass structure is input to a monolithic 3D-based NoC (Network on Chip).

According to the present invention, more efficient cache management is possible by alleviating L1 cache contention and NoC congestion.

FIG. 1 is a diagram illustrating a GPU cache bypass apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a warp classification structure according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a request bypass structure according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a monolithic 3D-based NoC structure according to an embodiment of the present invention.

Since the present invention can be modified in various ways and can have various embodiments, specific embodiments are illustrated in the drawings and described in detail.

However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes included in the spirit and technical scope of the present invention.

FIG. 1 is a diagram illustrating a GPU cache bypass apparatus according to an embodiment of the present invention.

As shown in FIG. 1, the GPU cache bypass apparatus according to the present embodiment may include a warp classification structure (WCS) 100, a request bypass structure (RBS) 102, and a monolithic 3D-based NoC 104.

The warp classification structure 100 tracks the L1 cache utilization of each individual warp, classifies the type of each warp, and determines a bypass threshold according to the classified warp type.

FIG. 2 is a diagram illustrating a warp classification structure according to an embodiment of the present invention.

A warp is the minimum unit of instruction processing and is also referred to as a thread.

The warp classification structure 100 according to the present embodiment assigns a high bypass probability to memory requests from warps with low L1 cache utilization and a low bypass probability to memory requests from warps with high L1 cache utilization.

Referring to FIG. 2, the warp classification structure 100 may include a tracking unit 200, a warp type determining unit 202, and a bypass threshold allocator 204.

The warp classification structure 100 is designed to classify each warp into one of several types according to the L1 cache utilization of that warp.

To estimate the L1 cache utilization of each warp, the L1 cache is accessed during a period at the start of kernel execution called the training period.

During the training period, the tracking unit 200 tracks, for each warp, the number of cache accesses (a), the number of cache misses (m), and the number of reservation failures (r).

The number of reservation failures counts the cases in which the miss handling resources (e.g., MSHRs or the miss buffer) are unavailable for an L1 cache miss.
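The per-warp bookkeeping performed during the training period can be sketched as follows. This is a minimal illustration, not the patented hardware: the names `WarpStats`, `TrainingTracker`, and `record` are hypothetical, and the sketch assumes that a reservation failure is only counted on an access that is also a miss.

```python
from dataclasses import dataclass


@dataclass
class WarpStats:
    accesses: int = 0              # a: number of L1 cache accesses
    misses: int = 0                # m: number of L1 cache misses
    reservation_failures: int = 0  # r: misses with no free miss handling resource


class TrainingTracker:
    """Hypothetical software model of the tracking unit 200."""

    def __init__(self):
        self.stats = {}  # warp ID -> WarpStats

    def record(self, warp_id, miss, reservation_failed=False):
        """Record one L1 access made by `warp_id` during the training period."""
        s = self.stats.setdefault(warp_id, WarpStats())
        s.accesses += 1
        if miss:
            s.misses += 1
            # Assumption: a reservation failure is only meaningful for a miss.
            if reservation_failed:
                s.reservation_failures += 1


tracker = TrainingTracker()
tracker.record(0, miss=True, reservation_failed=True)
tracker.record(0, miss=True)
tracker.record(0, miss=False)
tracker.record(1, miss=False)
```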

When the training period ends, the warp type determining unit 202 assigns each of the plurality of warps to one of three types: T1 warp (high bypass probability), T2 warp (normal bypass probability), or T3 warp (low bypass probability).

A T1 warp is a warp whose m/a and r/a are both greater than or equal to the averages over all warps, a T2 warp is a warp for which one of m/a or r/a is greater than the average over all warps, and a T3 warp is a warp whose m/a and r/a are both smaller than the averages.
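The classification rule can be sketched in a few lines of Python. The function name `classify_warps` and the tuple layout are hypothetical. The sketch assumes that the averages are plain means of the per-warp ratios, that a warp with no accesses has ratios of 0, and that any warp that is neither T1 nor T3 (including tie cases the text does not spell out) falls into T2.

```python
def classify_warps(stats):
    """Classify warps as 'T1', 'T2', or 'T3'.

    stats: dict mapping warp ID -> (a, m, r), the access, miss, and
    reservation-failure counts collected during the training period.
    """
    if not stats:
        return {}

    # Per-warp ratios m/a and r/a (assumption: 0 when a == 0).
    ratios = {
        w: ((m / a, r / a) if a else (0.0, 0.0))
        for w, (a, m, r) in stats.items()
    }

    # Assumption: the "average over all warps" is the mean of per-warp ratios.
    n = len(ratios)
    avg_miss = sum(mr for mr, _ in ratios.values()) / n
    avg_resv = sum(rr for _, rr in ratios.values()) / n

    types = {}
    for w, (mr, rr) in ratios.items():
        if mr >= avg_miss and rr >= avg_resv:
            types[w] = "T1"  # both ratios at or above average
        elif mr < avg_miss and rr < avg_resv:
            types[w] = "T3"  # both ratios below average
        else:
            types[w] = "T2"  # mixed case
    return types


example = classify_warps({0: (10, 8, 4), 1: (10, 8, 0), 2: (10, 1, 0)})
```

In this example, warp 0 misses and fails reservation more often than average, warp 1 only misses more often than average, and warp 2 is below average on both ratios.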

The bypass threshold allocator 204 assigns a bypass threshold (Θ) to each warp type, and assigns a higher threshold to T1 warps than to T3 warps.

FIG. 3 is a diagram illustrating a request bypass structure according to an embodiment of the present invention.

The request bypass structure 102 may include a first identifier 300 that identifies the warp waiting in the request queue, a 2-bit saturating counter (SC) 302 that records the bypass history of each warp, a second identifier 304, and a bypass determining unit 306.

The request bypass structure 102 adaptively determines whether to bypass a memory request by using the 2-bit saturating counter (SC), which records the bypass history of the warp.

The saturating counter 302 is indexed by warp ID.

The warp type and the bypass threshold for each warp ID are determined by the warp classification structure 100.

The request bypass structure 102 operates after the training period. When a memory request arrives through the request queue, it immediately accesses the L1 cache tag array; this access is called a tag probe.

The probe result of the memory request is used to update the corresponding saturating counter 302 in the request bypass structure 102. The initial value of every saturating counter is 0. If the probe result is a tag miss, that is, if there is no matching tag, the value of the saturating counter is incremented. Otherwise, the value of the saturating counter is decremented.
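The counter update can be illustrated with a small Python model. The class name `SaturatingCounter` and its method names are hypothetical. The 2-bit width and the initial value of 0 come from the description, while clamping at 0 and at 3 is simply the usual behavior of a 2-bit saturating counter.

```python
class SaturatingCounter:
    """Hypothetical model of one saturating counter entry (SC 302)."""

    def __init__(self, bits=2):
        self.max_value = (1 << bits) - 1  # 3 for a 2-bit counter
        self.value = 0                    # every counter starts at 0

    def update(self, tag_miss):
        """Update the counter with the result of one tag probe."""
        if tag_miss:
            # Tag miss: count up, but never past the maximum.
            self.value = min(self.value + 1, self.max_value)
        else:
            # Tag hit: count down, but never below zero.
            self.value = max(self.value - 1, 0)
        return self.value


sc = SaturatingCounter()
for _ in range(5):
    sc.update(tag_miss=True)   # saturates at the maximum value
after_misses = sc.value
after_hit = sc.update(tag_miss=False)
```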

The bypass determining unit 306 compares the value of the saturating counter 302 with the bypass threshold (Θ) of the warp that issued the memory request, and decides to bypass the L1 cache only when the value of the saturating counter 302 is greater than or equal to that bypass threshold.

In this case, the returned data is sent directly to the SIMT (single-instruction, multiple-thread) core rather than to the L1 cache. Otherwise, the memory request accesses the L1 cache.
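Putting the counter and the threshold comparison together, the decision path can be sketched as below. The class `RequestBypassStructure`, the method `handle_request`, and the threshold values in the usage lines are all hypothetical; the description does not give concrete thresholds. The sketch also assumes that the counter is updated with the current probe result before the comparison is made.

```python
class RequestBypassStructure:
    """Hypothetical software model of the request bypass structure 102."""

    def __init__(self, thresholds, counter_bits=2):
        # thresholds: warp ID -> bypass threshold chosen by the classifier.
        self.thresholds = dict(thresholds)
        self.max_value = (1 << counter_bits) - 1
        # One saturating counter per warp ID, all starting at 0.
        self.counters = {w: 0 for w in self.thresholds}

    def handle_request(self, warp_id, tag_miss):
        """Return 'bypass' or 'l1' for one memory request of `warp_id`."""
        c = self.counters[warp_id]
        # Assumption: the tag probe result updates the counter first.
        c = min(c + 1, self.max_value) if tag_miss else max(c - 1, 0)
        self.counters[warp_id] = c
        # Bypass only when the counter has reached the warp's threshold.
        return "bypass" if c >= self.thresholds[warp_id] else "l1"


rbs = RequestBypassStructure({0: 2, 1: 3})  # hypothetical thresholds
decisions = [
    rbs.handle_request(0, tag_miss=True),   # counter 1, below threshold 2
    rbs.handle_request(0, tag_miss=True),   # counter 2, reaches threshold 2
    rbs.handle_request(0, tag_miss=False),  # counter 1, back below threshold
]
```

A request that returns `bypass` would have its data delivered straight to the SIMT core; a request that returns `l1` would go through the normal L1 access path.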

FIG. 4 is a diagram illustrating a monolithic 3D-based NoC structure according to an embodiment of the present invention.

FIG. 4 shows the structure of an M3D NoC (network-on-chip) built from common components such as routers and links. The memory requests bypassed by the request bypass structure 102 of FIG. 3 are input to the M3D NoC 104.

The routers of the M3D NoC 104 according to the present embodiment are arranged in a grid-type mesh structure across multiple layers.

Each router interfaces with a processing element (PE) through a network interface controller (NIC). Each router has a unique address (X, Y, Z) that defines its location in the network. In addition to the ports for the cardinal directions (North, East, South, and West), two physical ports (one Up and one Down) are added to the router, together with their associated buffers and VC/switch arbiters.

The crossbar of the router is extended from 5x5 to 7x7. Routers on the same layer are connected through 2D wires, and routers on different layers are connected through monolithic inter-tier vias (MIVs). The M3D NoC uses the XYZ routing algorithm.
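XYZ routing is a dimension-ordered scheme: a packet first travels along X, then along Y, and finally along Z until it reaches the destination address. The sketch below illustrates this on (X, Y, Z) router addresses. The function names and the mapping of port names to coordinate directions (East as +X, North as +Y, Up as +Z) are assumptions made for illustration, and `Local` stands for the port toward the PE, which is presumably the seventh port of the 7x7 crossbar.

```python
# Assumed mapping from output port to coordinate step (not from the source).
STEP = {
    "East": (1, 0, 0), "West": (-1, 0, 0),
    "North": (0, 1, 0), "South": (0, -1, 0),
    "Up": (0, 0, 1), "Down": (0, 0, -1),
}


def next_port(cur, dst):
    """Pick the output port at router `cur` for a packet headed to `dst`."""
    (x, y, z), (dx, dy, dz) = cur, dst
    if x != dx:                      # resolve the X dimension first
        return "East" if dx > x else "West"
    if y != dy:                      # then the Y dimension
        return "North" if dy > y else "South"
    if z != dz:                      # finally cross layers through Up/Down
        return "Up" if dz > z else "Down"
    return "Local"                   # arrived: deliver to the attached PE


def route(src, dst):
    """Return the list of output ports taken from `src` to `dst`."""
    path, cur = [], src
    while True:
        port = next_port(cur, dst)
        if port == "Local":
            return path
        path.append(port)
        cur = tuple(c + s for c, s in zip(cur, STEP[port]))


hops = route((0, 0, 0), (2, 1, 1))
```

Because every packet resolves the dimensions in the same fixed order, the path between two routers is deterministic.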

The above-described embodiments of the present invention have been disclosed for purposes of illustration. Those skilled in the art with ordinary knowledge of the present invention will be able to make various modifications, changes, and additions within the spirit and scope of the present invention, and such modifications, changes, and additions should be regarded as falling within the scope of the following claims.

Claims

1. A GPU cache bypass apparatus using a NoC structure based on monolithic 3D integration technology, comprising:
a type classification structure that tracks an L1 cache utilization of an individual warp, classifies the type of the individual warp, and determines a bypass threshold of the individual warp according to the classified warp type;
a request bypass structure that adaptively determines whether to bypass a memory request of the individual warp by comparing a saturating counter (SC), which records the bypass history of the individual warp, with the bypass threshold of the individual warp; and
a monolithic 3D-based NoC (Network on Chip) to which a memory request bypassed by the request bypass structure is input,
wherein the value of the saturating counter is incremented when the probe result of the memory request is a tag miss.

2. The apparatus of claim 1,
wherein the type classification structure comprises:
a tracking unit that tracks the L1 cache utilization of the individual warp;
a warp type determining unit that determines at least three types according to the L1 cache utilization of the individual warp; and
a bypass threshold allocator that assigns the bypass threshold of the individual warp according to the determined type.

3. The apparatus of claim 2,
wherein the L1 cache utilization consists of the number of cache accesses (a), the number of cache misses (m), and the number of reservation failures (r) of the individual warp, and
the type determining unit determines the type of the individual warp by comparing the m/a and r/a of the individual warp with the averages of m/a and r/a over all warps.

4. The apparatus of claim 3,
wherein the type determining unit determines a warp whose m/a and r/a are both greater than or equal to the averages over all warps to be a T1 warp, a warp for which one of m/a or r/a is greater than the average over all warps to be a T2 warp, and a warp whose m/a and r/a are both smaller than the averages over all warps to be a T3 warp.

5. The apparatus of claim 4,
wherein the bypass threshold allocator assigns different bypass thresholds to the T1, T2, and T3 warps.

6. (Deleted)

7. The apparatus of claim 1,
wherein the saturating counter is indexed by the ID of the individual warp, and
the request bypass structure decides to bypass the L1 cache only when the value of the saturating counter is greater than or equal to the bypass threshold of the corresponding warp.

8. The apparatus of claim 1,
wherein the monolithic 3D-based NoC includes a plurality of routers arranged in a grid-type mesh structure, and
each router has a unique address defining its location in the network and six physical ports: one for each of the four cardinal directions (North, East, South, and West) and one each for the Up and Down directions.

9. A GPU cache bypass method using a NoC structure based on monolithic 3D integration technology, comprising:
in a type classification structure, tracking an L1 cache utilization of an individual warp, classifying the type of the individual warp, and determining a bypass threshold of the individual warp according to the classified warp type; and
in a request bypass structure, adaptively determining whether to bypass a memory request of the individual warp by comparing a saturating counter (SC), which records the bypass history of the individual warp, with the bypass threshold of the individual warp,
wherein the memory request bypassed by the request bypass structure is input to a monolithic 3D-based NoC (Network on Chip), and
the value of the saturating counter is incremented when the probe result of the memory request is a tag miss.