KR20180012167A

KR20180012167A - Parallel processing unit, computing device including the same, and thread group scheduling method

Info

Publication number: KR20180012167A
Application number: KR1020160119514A
Authority: KR
Inventors: 정명수; 장지에
Original assignee: 주식회사 맴레이; 연세대학교 산학협력단
Priority date: 2016-07-26
Filing date: 2016-09-19
Publication date: 2018-02-05

Abstract

The present invention provides a parallel processing unit capable of reducing cache interference without lowering thread-level parallelism, a computing device including the same, and a warp scheduling method. In the parallel processing unit, which includes a multiprocessor, the multiprocessor comprises a memory unit that includes a level 1 data (L1D) cache and a shared memory to which a plurality of cache lines are allocated. When a predetermined first condition is satisfied for a first thread group in an active state, a thread group scheduler isolates a second thread group by redirecting the cache accesses of the second thread group to the shared memory.

Description

The present invention relates to a parallel processing unit, a computing device including the same, and a thread group scheduling method.

Parallel processing units such as graphics processing units (GPUs) have delivered large performance gains for general-purpose applications through massive thread-level parallelism (TLP). Current GPU architectures, which consist of many streaming multiprocessors (SMs), cluster multiple threads into thread groups, which may be called warps. To manage many warps effectively, each SM employs a hardware-based warp scheduler. For high performance, as many warps as possible need to be assigned to each SM so that utilization of the execution pipeline is maximized.

Such warp scheduling is not sufficient to maximize TLP when memory-intensive applications need to process large amounts of data. This is because the active warps generate too many memory requests in bursts and compete for an on-chip cache of limited capacity. This cache contention can seriously degrade performance. Diverse cache-aware warp scheduling policies have been proposed to address the problem. These policies aim to improve the L1D cache hit rate, and to improve performance by identifying active warps with a high potential of data locality and giving them a higher scheduling priority than other warps. That is, they throttle warps with a low potential of data locality, which reduces TLP, in order to raise the overall L1D cache hit rate and thereby overall performance. However, such policies may not be sufficient for memory-intensive applications with irregular cache access patterns, for the following reasons. The cache accesses of active warps with a high potential of data locality can often interfere severely with other warps. Consequently, simply scheduling active warps according to their potential of data locality can increase cache interference and work against improving the cache hit rate. It is also undesirable to sacrifice TLP in exchange for a better cache hit rate. That is, throttling active warps can improve the cache hit rate while overall performance still degrades.

An object of the present invention is to provide a parallel processing unit capable of reducing cache interference without lowering thread-level parallelism, a computing device including the same, and a warp scheduling method.

According to one embodiment of the present invention, a parallel processing unit including a multiprocessor is provided. The multiprocessor includes a memory unit and a thread group scheduler. The memory unit includes an L1D cache and a shared memory to which a plurality of cache lines are allocated. When a predetermined first condition is satisfied for a first thread group in an active state, the thread group scheduler isolates a second thread group by redirecting the cache accesses of the second thread group to the shared memory.

The first condition may include a condition that the level of cache interference for the first thread group exceeds a first cutoff.

The second thread group may include the thread group that interferes most frequently among the thread groups interfering with the first thread group.

When a predetermined second condition is satisfied, the thread group scheduler may redirect the cache accesses of the second thread group back to the L1D cache.

The second condition may include a condition that the level of cache interference for the first thread group is lower than a second cutoff, the second cutoff being lower than the first cutoff.

The second condition may include a condition that execution of the first thread group is completed.

In a cycle consisting of a first period and a second period, the thread group scheduler may evaluate the first condition at the end of the first period and evaluate the second condition at the end of the second period.

When a predetermined second condition is satisfied, the thread group scheduler may stall the isolated second thread group.

The second condition may include a condition that the level of cache interference for a third thread group isolated to the shared memory exceeds a first cutoff.

When a predetermined third condition is satisfied, the thread group scheduler may reactivate the stalled second thread group.

The third condition may include a condition that the level of cache interference for the third thread group is lower than a second cutoff, the second cutoff being lower than the first cutoff.

The second condition may include a condition that execution of the third thread group is completed.

In a cycle consisting of a first period and a second period, the thread group scheduler may evaluate the second condition at the end of the first period and evaluate the third condition at the end of the second period.

According to another embodiment of the present invention, a computing device is provided that includes the parallel processing unit according to the preceding embodiment, a CPU, a system memory, and a memory bridge connecting the parallel processing unit, the CPU, and the system memory.

According to yet another embodiment of the present invention, a parallel processing unit including a multiprocessor is provided. The multiprocessor includes a memory unit, which includes an L1D cache and a shared memory, and a thread group scheduler. The shared memory includes a plurality of shared memory banks each having a plurality of rows, and the shared memory banks are grouped into a plurality of bank groups. A plurality of cache lines are allocated to the shared memory banks belonging to each bank group, and the thread group scheduler redirects the cache accesses of some thread groups to the cache lines of the shared memory.

The tag for a given cache line and the number of the corresponding thread group may be allocated to a bank group different from the bank group to which that cache line is allocated.

According to yet another embodiment of the present invention, a thread group scheduling method is provided for a parallel processing unit comprising a memory unit that includes an L1D cache and a shared memory. The thread group scheduling method includes detecting a level of cache interference for a first thread group, and, when the level of cache interference exceeds a first cutoff, isolating a second thread group, which satisfies a predetermined condition among the thread groups interfering with the first thread group, by redirecting the cache accesses of the second thread group to the shared memory.

The thread group scheduling method may further include detecting the level of cache interference for the first thread group again, and, when the re-detected level is lower than a second cutoff that is lower than the first cutoff, redirecting the cache accesses of the second thread group back to the L1D cache.

The thread group scheduling method may further include detecting a level of cache interference for a third thread group isolated to the shared memory, and, when the level of cache interference for the third thread group exceeds a second cutoff, stalling the isolated second thread group.

The thread group scheduling method may further include detecting the level of cache interference for the third thread group again, and, when the re-detected level for the third thread group is lower than a third cutoff that is lower than the second cutoff, reactivating the stalled second thread group.

According to one embodiment of the present invention, cache interference can be reduced while thread-level parallelism is maintained.

FIG. 1 is a schematic block diagram of a computing device according to an embodiment of the present invention.
FIG. 2 is a schematic block diagram of a parallel processing unit according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example in which cache interference degrades data locality in the L1D cache of a typical multiprocessor.
FIG. 4 is a flowchart illustrating a warp scheduling method in a parallel processing unit according to an embodiment of the present invention.
FIGS. 5, 6, and 7 are each diagrams illustrating warp partitioning in a parallel processing unit according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating a warp scheduling method in a parallel processing unit according to another embodiment of the present invention.
FIGS. 9, 10, and 11 are each diagrams illustrating warp throttling in a parallel processing unit according to an embodiment of the present invention.
FIG. 12 is a diagram illustrating a method of managing the most frequently interfering warp in a parallel processing unit according to an embodiment of the present invention.
FIG. 13 is a diagram illustrating cache interference in a parallel processing unit.
FIG. 14 is a diagram illustrating a victim tag array in a parallel processing unit according to an embodiment of the present invention.
FIGS. 15 and 16 are each flowcharts illustrating a warp scheduling method in a parallel processing unit according to an embodiment of the present invention.
FIG. 17 is a schematic block diagram of a multiprocessor in a parallel processing unit according to an embodiment of the present invention.
FIG. 18 is a diagram illustrating the operation of a memory unit in a parallel processing unit according to an embodiment of the present invention.
FIG. 19 is a diagram illustrating the structure of a memory unit in a parallel processing unit according to an embodiment.
FIG. 20 is a diagram illustrating address translation in a parallel processing unit according to an embodiment.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the invention clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification.

FIG. 1 is a schematic block diagram of a computing device according to an embodiment of the present invention. FIG. 1 shows one example of a possible computing device; a computing device according to an embodiment of the present invention may be implemented with various other structures.

Referring to FIG. 1, a computing device according to an embodiment of the present invention includes a central processing unit (CPU) 110, a system memory 120, and a parallel processing unit 130.

The system memory 120 may communicate with the CPU 110 through a memory bridge 140. The memory bridge 140 may be, for example, a northbridge. The memory bridge 140 may also be connected to an input/output (I/O) bridge 150 via a bus or a communication channel. The I/O bridge 150 may be, for example, a southbridge, and may receive user input from a user input device (not shown) and forward it to the CPU 110 via the memory bridge 140.

The parallel processing unit 130 may be connected to the memory bridge 140 via a bus or a communication channel to communicate with the CPU 110 and the system memory 120. The parallel processing unit 130 may be, for example, a graphics processing unit (GPU).

In some embodiments, a system including the CPU 110, the system memory 120, the memory bridge 140, and the I/O bridge 150 may be referred to as a host.

FIG. 2 is a schematic block diagram of a parallel processing unit according to an embodiment of the present invention.

Referring to FIG. 2, the parallel processing unit includes one or more multiprocessors 200. The parallel processing unit may be, for example, a GPU, and the multiprocessor 200 may be, for example, a streaming multiprocessor (SM).

A series of instructions delivered to the multiprocessor 200 may form a thread. Such a thread may be an instance of a program. A fixed number of threads (for example, 32) execute simultaneously in the multiprocessor 200, and this fixed number of threads is called a thread group. Such a thread group may be called, for example, a "warp", a "container", or a "wavefront"; for convenience, a thread group is referred to below as a warp. In some embodiments, a thread group may be a group of threads that concurrently execute the same program on different input data. A plurality of warps may be active at the same time within the multiprocessor 200, and the threads of these warps may be referred to as a cooperative thread array (CTA).

The multiprocessor 200 includes an instruction buffer 210, a warp scheduler 220, a plurality of processing units 230, a plurality of load-store units (LSUs) 240, and a memory unit 250.

The instruction buffer 210 stores the instructions delivered to the multiprocessor 200. The warp scheduler 220 schedules the instructions of the warps and, according to the schedule, selects and issues the instruction of the corresponding warp from the instruction buffer 210. In some embodiments, the multiprocessor may further include a fetch/decoder 205, which fetches and decodes the instructions delivered to the multiprocessor 200 and then stores them in the instruction buffer 210. In some embodiments, each active warp may have a dedicated set of entries in the instruction buffer 210. In some embodiments, the multiprocessor 200 may further include a scoreboard (not shown) for checking data hazards. In this case, when the scoreboard indicates that a warp has no hazard, the warp scheduler 220 may select the decoded instruction of that warp from the instruction buffer 210.

Each processing unit 230 processes the instructions issued by the warp scheduler 220. The processing unit 230 may support a variety of computing operations, for example integer and floating-point arithmetic, comparison operations, Boolean operations, bit shifting, and the computation of various algebraic functions. In some embodiments, the processing unit 230 may be, for example, an arithmetic logic unit (ALU). Each load-store unit 240 loads data from the memory unit 250 or stores data to the memory unit 250.

The memory unit 250 includes a level 1 data (L1D) cache 251 and a shared memory 252. In some embodiments, the memory unit 250 may be an on-chip memory unit of the multiprocessor 200. The L1D cache 251 and the shared memory 252 may share a single on-chip memory structure that includes, for example, 32 banks with 512 or 256 rows. N rows may be allocated to the L1D cache 251, and the remaining rows (for example, 512-N or 256-N rows) may be allocated to the shared memory 252. For example, each multiprocessor 200 may support a 64 KB on-chip memory structure in which 16 KB (N=128) or 48 KB (N=384) can be allocated to the L1D cache 251.
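The row arithmetic in the example above can be checked with a short sketch. The per-bank row width is not stated in the text; the 4-byte figure below is an assumption derived from 64 KB spread over 32 banks of 512 rows, and the function name `split_rows` is hypothetical.

```python
BANKS = 32
ROWS = 512
TOTAL_BYTES = 64 * 1024

# Assumption: row width per bank is derived from the 64 KB / 32 banks / 512 rows
# example; the source does not state it explicitly.
BYTES_PER_BANK_ROW = TOTAL_BYTES // (BANKS * ROWS)  # 4 bytes


def split_rows(n_l1d_rows, rows=ROWS):
    """Return (L1D bytes, shared-memory bytes) when N rows go to the L1D cache."""
    if not 0 <= n_l1d_rows <= rows:
        raise ValueError("N must lie between 0 and the total number of rows")
    row_bytes = BANKS * BYTES_PER_BANK_ROW  # one row spans all banks
    return n_l1d_rows * row_bytes, (rows - n_l1d_rows) * row_bytes


l1d_small, shared_large = split_rows(128)  # 16 KB L1D, 48 KB shared memory
l1d_large, shared_small = split_rows(384)  # 48 KB L1D, 16 KB shared memory
```

Under this assumption, N=128 and N=384 reproduce the 16 KB and 48 KB L1D sizes quoted in the text.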

The shared memory 252 may be managed by software. Each CTA may request an exclusive shared memory space for inter-thread communication. Accordingly, in some embodiments the multiprocessor 200 may hold an independent shared memory management table (SMMT) 261. Each CTA reserves one SMMT entry, and the entry records the start address and size of the shared memory space for the given CTA (CTAid). In some embodiments, the multiprocessor 200 may also connect the load-store units 240 and the memory unit 250 through a crossbar network 262.
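As a rough illustration of what an SMMT entry records, the sketch below keeps one (start address, size) pair per CTA. The class name, the method names, and the bump-pointer allocation policy are all assumptions made for illustration; the text only states that each entry holds a start address and a size for a given CTAid.

```python
class SharedMemoryTable:
    """Toy SMMT: one entry per CTA, recording (start address, size) in bytes."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.entries = {}    # cta_id -> (start, size)
        self._next_free = 0  # assumed bump-pointer allocation, not from the source

    def reserve(self, cta_id, size):
        """Reserve an exclusive region for a CTA and return its entry."""
        if cta_id in self.entries:
            raise ValueError("each CTA reserves exactly one entry")
        if self._next_free + size > self.capacity:
            raise MemoryError("not enough shared memory")
        self.entries[cta_id] = (self._next_free, size)
        self._next_free += size
        return self.entries[cta_id]

    def unused_bytes(self):
        """Space no CTA has reserved."""
        return self.capacity - self._next_free


smmt = SharedMemoryTable(48 * 1024)
entry0 = smmt.reserve(0, 4096)
entry1 = smmt.reserve(1, 8192)
```

The `unused_bytes` helper reflects the kind of leftover space that no CTA has claimed in this toy model.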

In one embodiment of the present invention, a plurality of cache lines are allocated to the shared memory 252, so that the cache accesses of some warps are redirected to the shared memory 252.

In some embodiments, the multiprocessor 200 may further include a hardware coalescing unit 263 and a miss status holding register (MSHR) 264, which merge the multiple memory requests generated by the threads of a warp into fewer but larger memory requests in order to improve utilization of memory bandwidth.

In such a multiprocessor 200, a plurality of warps share the L1D cache 251, which has limited capacity, so the warps may compete for the same cache line. Consequently, the data that one active warp has cached in the L1D cache 251 may be evicted by the cache access of another active warp requesting memory access, and data locality may be lost as a result. This phenomenon is called cache interference. Cache interference can become especially severe when the cache access pattern is irregular.

FIG. 3 is a diagram showing an example in which cache interference degrades data locality in the L1D cache of a typical multiprocessor.

As shown in FIG. 3, assume for example that two warps W0 and W1 repeatedly request data D0 and D4, respectively, in the same cache set (Set0) of the L1D cache. In cycle CA, warp W0, requesting data D0, evicts data D4 belonging to warp W1 from the L1D cache 270, and a cold miss occurs because data D0 is not in cache set Set0. In cycle CB, warp W1, requesting data D4, incurs a miss (a conflict miss) because data D4 is no longer in cache set Set0, and the request of warp W1 in turn evicts data D0 belonging to warp W0 from cache set Set0. Likewise, in cycle CE warp W0, requesting data D0, evicts data D4 belonging to warp W1 from cache set Set0, and in cycle CF warp W1, requesting data D4, evicts data D0 belonging to warp W0 from cache set Set0. Conflict misses therefore keep occurring.

Data locality exists when the same data is referenced multiple times during execution of an application, that is, a kernel. In FIG. 3, warps W0 and W1 keep generating memory requests in cycles CA, CB, CE, and CF to obtain data D0 and D4 repeatedly. These memory requests would have hit in the L1D cache if data D0 and D4 had not been evicted by the other warp (W1 and W0, respectively). Such a cache hit opportunity may be called the potential of data locality, and it is measured as the frequency with which the same data is re-referenced when no cache interference occurs. In FIG. 3, the two warps W0 and W1 both show a high potential of data locality, but the cache interference between them causes unnecessary cache misses, which can turn regular memory accesses into irregular cache accesses.
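The gap between actual hits and the potential of data locality in this example can be made concrete with a small replay. This is a minimal sketch under simplifying assumptions that are not in the source: the cache set holds a single line, it starts empty, and "no interference" is modeled by giving the listed warps a private line of their own. The function name `count_hits` is hypothetical.

```python
def count_hits(accesses, private=frozenset()):
    """Replay (warp, data) accesses against a one-line cache set.

    Warps named in `private` use a separate one-line store, which models
    the absence of interference from the other warps.
    """
    shared_set_line = None  # the single line of Set0 (assumed one-way)
    private_line = None     # the line used only by the `private` warps
    hits = 0
    for warp, data in accesses:
        if warp in private:
            if private_line == data:
                hits += 1
            private_line = data      # fill on miss, keep on hit
        else:
            if shared_set_line == data:
                hits += 1
            shared_set_line = data   # a miss evicts the other warp's data
    return hits


# W0 and W1 alternate, as in cycles CA, CB, CE, CF (repeated).
trace = [("W0", "D0"), ("W1", "D4")] * 4

actual_hits = count_hits(trace)                     # warps evict each other
potential_hits = count_hits(trace, private={"W1"})  # no cross-warp eviction
```

In this toy model every access in the shared set misses, whereas without interference each warp would miss only once and hit on every later re-reference.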

To reduce such cache interference, an interfering warp can be isolated from the warp it interferes with. To this end, in some embodiments the L1D cache space may be partitioned and the partitioned cache lines assigned to the interfering warp. Various techniques for partitioning a shared cache space have been proposed for CPUs. In a parallel processing unit such as a GPU, however, the number of threads sharing the L1D cache lines is far larger than the number of CPU threads, while the L1D cache is not large enough for CPU-based cache partitioning techniques to be applied. For example, if a CPU-based cache partitioning technique were applied to the L1D cache, only two or three cache lines could be assigned to each warp. The small number of cache lines per warp could then aggravate cache thrashing.

Meanwhile, when the shared memory usage of 21 applications from benchmark suites such as PolyBench, Mars, and Rodinia was analyzed, on average 75% of the shared memory was unused. In one embodiment of the present invention, this unused shared memory space is exploited so that an interfering warp accesses the unused shared memory space instead of the L1D cache.

A warp scheduling method for reducing cache interference in a parallel processing unit according to an embodiment of the present invention is described below.

도 4는 본 발명의 한 실시예에 따른 병렬 프로세싱 유닛에서의 워프 스케줄링 방법을 나타내는 흐름도이고, 도 5, 도 6 및 도 7은 각각 본 발명의 한 실시예에 따른 병렬 프로세싱 유닛에서의 워프 파티셔닝(warp partitioning)을 설명하는 도면이다.FIG. 4 is a flowchart illustrating a method of scheduling warps in a parallel processing unit according to an embodiment of the present invention. FIGS. 5, 6, and 7 are flowcharts of warp partitioning in a parallel processing unit according to an embodiment of the present invention warp partitioning.

도 4 및 도 5를 참고하면, 커널 실행의 초기에는 캐시 간섭이 없으므로, 모든 워프(W0, W1, W2, W3)의 메모리 요청은 L1D 캐시(510)로 향한다(S410). 커널 실행이 진행됨에 따라, 일부 워프가 L1D 캐시(510)에서 특정 캐시 라인을 획득하기 위해서 서로 경쟁하기 시작할 수 있다.4 and 5, since there is no cache interference at the beginning of kernel execution, memory requests of all warps W0, W1, W2 and W3 are directed to the L1D cache 510 (S410). As kernel execution proceeds, some warps may begin to compete with each other to obtain a particular cache line in the L1D cache 510.

When a predetermined condition is satisfied (S420), the multiprocessor of the parallel processing unit detects the interfering warp (for example, W3 requesting data D3 in FIGS. 5 and 6) (S430). In some embodiments, the predetermined condition may be that the intensity of cache interference in the L1D cache 510 exceeds a threshold. That is, once the intensity of cache interference exceeds the threshold, the multiprocessor can detect the interfering warp. Next, as shown in FIG. 6, the multiprocessor partitions the active warps into interfering warps and interfered warps, and isolates the interfering warp W3 by redirecting its cache accesses to the unused shared memory space 520 (S440). This can reduce cache contention without lowering thread-level parallelism (TLP) in the parallel processing unit.

When another predetermined condition is satisfied, for example because the cache access pattern changes or some warps (for example, the interfered warps) finish executing (S450), the multiprocessor withdraws the isolation of the interfering warp and redirects the memory requests of that warp from the shared memory 520 back to the L1D cache 510, as shown in FIG. 7 (S460). In some embodiments, the other predetermined condition may be that cache contention has decreased meaningfully, for example that the intensity of cache interference is lower than a threshold. That is, upon detecting that cache contention has decreased meaningfully, the multiprocessor lifts the isolation policy for the interfering warp and redirects its memory requests back to the L1D cache 510. In one embodiment, the threshold in step S450 may differ from the threshold in step S420; for example, it may be smaller than the threshold in step S420. In another embodiment, the threshold in step S450 may be equal to the threshold in step S420.
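The flow of steps S410 to S460 can be sketched as a small state machine with two thresholds. This is only an illustrative model, not the claimed hardware: the class name, the threshold values, and the use of a single scalar interference intensity are all assumptions made for the example.

```python
from dataclasses import dataclass

L1D, SHARED = "L1D", "SHARED"


@dataclass
class IsolationController:
    """Tracks where the memory requests of one interfering warp are routed."""
    isolate_threshold: float   # assumed threshold checked in step S420
    release_threshold: float   # assumed threshold checked in step S450
    target: str = L1D          # S410: requests initially go to the L1D cache

    def update(self, interference: float) -> str:
        if self.target == L1D and interference > self.isolate_threshold:
            self.target = SHARED   # S440: redirect to unused shared memory
        elif self.target == SHARED and interference < self.release_threshold:
            self.target = L1D      # S460: withdraw the isolation
        return self.target


ctrl = IsolationController(isolate_threshold=0.010, release_threshold=0.005)
trace = [ctrl.update(x) for x in (0.002, 0.012, 0.008, 0.004)]
print(trace)
```

Choosing a release threshold smaller than the isolation threshold gives hysteresis, so a warp whose interference hovers near a single threshold is not redirected back and forth on every check.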

In some embodiments, the multiprocessor may use an interference detector (not shown) to determine the number of cache contentions, that is, the intensity of cache interference, in steps S420 and S450.

The cache interference reduction method described with reference to FIGS. 4 to 7, that is, the warp partitioning method, may be performed by the warp scheduler (220 in FIG. 2) of the multiprocessor.

In some embodiments, a memory request (that is, a cache access request) may be delivered to the L1D cache 510 or the shared memory 520 through the crossbar network 530.

Meanwhile, when the multiprocessor redirects the memory requests of the interfering warp W3 from the L1D cache 510 to the shared memory 520, the shared memory 520 may not hold the data D3 requested by warp W3. In this case, warp W3 may suffer performance degradation from cold misses as well as a coherence problem. To address this, in some embodiments, when the shared memory 520 needs to be accessed and the target data D3 is present in the L1D cache 510, the multiprocessor may evict the data D3 from the L1D cache 510 directly to a predetermined queue 540, as shown in FIG. 6. In one embodiment, when the shared memory 520 needs to be accessed, the multiprocessor may check the tag array of the L1D cache 510 to determine whether the target data D3 is present in the L1D cache 510. The multiprocessor may then fill the shared memory 520 with the data D3 evicted to the predetermined queue 540, as shown in FIG. 7. Migrating data from the L1D cache 510 to the shared memory 520 in this way addresses both the cold cache misses and the coherence problem.
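The migration described above can be modeled in a few lines. In this sketch the L1D cache and the shared memory are plain dictionaries keyed by tag, and the queue 540 is a `deque`; the function names and these data structures are hypothetical simplifications, not part of the described hardware.

```python
from collections import deque


def migrate_on_redirect(tag, l1d, response_queue):
    """If the L1D cache holds the line, evict it straight into the queue."""
    if tag in l1d:                                   # tag-array check
        response_queue.append((tag, l1d.pop(tag)))   # eviction removes the L1D copy
        return True
    return False


def drain_to_shared(response_queue, shared):
    """Fill the shared memory with every line waiting in the queue."""
    while response_queue:
        tag, data = response_queue.popleft()
        shared[tag] = data


l1d_cache = {"D3": b"payload"}
queue_540 = deque()
shared_mem = {}

migrate_on_redirect("D3", l1d_cache, queue_540)
drain_to_shared(queue_540, shared_mem)
```

After the two calls, only the shared memory holds D3, so the redirected warp neither takes a cold miss on it nor sees a second, possibly stale, copy in the L1D cache.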

In some embodiments, the predetermined queue 540 may be a response queue that is used to buffer data fetched from the L2 cache and to invalidate cache lines in the L1D cache.

In some embodiments, the shared memory 520 may issue a fill request to a miss status holding register (MSHR, not shown), and the target data may be filled directly from the predetermined queue 540 into the shared memory 520 based on the MSHR entry.

According to the embodiment described above, using the unused shared memory space as an L1D cache effectively shields the interfered warps from the interfering warps. The effectiveness of this isolation, however, may depend on various runtime factors, such as the number of isolated warps and the amount of unused space in the shared memory. For example, cache interference may not be reduced effectively when the interfering warps thrash the shared memory. That is, the size and/or bandwidth of the unused shared memory space may not be sufficient to handle a large number of memory requests from the interfering warps within a short period.

Meanwhile, using the unused shared memory space 520 as the L1D cache 510 may raise two major problems.

The first is that the shared memory 520 has its own address space, separate from the global memory, and there is no hardware support for translating a global memory address into a shared memory address. To address this, in some embodiments, the multiprocessor may further include, in front of the shared memory, an address translation unit (not shown) that translates a given global memory address into a local memory address in the shared memory 520.
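The description does not specify how the address translation unit maps addresses. Purely as an illustration, the sketch below assumes a direct-mapped scheme in which the unused shared memory region is treated as an array of cache lines; the line size, the region parameters, and the function name are all hypothetical.

```python
LINE_SIZE = 128  # bytes per cache line (assumed value)


def translate(global_addr: int, region_base: int, region_bytes: int) -> int:
    """Map a global memory address to an address inside the unused
    shared memory region, using an assumed direct-mapped layout."""
    num_lines = region_bytes // LINE_SIZE
    line_index = (global_addr // LINE_SIZE) % num_lines   # which line slot
    offset = global_addr % LINE_SIZE                      # byte within the line
    return region_base + line_index * LINE_SIZE + offset


# Assumed example: a 4 KB unused region starting at shared address 0x1000.
shared_addr = translate(0x12345, region_base=0x1000, region_bytes=4096)
print(hex(shared_addr))
```

A real design would also need to keep the tag of the global address alongside each line so that two global addresses mapping to the same slot can be told apart.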

The second is that the shared memory 520 has no data path for directly accessing lower levels of the memory hierarchy, such as the L2 cache and main memory. To address this, in some embodiments, the multiprocessor may adapt the data path between the L1D cache 510 and the L2 cache (not shown) so that the shared memory 520 can access the L2 cache when the shared memory space operates as a cache. In one embodiment, the multiprocessor may further include a multiplexer 550 between the memory units 510 and 520 and a queue, such as the response queue 540, that buffers data from lower-level memory such as the L2 cache. The multiplexer 550 may selectively connect the queue to the L1D cache 510 or the shared memory 520. The multiprocessor may also store the shared memory address of a memory request delivered from the address translation unit. In one embodiment, the multiprocessor may add an extension field to each MSHR entry to store the shared memory address. In this case, when the shared memory 520 issues a fill request after a miss, the request reserves one MSHR entry by filling in the global memory address and the translated shared memory address. If a response from the L2 cache, that is, data evicted to the response queue 540, matches the global memory address recorded in the corresponding MSHR entry, the data may be stored directly in the shared memory 520 based on the translated shared memory address.
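The role of the extended MSHR entry and the multiplexer can be sketched as follows. This is a software model under stated assumptions: the `MshrEntry` and `FillRouter` names are hypothetical, memories are modeled as dictionaries, and an empty extension field is taken to mean that the fill belongs to the L1D cache.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MshrEntry:
    global_addr: int
    shared_addr: Optional[int] = None   # extension field; None means an L1D fill


class FillRouter:
    """Models the multiplexer choosing between the L1D cache and shared memory."""

    def __init__(self):
        self.mshr = {}     # global address -> MshrEntry
        self.l1d = {}      # global address -> data
        self.shared = {}   # shared memory address -> data

    def reserve(self, global_addr, shared_addr=None):
        """Reserve one MSHR entry when a fill request is issued after a miss."""
        self.mshr[global_addr] = MshrEntry(global_addr, shared_addr)

    def on_response(self, global_addr, data):
        """Route a response from the queue according to the matching MSHR entry."""
        entry = self.mshr.pop(global_addr, None)
        if entry is None:
            return None                      # no outstanding miss for this address
        if entry.shared_addr is not None:
            self.shared[entry.shared_addr] = data
            return "SHARED"
        self.l1d[global_addr] = data
        return "L1D"


router = FillRouter()
router.reserve(0x1000, shared_addr=0x40)   # miss from an isolated warp
router.reserve(0x2000)                     # miss from a regular warp
first = router.on_response(0x1000, b"D3")
second = router.on_response(0x2000, b"D0")
```

Keeping the translated address in the MSHR entry means the response path does not have to repeat the address translation when the data returns.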

FIG. 8 is a flowchart illustrating a warp scheduling method in a parallel processing unit according to another embodiment of the present invention, and FIGS. 9, 10, and 11 are diagrams each illustrating warp throttling in a parallel processing unit according to an embodiment of the present invention.

Referring to FIGS. 8 and 9, assume that the data requested by two warps (W0, W1) is mapped to cache sets (Set0, Set1), respectively, in the L1D cache 910, and that the memory requests of warp W3 have been redirected to the shared memory 920 so that the data requested by warp W3 is mapped to cache set Set0 in the shared memory 920. Assume also that the memory requests of warp W2, which causes interference on cache set Set1 of the L1D cache 910, have been redirected to cache set Set1 of the shared memory 920. In this case, if the memory requests of warp W4 are also redirected to cache set Set1 of the shared memory 920, the two warps (W2, W4) repeatedly access cache set Set1 of the shared memory 920 and cache interference occurs (S810). The interfering warps (W2, W4) can therefore end up thrashing the shared memory 920.

The multiprocessor monitors the intensity of shared memory interference among the warps redirected to the shared memory 920, that is, the isolated warps (W2, W3, W4) (S820). When the monitoring shows that a predetermined condition is satisfied (S830), the multiprocessor stalls (or throttles) a warp selected according to the predetermined condition (for example, W2 in FIG. 10), as shown in FIG. 10 (S840). In some embodiments, the predetermined condition may be that the intensity of shared memory interference exceeds a threshold. In some embodiments, the selected warp may be the warp that causes the most cache interference. In some embodiments, the stalling (S840) may continue until the intensity of shared memory interference falls below the threshold.

When another predetermined condition is satisfied, for example because the cache access pattern changes or some warps finish executing (S850), the multiprocessor reactivates the stalled warp W2, as shown in FIG. 11 (S860). The reactivated warp W2 can again access the corresponding cache set Set1 of the shared memory 920. The parallel processing unit can thus maintain high TLP and maximize utilization of the shared memory 920. In some embodiments, the other predetermined condition may be that the intensity of interference in the shared memory 920 has decreased, for example that the number of cache contentions in the shared memory 920 is lower than a threshold. In one embodiment, the threshold in step S850 may differ from the threshold in step S830; for example, it may be smaller than the threshold in step S830. In another embodiment, the threshold in step S850 may be equal to the threshold in step S830.
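Steps S820 to S860 can be sketched as one monitoring step that is called periodically. The function name, the per-warp interference counts, and the choice to reactivate all stalled warps at once are assumptions made for this example; the description only states that the stalled warp is reactivated.

```python
def throttle_step(interference, counts, stalled, high, low):
    """One monitoring step over the isolated warps.

    interference: current intensity of shared memory interference
    counts: isolated warp id -> interference events it caused (assumed metric)
    stalled: set of currently stalled warp ids (updated in place)
    """
    if interference > high:                                   # S830
        candidates = {w: c for w, c in counts.items() if w not in stalled}
        if candidates:
            stalled.add(max(candidates, key=candidates.get))  # S840: stall the worst
    elif interference < low and stalled:                      # S850
        stalled.clear()                                       # S860: reactivate
    return stalled


stalled_warps = set()
per_warp = {2: 9, 3: 1, 4: 7}   # W2 interferes most in this assumed trace
throttle_step(0.020, per_warp, stalled_warps, high=0.010, low=0.005)
print(sorted(stalled_warps))
```

Between the two thresholds the set of stalled warps is left unchanged, which matches the embodiment in which the threshold of step S850 is smaller than that of step S830.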

In some embodiments, the multiprocessor may use an interference detector (not shown) to determine the intensity of interference in steps S830 and S850. In one embodiment, the multiprocessor may use, in steps S830 and S850, the same interference detector that is used in steps S420 and S450 of FIG. 4. This is possible because an isolated warp does not compete for the L1D cache 910 with the regular warps that access the L1D cache 910; it interferes only with other isolated warps in the shared memory 920. Interference in the L1D cache 910 and interference in the shared memory 920 therefore do not affect each other, so the multiprocessor can share one interference detector. In another embodiment, the multiprocessor may use, in steps S830 and S850, an interference detector different from the one used in steps S420 and S450 of FIG. 4.

According to the embodiment described above, when the intensity of cache interference in the shared memory 920 is high, stalling some of the warps redirected to the shared memory 920 prevents the interfering warps from thrashing the shared memory.

Next, interference detection in a parallel processing unit according to an embodiment of the present invention is described.

FIG. 12 is a diagram illustrating a method of managing the most frequently interfering warps in a parallel processing unit according to an embodiment of the present invention, and FIG. 13 is a diagram illustrating cache interference in a parallel processing unit.

The cache accesses of some warps can severely interfere with the cache accesses of other warps, resulting in non-uniform cache interference. For applications with irregular cache access patterns, however, it is difficult to capture this non-uniform interference simply by inspecting the application code or the cache replacement policy. In some embodiments, interference can be detected by tracking, for each warp, the cache misses caused by every other warp. This requires a storage structure with n(n-1) entries, where n is the number of active warps on the multiprocessor, and can therefore incur a high storage cost.

Referring to FIG. 12, the multiprocessor may maintain an interference list, allocate one entry of the interference list to each warp, and store, in the entry for the currently executing warp, the warp number (WID) of the warp that has recently interfered with it most frequently. In some embodiments, the multiprocessor may track, for each warp, the warp that has recently interfered with it most frequently. In one embodiment, the multiprocessor may track only that warp for each warp. This reduces the storage cost of tracking interfering warps for each warp.
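The storage saving can be made concrete by comparing the entry counts of the two schemes. The warp count of 48 used here is an assumed value for illustration only.

```python
def pairwise_entries(n: int) -> int:
    """Entries needed to track every (victim, interferer) pair of warps."""
    return n * (n - 1)


def interference_list_entries(n: int) -> int:
    """Entries needed when each warp records only its most frequent interferer."""
    return n


n_active = 48  # assumed number of active warps on one multiprocessor
print(pairwise_entries(n_active), interference_list_entries(n_active))
```

With the assumed 48 active warps, full pairwise tracking needs 2256 entries, while the interference list needs 48.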

To track the warp that has recently interfered most frequently with a given warp, the multiprocessor may further include a saturation counter associated with each entry of the interference list. In one embodiment, the saturation counter may be a 2-bit saturation counter.

As shown in FIG. 13, assume that among the warps (W32, W36, W38, W40, W42) that have recently interfered with warp W34, warp W32 interferes most frequently, and that the remaining warps (W2-W4) do not interfere with warp W34 at all. When the multiprocessor determines that the previously executed warp W32 interferes most frequently with the currently executing warp W34, it stores the WID of the interfering warp W32 in the interference list entry associated with warp W34, as shown in FIG. 12. Each time warp W32 interferes with warp W34, the saturation counter is incremented by 1. Assume that the saturation counter has already reached its saturation value (for example, '11') in some cycle (S1310). When another warp W42 interferes with warp W34 in a following cycle, the saturation counter is decremented by 1 (S1320). When warp W32 then interferes with warp W34 again, the saturation counter is incremented by 1 (S1330). In this way, if interference from warps other than the warp W32 stored in the entry for warp W34 drives the saturation counter down to '00', warp W32 in the interference list may be replaced with another warp. The most frequent source of cache interference is thereby kept in the interference list.
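The update rule for one interference list entry can be sketched as follows. The class name is hypothetical, and two details are assumptions because the description leaves them open: a newly stored WID starts with a counter of 0, and replacement happens on the next interference by a different warp once the counter is at '00'.

```python
COUNTER_MAX = 0b11  # saturation value of a 2-bit counter


class InterferenceEntry:
    """Interference list entry for one warp: a WID plus a saturation counter."""

    def __init__(self):
        self.wid = None     # WID of the most frequent recent interferer
        self.counter = 0

    def record(self, interfering_wid: int) -> None:
        if interfering_wid == self.wid:
            # The stored warp interfered again: count up, saturating at '11'.
            self.counter = min(self.counter + 1, COUNTER_MAX)
        elif self.wid is None or self.counter == 0:
            # Empty entry, or counter already at '00': replace the stored warp.
            self.wid = interfering_wid
            self.counter = 0
        else:
            # A different warp interfered: count down.
            self.counter -= 1


entry_w34 = InterferenceEntry()
for wid in (32, 32, 32, 32):   # W32 is stored, then the counter rises to '11'
    entry_w34.record(wid)
entry_w34.record(42)           # S1320: another warp interferes, counter drops
entry_w34.record(32)           # S1330: W32 interferes again, counter rises
```

Because a single interference by another warp only decrements the counter, an occasional interferer such as W42 cannot displace a persistent one such as W32.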

In some embodiments, the multiprocessor may detect the intensity (level) of cache interference experienced by an individual warp using an individual re-reference score (IRS). The IRS of warp i (IRS_i) may be expressed, for example, as in Equation 1.

$$\mathrm{IRS}_i = \frac{N^{i}_{\text{VTA-hits}}}{N_{\text{executed-inst}} / N_{\text{active-warp}}} \qquad (1)$$

Here, i is the number of an active warp, N^i_VTA-hits is the number of victim tag array (VTA) hits for warp i, N_executed-inst is the total number of executed instructions, and N_active-warp is the number of active warps running on the multiprocessor.

In one embodiment, the VTA may be included in the memory unit of the multiprocessor to maintain the cache access history of each warp. Referring to FIG. 14, the VTA includes a plurality of entries corresponding respectively to a plurality of warps, and tags of the corresponding warp may be stored in the fields of each entry.

The multiprocessor may attach the WID of the corresponding warp to each cache line of the L1D cache. The WID attached to a cache line can be used to track which active warp brought the current data into that cache line. When a cache line is evicted, the multiprocessor retrieves the WID from the evicted cache line and stores the tag of the evicted cache line in the VTA entry associated with that WID. Then, whenever a miss occurs in the L1D cache, the multiprocessor can look up the VTA. If the tag is found in the VTA entry of the corresponding warp, the multiprocessor counts a VTA hit. That is, when the multiprocessor observes through the VTA that a warp repeatedly issues memory requests to the same cache line (that is, observes VTA hits), it can determine that the warp exhibits potential data locality.
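The VTA bookkeeping described above can be sketched as a small class. The class and method names are hypothetical, and the number of tags kept per warp is an assumed capacity; the description does not state one.

```python
from collections import defaultdict, deque


class VictimTagArray:
    """Per-warp history of tags evicted from the L1D cache."""

    def __init__(self, tags_per_warp: int = 8):   # capacity is an assumption
        self.entries = defaultdict(lambda: deque(maxlen=tags_per_warp))
        self.hits = defaultdict(int)

    def on_eviction(self, owner_wid: int, tag: int) -> None:
        """Store the evicted line's tag in the entry of the warp that fetched it."""
        self.entries[owner_wid].append(tag)

    def on_l1d_miss(self, wid: int, tag: int) -> bool:
        """On an L1D miss, count a VTA hit if this warp lost that tag earlier."""
        if tag in self.entries[wid]:
            self.hits[wid] += 1
            return True
        return False


vta = VictimTagArray()
vta.on_eviction(owner_wid=34, tag=0xA)   # a line fetched by W34 is evicted
hit = vta.on_l1d_miss(wid=34, tag=0xA)   # W34 misses on the same line again
```

A VTA hit therefore marks a miss that would have been a cache hit had the line not been evicted, which is the signal used to measure interference per warp.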

Referring again to Equation 1, IRS_i indicates the number of VTA hits per instruction for warp i (that is, the intensity of VTA hits). A high IRS_i therefore indicates that warp i is experiencing severe cache interference during a given epoch. Because N_executed-inst and N_active-warp can change over time, in one embodiment IRS_i may be updated periodically to reflect (N_executed-inst / N_active-warp) for a given epoch. Based on IRS_i for a given epoch, the warp scheduler can decide whether to (1) partition (that is, isolate) the warp that interferes with warp i, (2) stall the interfering warp, or (3) reactivate a previously stalled warp.
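Reading the score as the VTA hits of warp i divided by the number of executed instructions per active warp, which is how the paragraph above describes it, the per-epoch computation can be sketched as follows. The function name, the zero-guard, and the example numbers are assumptions for illustration.

```python
def irs(vta_hits: int, executed_inst: int, active_warps: int) -> float:
    """Individual re-reference score: VTA hits per instruction executed
    per active warp, for one warp over one epoch."""
    if executed_inst == 0 or active_warps == 0:
        return 0.0   # nothing executed yet; treat as no interference (assumption)
    return vta_hits / (executed_inst / active_warps)


# Assumed epoch: 4800 instructions executed by 48 active warps, 3 VTA hits for warp i.
score = irs(vta_hits=3, executed_inst=4800, active_warps=48)
print(f"IRS = {score:.3f}")
```

In this assumed epoch each warp executed 100 instructions on average, so three VTA hits give a score of about 0.03, that is, 3%.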

In some embodiments, the warp scheduler may use two thresholds, (1) a high cutoff and (2) a low cutoff, to decide whether to partition, stall, or reactivate an interfering warp. An IRS_i above the high cutoff indicates that warp i is experiencing a high level of cache interference. Thus, if IRS_i is higher than the high cutoff, the warp scheduler may partition or stall the warp that has recently interfered most frequently with warp i. An IRS_i below the low cutoff indicates that warp i is experiencing a low level of cache interference or has finished executing. Thus, if IRS_i is lower than the low cutoff, the warp scheduler may reactivate a previously stalled warp or withdraw the isolation of a previously isolated warp (that is, redirect the memory requests of the previously isolated warp back to the L1D cache).

Based on an evaluation of diverse memory-intensive applications under the conditions illustrated in Table 1 below, the high cutoff and the low cutoff may be set to, for example, approximately 1% and 0.5%, respectively.

Because the IRS changes over time, the warp scheduler may periodically compare the IRS against the high cutoff and the low cutoff to determine accurately whether a warp needs to be partitioned, stalled, or reactivated. To this end, in some embodiments, the warp scheduler may divide the execution time into high-cutoff epochs and low-cutoff epochs. As the cycle consisting of a high-cutoff epoch and a low-cutoff epoch repeats, the warp scheduler may compare the IRS with the high cutoff at the end of each high-cutoff epoch and with the low cutoff at the end of each low-cutoff epoch.

In one embodiment, the low-cutoff epoch may be shorter than the high-cutoff epoch. With a shorter low-cutoff epoch, previously stalled warps can be reactivated as soon as they stop interfering significantly with other warps at runtime, which minimizes the negative impact of stalling warps. High TLP can thus be maintained.

Based on an evaluation of diverse memory-intensive applications under the conditions illustrated in Table 1 below, the lengths of the high-cutoff epoch and the low-cutoff epoch may be set to, for example, approximately 5000 and 100 instructions, respectively.
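One possible reading of the repeating cycle is that a high-cutoff epoch and a low-cutoff epoch simply alternate. Under that assumption, the points at which the scheduler compares the IRS with each cutoff can be enumerated as follows; the generator name and the alternating layout are hypothetical, while the default lengths are the example values given above.

```python
def epoch_checks(total_inst: int, high_len: int = 5000, low_len: int = 100):
    """Yield (instruction_count, cutoff_kind) at the end of each epoch,
    assuming high-cutoff and low-cutoff epochs strictly alternate."""
    t = 0
    while True:
        t += high_len
        if t > total_inst:
            return
        yield t, "high"   # compare the IRS with the high cutoff here
        t += low_len
        if t > total_inst:
            return
        yield t, "low"    # compare the IRS with the low cutoff here


checks = list(epoch_checks(10300))
print(checks)
```

Because the low-cutoff epoch is short, a decision to isolate or stall a warp made at a high-cutoff check is revisited only 100 instructions later.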

Next, a warp scheduling method using the high cutoff and the low cutoff is described with reference to FIGS. 15 and 16.

FIGS. 15 and 16 are flowcharts each illustrating a warp scheduling method in a parallel processing unit according to an embodiment of the present invention.

Referring to FIG. 15, the multiprocessor of the parallel processing unit detects the level of cache interference for an active warp i (S1510). In some embodiments, the multiprocessor may detect the cache interference level at the end of a high-cutoff epoch. When the cache interference level of warp i satisfies a predetermined condition (S1520), the multiprocessor detects the warp j that interferes most frequently with warp i (S1530). In some embodiments, the predetermined condition may be that the cache interference level, for example IRS_i, is higher than the high cutoff. The multiprocessor then isolates the cache accesses of warp j to the shared memory (S1540). In some embodiments, the multiprocessor may record the warp number of warp i, on whose behalf warp j was isolated.

Next, the multiprocessor detects the cache interference level for the warp that caused warp j to be isolated (that is, warp i) (S1550). In some embodiments, the multiprocessor may detect the cache interference level at the end of a low-cutoff epoch. When the cache interference level of warp i satisfies another predetermined condition (S1560), the multiprocessor withdraws the isolation of warp j (S1570). Warp j can then access the L1D cache again. In one embodiment, the other predetermined condition may be that the cache interference level of warp i, for example IRS_i, is lower than the low cutoff. In another embodiment, the other predetermined condition may be that warp i has finished executing. Conversely, when the other predetermined condition is not satisfied (S1560), the multiprocessor continues to isolate the cache accesses of warp j (S1570).

By isolating warp j, which frequently interferes with warp i, in this way, cache interference can be reduced while TLP is maintained.

Referring to FIG. 16, the multiprocessor of the parallel processing unit detects the level of cache interference for a warp i (S1610). Warp i may be a warp that accesses the L1D cache or a warp that accesses the shared memory. In some embodiments, the multiprocessor may detect the cache interference level at the end of a high-cutoff epoch. When the cache interference level of warp i satisfies a predetermined condition (S1620), the multiprocessor detects the warp j that interferes most frequently with warp i (S1630). In some embodiments, the predetermined condition may be that the cache interference level, for example IRS_i, is higher than the high cutoff.

The multiprocessor determines whether warp j is in the isolated state (S1640). If warp j is in the isolated state (S1640), the multiprocessor stalls warp j (S1642). In some embodiments, the multiprocessor may record the warp number of warp i, which caused warp j to be stalled. If warp j is not in the isolated state, that is, if warp j is in the active state (S1640), the multiprocessor isolates the cache accesses of warp j to the shared memory (S1644). In some embodiments, the multiprocessor may record the warp number of warp i, which caused warp j to be isolated.

Next, the multiprocessor detects the cache interference level for the warp that caused warp j to be isolated (i.e., warp i) (S1650). In some embodiments, the multiprocessor may detect the cache interference level when a low cutoff period ends. If the cache interference level of warp i satisfies another predetermined condition (S1660), the multiprocessor withdraws the isolation of warp j (S1670), so that warp j can access the L1D cache again. In one embodiment, the other predetermined condition may be that the cache interference level of warp i, for example ISR_i, is lower than the low cutoff. In another embodiment, the other predetermined condition may be that warp i has finished executing. If the other predetermined condition is not satisfied (S1660), the multiprocessor keeps the cache accesses of warp j isolated.

Alternatively, the multiprocessor detects the cache interference level for the warp that caused warp j to be stalled (i.e., warp i) (S1680). In some embodiments, the multiprocessor may detect the cache interference level when a low cutoff period ends. If the cache interference level of warp i satisfies another predetermined condition (S1690), the multiprocessor reactivates warp j (S1695). In one embodiment, the other predetermined condition may be that the cache interference level of warp i, for example ISR_i, is lower than the low cutoff. In another embodiment, the other predetermined condition may be that warp i has finished executing. If the other predetermined condition is not satisfied (S1690), the multiprocessor keeps warp j stalled.

In this way, when the isolated warp j causes cache interference again, stalling it reduces cache interference. In addition, when cache interference decreases, reactivating the stalled warp j maintains TLP.
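The flow of FIG. 16 can be read as a small state machine per warp. The following Python sketch is only an illustration under stated assumptions: the cutoff values, the class and function names, and the choice that a reactivated warp returns to the isolated state (rather than directly to the active state) are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    ACTIVE = "active"      # accesses the L1D cache
    ISOLATED = "isolated"  # cache accesses redirected to shared memory
    STALLED = "stalled"    # not scheduled


@dataclass
class Warp:
    wid: int
    state: State = State.ACTIVE
    isolated_by: Optional[int] = None  # warp i that triggered isolation
    stalled_by: Optional[int] = None   # warp i that triggered stalling


# Hypothetical threshold values, chosen only for the example.
HIGH_CUTOFF = 0.01
LOW_CUTOFF = 0.005


def on_high_period_end(level_i: float, i: int, j: Warp) -> None:
    """S1620-S1644: throttle warp j when warp i suffers heavy interference."""
    if level_i <= HIGH_CUTOFF:
        return
    if j.state is State.ISOLATED:
        j.state, j.stalled_by = State.STALLED, i      # S1642
    elif j.state is State.ACTIVE:
        j.state, j.isolated_by = State.ISOLATED, i    # S1644


def on_low_period_end(j: Warp, trigger_level: float,
                      trigger_done: bool = False) -> None:
    """S1650-S1695: undo one throttling step once the triggering warp recovers.

    trigger_level is the interference level of the warp recorded in
    j.stalled_by (if j is stalled) or j.isolated_by (if j is isolated).
    """
    if not (trigger_level < LOW_CUTOFF or trigger_done):
        return  # keep warp j isolated or stalled
    if j.state is State.STALLED:
        # Assumption: a reactivated warp goes back to the isolated state.
        j.state, j.stalled_by = State.ISOLATED, None  # S1695
    elif j.state is State.ISOLATED:
        j.state, j.isolated_by = State.ACTIVE, None   # S1670
```

Using two different cutoffs gives the scheme hysteresis: a warp is throttled only above the high cutoff and released only below the low cutoff, which avoids toggling its state on small fluctuations.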

Next, the structure of a multiprocessor in a parallel processing unit according to an embodiment of the present invention is described.

FIG. 17 is a schematic block diagram of a multiprocessor in a parallel processing unit according to an embodiment of the present invention. FIG. 17 shows the warp scheduler and the memory unit of the multiprocessor.

Referring to FIG. 17, the warp scheduler 1710 of the multiprocessor includes a counter 1711 that detects the level of cache interference. In some embodiments, the counter 1711 may include a plurality of VTA hit counters (VTACount0-VTACountk) and a total instruction counter (Inst-total). In this case, one VTA hit counter may be provided per warp. Each VTA hit counter (VTACounti) records the number of VTA hits for the corresponding warp (Wi), and the total instruction counter (Inst-total) records the total number of instructions executed by the multiprocessor (N_{executed-inst}). Using the VTA hit counters (VTACount0-VTACountk) and the total instruction counter (Inst-total), the multiprocessor can detect the level of cache interference experienced by each individual warp. In some embodiments, the level of cache interference may be expressed as an IRS.
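As an illustration of how these counters might be combined, the Python sketch below normalizes the per-warp VTA hit count by the number of instructions executed per active warp. This particular formula is an assumption made for the example; this passage names the counters but does not restate how the level is computed, and the class and method names are hypothetical.

```python
from collections import defaultdict


class InterferenceCounters:
    """Per-warp VTA hit counters plus one total instruction counter."""

    def __init__(self) -> None:
        self.vta_hits = defaultdict(int)  # VTACount_i, one per warp
        self.inst_total = 0               # Inst-total (N_{executed-inst})

    def on_vta_hit(self, wid: int) -> None:
        self.vta_hits[wid] += 1

    def on_instruction(self, count: int = 1) -> None:
        self.inst_total += count

    def level(self, wid: int, n_active_warps: int) -> float:
        """Assumed metric: VTA hits / (instructions per active warp)."""
        if self.inst_total == 0 or n_active_warps == 0:
            return 0.0
        return self.vta_hits[wid] / (self.inst_total / n_active_warps)
```

With 5 VTA hits for a warp, 1000 executed instructions, and 10 active warps, this sketch yields 5 / (1000 / 10) = 0.05.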

The warp scheduler 1710 may further include a cutoff check unit 1712 that compares the level of cache interference with thresholds. In some embodiments, the level of cache interference is the IRS, and the thresholds may include a high cutoff and a low cutoff. The cutoff check unit 1712 then compares the IRS with the high cutoff and the low cutoff. In one embodiment, the cutoff check unit 1712 may be implemented using registers, a shifting unit, and comparison logic. The warp scheduler may further include a sampler 1713, which counts the number of executed instructions to determine whether a high cutoff period or a low cutoff period has ended.

The warp scheduler 1710 includes lists for managing interfering warps. In some embodiments, the lists may include an interference list 1714, a warp list 1715, and a pair list 1716.

The interference list 1714 may be used to manage the information needed to track interfering warps. The interference list 1714 includes a plurality of entries, each assigned to one of a plurality of warps. Each entry is indexed by the WID of its assigned warp and stores the WID of an interfering warp and a saturation counter (C). Whenever a VTA hit occurs, the corresponding entry of the interference list may be updated. For example, each time the warp whose WID is stored in the entry causes interference, the saturation counter (C) is incremented by 1, and each time a warp with a different WID causes interference, the saturation counter (C) is decremented by 1. In this case, once the saturation counter (C) reaches its saturation value (for example, '11') it is not incremented further, and when the saturation counter (C) reaches a predetermined value (for example, '00') the WID of the entry may be replaced with the WID of another warp.
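The update rule for an interference list entry can be sketched as follows. This is a minimal Python illustration, assuming a 2-bit counter, an initially empty entry that behaves like a counter at '00', and replacement by the warp that caused the latest interference; the names are hypothetical.

```python
from typing import List, Optional

SATURATION = 0b11  # 2-bit saturation counter, as in the '11' example


class InterferenceList:
    """One entry per warp: WID of the main interferer and a counter C."""

    def __init__(self, num_warps: int) -> None:
        self.wid: List[Optional[int]] = [None] * num_warps
        self.count: List[int] = [0] * num_warps

    def on_vta_hit(self, victim: int, interferer: int) -> None:
        """Update the victim's entry when `interferer` evicted its data."""
        if self.wid[victim] == interferer:
            # Same interferer again: count up, but never past saturation.
            self.count[victim] = min(self.count[victim] + 1, SATURATION)
        else:
            # A different warp interfered: count down, never below '00'.
            self.count[victim] = max(self.count[victim] - 1, 0)
            if self.count[victim] == 0:
                # Assumption: the entry now tracks the latest interferer.
                self.wid[victim] = interferer
```

The counter acts as a simple majority filter: a single interference event from another warp does not displace a warp that has been interfering repeatedly.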

The interference list 1714 may be checked whenever an interfering warp needs to be partitioned based on the ISR. To this end, the warp list 1715, which includes a plurality of ready warp entries, includes a state flag attached to each ready warp entry. The warp scheduler can use the state flag to identify the state of a given warp, that is, whether the given warp is in the active state or the isolated state. For example, the state flag includes an isolation flag (I); the warp is identified as being in the active state when the isolation flag is 0 (I=0) and in the isolated state when the isolation flag is 1 (I=1).

In another embodiment, the warp scheduler may additionally manage a stalled state. In this case, the interference list 1714 may be checked whenever an interfering warp needs to be partitioned or stalled based on the ISR. In some embodiments, the state flag may include an active flag (V) and an isolation flag (I). In one embodiment, the active flag and the isolation flag may each be 1 bit. The warp scheduler can use the state flags (V, I) to identify the state of a given warp, that is, whether the given warp is in the active state, the isolated state, or the stalled state. For example, the warp may be identified as being in the active state when the active flag is 1 and the isolation flag is 0 (V=1, I=0), in the isolated state when both flags are 1 (V=1, I=1), and in the stalled state when the active flag is 0 (V=0).

The pair list 1716 includes a plurality of entries, each assigned to one of a plurality of warps and indexed by the WID of its assigned warp. Each entry records which interfered warp triggered warp partitioning in the past. To this end, in some embodiments each entry may include a field for recording the WID of the interfered warp that triggered the warp partitioning. Based on the WID in the field of the entry corresponding to warp i, the warp scheduler checks the ISR (ISR_k) of the interfered warp k that previously triggered the partitioning of warp i. For example, when warp W0 suffers severe interference from warp W1, the warp scheduler may decide to partition warp W1 and isolate its cache accesses. Then, as shown in FIG. 17, the WID of warp W0 is recorded in the field of the pair list entry corresponding to warp W1, and the isolation flag (I) of the interference list entry corresponding to warp W1 is set. After being partitioned, warp W1 begins to access the shared memory. Later, when the warp scheduler needs to withdraw the isolation of warp W1, the field of the pair list entry corresponding to warp W1 and the isolation flag (I) of the interference list entry corresponding to warp W1 are cleared.

In another embodiment, each entry of the pair list 1716 may additionally record which interfered warp triggered warp stalling in the past. To this end, in some embodiments each entry may include two fields. The first field records the WID of the interfered warp that triggered warp partitioning, and the second field records the WID of the interfered warp that triggered warp stalling. Based on the WIDs in the two fields of the entry corresponding to warp i, the warp scheduler checks the ISR (ISR_k) of the interfered warp k that previously triggered the partitioning or stalling of warp i. For example, when warp W0 suffers severe interference from warp W1, the warp scheduler may decide to partition warp W1 and isolate its cache accesses. Then, as shown in FIG. 17, the WID of warp W0 is recorded in the first field of the pair list entry corresponding to warp W1, and the isolation flag (I) of the interference list entry corresponding to warp W1 is set. After being partitioned, warp W1 begins to access the shared memory. If the warp scheduler then determines that warp W1 severely interferes with another warp W3 that is accessing the shared memory, it may stall warp W1. When the warp scheduler decides to stall warp W1, the WID of warp W3 is recorded in the second field of the pair list entry corresponding to warp W1, and the active flag (V) of the interference list entry corresponding to warp W1 is cleared. Later, when the warp scheduler needs to reactivate warp W1, the second field of the pair list entry corresponding to warp W1 is cleared and the active flag (V) of the interference list entry corresponding to warp W1 is set again. Likewise, when the warp scheduler needs to withdraw the isolation of warp W1, the first field of the pair list entry corresponding to warp W1 and the isolation flag (I) of the interference list entry corresponding to warp W1 are cleared.
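The W0/W1/W3 example can be traced with a small bookkeeping sketch. The Python code below is illustrative only: the class and method names are hypothetical, the flags are kept in plain per-warp arrays, and -1 is assumed as the "cleared" value of a pair list field.

```python
CLEARED = -1  # assumed marker for an empty pair list field


class WarpBookkeeping:
    """State flags (V, I) and a two-field pair list entry per warp."""

    def __init__(self, num_warps: int) -> None:
        self.v = [1] * num_warps  # active flag V
        self.i = [0] * num_warps  # isolation flag I
        # pair[w][0]: warp that triggered partitioning of w
        # pair[w][1]: warp that triggered stalling of w
        self.pair = [[CLEARED, CLEARED] for _ in range(num_warps)]

    def isolate(self, wid: int, victim: int) -> None:
        self.pair[wid][0] = victim
        self.i[wid] = 1

    def stall(self, wid: int, victim: int) -> None:
        self.pair[wid][1] = victim
        self.v[wid] = 0

    def reactivate(self, wid: int) -> None:
        self.pair[wid][1] = CLEARED
        self.v[wid] = 1

    def withdraw_isolation(self, wid: int) -> None:
        self.pair[wid][0] = CLEARED
        self.i[wid] = 0

    def state(self, wid: int) -> str:
        if self.v[wid] == 0:
            return "stalled"                      # V=0
        return "isolated" if self.i[wid] else "active"  # V=1, I=1 / I=0
```

For instance, `isolate(1, victim=0)` followed by `stall(1, victim=3)` leaves the entry of W1 as `[0, 3]` with the warp reported as stalled; `reactivate(1)` then returns it to the isolated state because the isolation flag is still set.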

The memory unit 1720 of the multiprocessor includes an L1D cache 1721 and a shared memory 1722. Each cache line of the L1D cache 1721 stores data and a tag, and the WID of the corresponding warp may be attached to it. The unused space of the shared memory 1722 is used as a cache; each cache line in that space may store data and a tag, and the WID of the corresponding warp may be attached to the tag. In some embodiments, the memory unit 1720 may further include a VTA 1730 that, upon a cache line miss, records the tag in the entry of the corresponding warp.

The warp scheduler 1710 and the memory unit 1720 may be connected through a load-store unit 1730.

FIG. 18 is a diagram illustrating the operation of a memory unit in a parallel processing unit according to an embodiment of the present invention, FIG. 19 is a diagram illustrating the structure of a memory unit in a parallel processing unit according to an embodiment, and FIG. 20 is a diagram illustrating address translation in a parallel processing unit according to an embodiment.

Referring to FIG. 18, the multiprocessor includes a memory unit 1810, control logic 1820, a multiplexer 1830, and a queue 1840, and the memory unit 1810 includes an L1D cache 1811 and a shared memory 1812.

Because the shared memory 1812 is logically separate from the global memory, the multiprocessor sets up a data path between the shared memory 1812 and the L2 cache when the shared memory 1812 is used as a cache. To this end, the multiplexer 1830 selectively connects the queue 1840, which buffers data fetched from the global memory or the L2 cache, to either the L1D cache 1811 or the shared memory 1812. The queue 1840 may include a write queue 1841 and a response queue 1842.

The control logic 1820 controls the multiplexer 1830 based on the state of a given warp. If the given warp is in the isolated state (for example, if the isolation flag (I) of FIG. 17 is '1'), the control logic 1820 may control the multiplexer 1830 so that a data path is formed between the shared memory 1812 and the queue 1840. When controlling the multiplexer 1830, the control logic 1820 may also refer to the result of the cache tag check (hit/miss).

The multiprocessor may also store the shared memory address (SHM Addr) of a memory request coming from the address translation unit. In some embodiments, the multiprocessor may add an extension field to each MSHR entry to store the shared memory address of the memory request from the address translation unit. Each MSHR entry may further include an address valid flag (V), an instruction ID (for example, a program counter (PC)), and a global memory address (Gl Addr).

When the shared memory 1812 issues a fill request after a miss, the request reserves one MSHR entry by filling in the global memory address and the translated shared memory address. If a response from the L2 cache matches the global memory address recorded in the corresponding MSHR entry, the data can be stored directly into the shared memory 1812 based on the translated shared memory address.

Unlike the L1D cache 1811, the shared memory 1812 has no separate memory structure for holding cache tags. In one embodiment, an additional tag array may be employed to hold the cache tags of the shared memory 1812.

In some embodiments, instead of an additional tag array, a plurality of cache data blocks (i.e., cache lines) and their respective tags may be placed in the shared memory 1812. For example, a cache data block may be a 128-byte cache data block. This minimizes changes to the on-chip memory unit 1810 composed of the L1D cache 1811 and the shared memory 1812. To this end, as shown in FIG. 19, the plurality of shared memory banks (for example, 32 shared memory banks) included in the shared memory 1812 are divided into two or more bank groups. In one embodiment, the number of bank groups may be two. Within one bank group, each cache data block is striped in the row direction across the plurality of banks (for example, 16 banks). Because each shared memory bank in a typical parallel processing unit allows 64-bit access, the 16 banks of a bank group together supply 16 × 8 bytes = 128 bytes, so a cache data block, for example a 128-byte cache data block, can be accessed in parallel.

A tag and its WID require only a limited number of bits, for example 31 bits when the tag uses 25 bits and the warp number (WID) uses 6 bits, so the tags for the cache lines of the shared memory 1812 can be provided within a limited amount of shared memory space. For example, if a shared memory bank allows 64-bit access and a tag with its WID uses 31 bits, two tags can be placed in each shared memory bank row. In one embodiment, the bank that holds the two tags may be different from the banks that store the corresponding data block (for example, a 128-byte data block). That is, the bank group containing the bank that holds a tag may be different from the bank group containing the banks that store the corresponding data block. Placing a tag and its corresponding data block in different bank groups avoids bank conflicts and thus allows the tag and the data block to be accessed in parallel. Furthermore, by using the unused shared memory space as a direct-mapped cache, a data block and its corresponding tag can be accessed as a pair with a single shared memory access.
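The bit budget in this example can be checked with a short packing sketch. The Python code below assumes, purely for illustration, that each 31-bit tag-plus-WID entry occupies one 32-bit half of a 64-bit bank word and that the WID sits in the low 6 bits of an entry; the description does not specify the exact layout, and the function names are hypothetical.

```python
TAG_BITS = 25
WID_BITS = 6
ENTRY_BITS = TAG_BITS + WID_BITS  # 31 bits per tag entry
SLOT_BITS = 32                    # assumed: one entry per 32-bit half word


def pack_entry(tag: int, wid: int) -> int:
    """Combine a 25-bit tag and a 6-bit WID into one 31-bit entry."""
    if not (0 <= tag < (1 << TAG_BITS) and 0 <= wid < (1 << WID_BITS)):
        raise ValueError("tag or WID out of range")
    return (tag << WID_BITS) | wid


def pack_word(entry0: int, entry1: int) -> int:
    """Place two entries side by side in a 64-bit bank word."""
    return entry0 | (entry1 << SLOT_BITS)


def unpack_word(word: int, slot: int) -> tuple:
    """Return (tag, wid) stored in slot 0 or slot 1 of a bank word."""
    entry = (word >> (slot * SLOT_BITS)) & ((1 << ENTRY_BITS) - 1)
    return entry >> WID_BITS, entry & ((1 << WID_BITS) - 1)
```

Two 31-bit entries use 62 of the 64 bits of a bank word, which is why two tags fit in each bank row in this example.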

Meanwhile, the unused space of the shared memory 1812 may vary depending on how a kernel is implemented. In addition, because each CTA requires a different amount of the shared memory 1812, the usage of the shared memory 1812 may change as kernels execute. The multiprocessor may maintain a shared memory management table (SMMT) (not shown) to manage the shared memory space. The SMMT includes a plurality of entries, and each entry may correspond to a CTA. Each CTA reserves one SMMT entry, which records the start address and the size of the shared memory space used by that CTA. In some embodiments, each SMMT entry may correspond to the threads of a warp or the threads of a warp group.

When a CTA is launched, the multiprocessor may check the corresponding SMMT entry to determine how much shared memory the CTA leaves unused. The multiprocessor may then place the start address and the size of the unused shared memory in a new SMMT entry to reserve space for storing cache lines and tags.

As shown in FIG. 19, the multiprocessor may place a hardware address translation unit 1910 in front of the shared memory 1812 to determine where the target cache line and its tag reside in the shared memory 1812.

Referring to FIG. 19, the address translation unit 1910 translates a global memory address into a shared memory address. The global memory address may be decomposed, from MSB to LSB, into a tag (T), a block index (L), and a byte offset (F). For example, the global memory address may be divided into a 16-bit tag (T), a 9-bit block index (L), and a 7-bit byte offset (F). The shared memory address may include a cache block address that points to the target cache data block and a tag address that points to the target tag. The cache block address includes four fields, which correspond, from LSB to MSB, to a byte offset (F), a bank index (B), a bank group index (G), and a row index (R). For example, if the shared memory 1812 uses 8-byte rows per bank, 16 banks per bank group, two bank groups, and 256 rows, the byte offset (F), the bank index (B), the bank group index (G), and the row index (R) may be 3, 4, 1, and 8 bits, respectively.

In some embodiments, the bits that remain after translating the global memory address into the cache block address (for example, the 16 bits of the tag (T)) may be used as part of the tag of the target cache data block. Because more cache data blocks are required than there are rows, the tag in the shared memory 1812 may also include the warp number (WID) and the cache block index. The warp number (WID) and the cache block index may be, for example, 6 bits and 9 bits, respectively.

The address translation unit 1910 may use an N-bit mask to convert the block index (L) of the global memory address into the row index (R), which indicates the position of a row in the shared memory 1812. For example, if the shared memory 1812 has 256 rows, the address translation unit 1910 may use an 8-bit mask. In some embodiments, the address translation unit 1910 may generate the row index (R) by combining the upper N bits of the block index (L) of the global memory address with the N-bit mask. In one embodiment, the address translation unit 1910 may generate the row index (R) as the exclusive OR (XOR) of the upper N bits of the block index (L) and the N-bit mask. The address translation unit 1910 may convert the lower M bits of the block index (L) of the global memory address into the bank group index (G), which indicates a bank group in the shared memory 1812. For example, when the upper 8 bits of the 9-bit block index (L) are converted into the row index (R) and the shared memory 1812 has two bank groups, the LSB of the block index (L) may be used as the bank group index (G).

The address translation unit 1910 may convert some bits of the byte offset (F) of the global memory address into the bank index (B), which indicates a bank of the shared memory 1812, and the remaining bits into the byte offset (F) of the cache block address. In one embodiment, the upper L bits of the byte offset (F) of the global memory address may become the bank index (B), and the lower K bits may become the byte offset (F) of the cache block address. For example, if each bank group of the shared memory 1812 has 16 banks, the upper 4 bits of the 7-bit byte offset (F) of the global memory address may become the bank index (B), and the lower 3 bits may become the byte offset (F) of the cache block address.

The tag address includes four fields, which correspond, from LSB to MSB, to a byte offset (F), a bank index (B), a bank group index (/G), and a row index (R). For example, if each physical row of a bank stores two tags, so that one row of a bank group holds 32 tags, the actual position of a tag can be indicated by 5 bits, namely a 1-bit byte offset (F) and a 4-bit bank index (B). To this end, the address translation unit 1910 may translate the lowest P bits of the row index (R) of the cache block address into the byte offset (F) of the tag address, and the next lower Q bits of the row index (R) into the bank index (B) of the tag address. In the example above, the LSB of the row index (R) of the cache block address may become the byte offset (F) of the tag address, and the next lower 4 bits of the row index (R) may become the bank index (B) of the tag address.

In addition, to indicate the row index (R) of the tag, the address translation unit 1910 may translate the remaining bits of the row index (R) of the cache block address into the row index (R) of the tag address. In the example above, the upper 3 bits of the row index (R) of the cache block address may become the row index (R) of the tag address. The address translation unit 1910 also translates the bank group index (G) of the cache block address into the bank group index (/G) of the tag address. In some embodiments, so that a cache data block and its corresponding tag can be accessed in parallel, the bank group index (G) of the cache block address may be flipped to obtain the bank group index (/G) of the tag address.
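The translation described above can be sketched in Python for the example field widths (16-bit tag, 9-bit block index, 7-bit byte offset; 3/4/1/8-bit cache block address fields). This is an illustration under assumptions, not the circuit of FIG. 19: the row index is taken from the upper 8 bits of the block index XORed with the mask, the bank group index from the LSB of the block index, and all names are hypothetical. The cache offset and tag offset registers are not modeled.

```python
from typing import NamedTuple, Tuple


class BlockAddr(NamedTuple):
    row: int     # R, 8 bits
    group: int   # G, 1 bit
    bank: int    # B, 4 bits
    offset: int  # F, 3 bits


class TagAddr(NamedTuple):
    row: int     # R, 3 bits
    group: int   # /G, 1 bit (flipped G)
    bank: int    # B, 4 bits
    offset: int  # F, 1 bit (which of the two tags in a bank row)


def translate(gaddr: int, mask: int = 0) -> Tuple[int, BlockAddr, TagAddr]:
    """Split a 32-bit global address into tag, block address, tag address."""
    f = gaddr & 0x7F             # 7-bit byte offset F
    l = (gaddr >> 7) & 0x1FF     # 9-bit block index L
    t = (gaddr >> 16) & 0xFFFF   # 16-bit tag T

    row = ((l >> 1) ^ mask) & 0xFF  # upper 8 bits of L XOR 8-bit mask
    group = l & 0x1                 # LSB of L selects the bank group
    bank = f >> 3                   # upper 4 bits of F select the bank
    offset = f & 0x7                # lower 3 bits of F: byte within the bank row
    block = BlockAddr(row, group, bank, offset)

    tag = TagAddr(
        row=row >> 5,            # upper 3 bits of R
        group=group ^ 1,         # flipped bank group, avoids bank conflicts
        bank=(row >> 1) & 0xF,   # next lower 4 bits of R
        offset=row & 0x1,        # LSB of R
    )
    return t, block, tag


def pack_block(addr: BlockAddr) -> int:
    """Pack the cache block address fields, LSB to MSB: F, B, G, R."""
    return (addr.row << 8) | (addr.group << 7) | (addr.bank << 3) | addr.offset
```

Because the tag address always uses the opposite bank group, the data block and its tag never map to the same bank, so both can be read in the same access in this sketch.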

In some embodiments, the address translation unit 1910 may realign the starting positions of the indices for cache blocks and tags by taking a cache offset register and a tag offset register into account. The cache offset register and the tag offset register can be used to track the varying size of the cache within the unused shared memory space.

Next, an example of an algorithm implementing a warp scheduling method in a parallel processing unit according to an embodiment of the present invention is described.

In Algorithm 1, i denotes the warp number of the warp to be scheduled and can be obtained, for example, through the getWarpToBeScheduled() function. InstNo is the total number of executed instructions and can be obtained, for example, through the getNumInstructions() function. ActiveWarpNo denotes the number of active warps running on the multiprocessor and can be obtained, for example, through the getNumActiveWarp() function.

For warp i (Warp(i)), if the active flag (V) is '0' and the low cutoff epoch has ended, it is determined whether to reactivate warp i from the stalled state. To this end, the warp number (k) of warp k (Warp(k)), which caused warp i to stall, is obtained from the second field (Pair_List[i][1]) of the pair list entry corresponding to warp i. If the level of cache interference of warp k (ISR_k, written IRS_k in Algorithm 1) is higher than the low cutoff and warp k still needs to execute, warp i remains stalled. Otherwise, warp i is reactivated: the active flag of warp i is set to '1', and the second field (Pair_List[i][1]) of the pair list entry is cleared (for example, set to '-1').

For warp i, if the isolation flag (I) is '1' and the low cutoff epoch has ended, it is determined whether to withdraw the isolation of warp i. To this end, the warp number (k) of warp k (Warp(k)), which caused warp i to be partitioned, is obtained from the first field (Pair_List[i][0]) of the pair list entry corresponding to warp i. If the level of cache interference of warp k (ISR_k) is higher than the low cutoff and warp k still needs to execute, warp i remains isolated. Otherwise, the isolation of warp i is withdrawn: the isolation flag of warp i is set to '0', and the first field (Pair_List[i][0]) of the pair list entry is cleared (for example, set to '-1').

For warp i, if the active flag (V) is '1' and the high cutoff epoch has ended, it is determined whether to partition warp j, which interferes with the active warp i. To this end, the warp number (j) of warp j is obtained from the interference list entry (Interference_List[i]) corresponding to warp i. When the level of cache interference of warp i (ISR_i) is higher than the high cutoff and warp j is a different warp from warp i, warp j is stalled if it is already isolated (that is, if the isolation flag (I) of warp j is '1'). In that case, the active flag (V) of warp j is set to '0', and the second field (Pair_List[j][1]) of the pair list entry corresponding to warp j is set to i. Otherwise, if warp j is in the active state (that is, if the isolation flag (I) of warp j is '0'), warp j is isolated: the isolation flag (I) of warp j is set to '1', and the first field (Pair_List[j][0]) of the pair list entry corresponding to warp j is set to i.

[Algorithm 1]

i := getWarpToBeScheduled()
InstNo := getNumInstructions()
ActiveWarpNo := getNumActiveWarp()
if Warp(i).V == 0 and end of low cut-off epoch then
    /* Warp(i) is deactivated */
    k := Pair_List[i][1]
    IRS_k := VTAHit[k]/(InstNo/ActiveWarpNo)
    if IRS_k > low-cutoff and Warp(k) needs executing then
        continue
    else
        Warp(i).V := 1
        Pair_List[i][1] := -1 // cleared
else if Warp(i).I == 1 and end of low cut-off epoch then
    /* Warp(i) redirects to access shared memory */
    k := Pair_List[i][0]
    IRS_k := VTAHit[k]/(InstNo/ActiveWarpNo)
    if IRS_k > low-cutoff and Warp(k) needs executing then
        continue
    else
        Warp(i).I := 0
        Pair_List[i][0] := -1 // cleared
if Warp(i).V == 1 and end of high cut-off epoch then
    /* Warp(i) is active */
    IRS_i := VTAHit[i]/(InstNo/ActiveWarpNo)
    j := Interference_List[i]
    if IRS_i > high-cutoff and j != i then
        if Warp(j).I == 1 then
            Warp(j).V := 0
            Pair_List[j][1] := i
        else if Warp(j).I == 0 then
            Warp(j).I := 1
            Pair_List[j][0] := i
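For readers who want to trace Algorithm 1 step by step, the following is a minimal Python sketch of one scheduling decision. It is an illustration under assumptions rather than the hardware scheduler itself: the Warp and State classes, the needs_executing field, the end_low and end_high arguments, and the helper names are hypothetical, and the pseudocode's "continue" is modeled as returning early for the current warp.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    V: int = 1                    # active flag: 1 = active, 0 = stalled
    I: int = 0                    # isolation flag: 1 = redirected to shared memory
    needs_executing: bool = True  # hypothetical stand-in for "Warp(k) needs executing"


@dataclass
class State:
    warps: list
    vta_hit: list            # VTAHit[k]
    interference_list: list  # Interference_List[i]
    pair_list: list          # Pair_List[i] = [isolated_by, stalled_by]; -1 = cleared
    low_cutoff: float
    high_cutoff: float


def irs(s, k, inst_no, active_warp_no):
    """Interference level of warp k: VTA hits per (instructions per active warp)."""
    return s.vta_hit[k] / (inst_no / active_warp_no)


def keep_restriction(s, k, inst_no, active_warp_no):
    """True while warp k still suffers interference and still has work to do."""
    return irs(s, k, inst_no, active_warp_no) > s.low_cutoff and s.warps[k].needs_executing


def schedule_step(s, i, inst_no, active_warp_no, end_low, end_high):
    w = s.warps[i]
    if w.V == 0 and end_low:              # warp i is stalled
        k = s.pair_list[i][1]
        if keep_restriction(s, k, inst_no, active_warp_no):
            return                        # "continue": warp i stays stalled
        w.V = 1
        s.pair_list[i][1] = -1            # cleared
    elif w.I == 1 and end_low:            # warp i is isolated
        k = s.pair_list[i][0]
        if keep_restriction(s, k, inst_no, active_warp_no):
            return                        # "continue": warp i stays isolated
        w.I = 0
        s.pair_list[i][0] = -1            # cleared
    if w.V == 1 and end_high:             # warp i is active
        j = s.interference_list[i]
        if irs(s, i, inst_no, active_warp_no) > s.high_cutoff and j != i:
            wj = s.warps[j]
            if wj.I == 1:                 # already isolated: stall warp j
                wj.V = 0
                s.pair_list[j][1] = i
            else:                         # still using the L1D cache: isolate warp j
                wj.I = 1
                s.pair_list[j][0] = i
```

Called repeatedly for the same interfered warp, schedule_step first isolates the interfering warp and then stalls it; once the interference level falls below the low cutoff, the stall is lifted first and the isolation afterwards.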

Next, the results of measuring the performance of a parallel processing unit according to an embodiment of the present invention, implemented on actual hardware, are described.

As shown in Table 1 below, a parallel processing unit including 15 streaming multiprocessors (SMs) is used for the performance measurement. Up to 1536 threads are created per SM, and two warp schedulers are used per SM. The L1D cache is either a 16KB, 4-way cache with 128-byte cache lines or a 48KB, 6-way cache with 128-byte cache lines.

[Table 1]

Parameter               Value
# of SMs                15
# of threads            Max 1536 per SM
# of warp schedulers    2 per SM
L1D cache               16KB w/ 128B lines, and 4 ways; 48KB w/ 128B lines, and 6 ways
L1D policy              Write no-allocate, local write-back, global write-through, and pseudo LRU
L2 cache                768KB w/ 128B lines, and 8 ways
L2 policy               Allocate on miss, write-back, and pseudo LRU
DRAM                    GDDR5 w/ 16 banks, tCL=12, tRCD=12, and tRAS=28
Victim Tag Array        8 tags per set, 48 sets, and FIFO
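The cache geometries in Table 1 imply a set count for each configuration, since sets = capacity / (line size × ways). The short Python sketch below works through that arithmetic. The helper name num_sets is hypothetical, and treating the 768KB L2 cache as a single monolithic cache is an assumption made only for this calculation.

```python
KB = 1024


def num_sets(capacity_bytes: int, line_bytes: int, ways: int) -> int:
    """Number of sets in a set-associative cache of the given geometry."""
    sets, remainder = divmod(capacity_bytes, line_bytes * ways)
    assert remainder == 0, "capacity must be a multiple of line size times ways"
    return sets


l1d_16k_sets = num_sets(16 * KB, 128, 4)  # 16KB, 4-way L1D
l1d_48k_sets = num_sets(48 * KB, 128, 6)  # 48KB, 6-way L1D
l2_sets = num_sets(768 * KB, 128, 8)      # L2, assumed monolithic
vta_tags = 8 * 48                         # victim tag array: tags per set x sets
```

Under these assumptions the two L1D configurations have 32 and 64 sets, the L2 cache has 768 sets, and the victim tag array holds 384 tags in total.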

In this case, the warp scheduling method using warp partitioning alone improved performance by 32% over conventional warp scheduling. The warp scheduling method using both warp partitioning and warp throttling improved performance by 54% over conventional warp scheduling.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention as defined in the following claims also fall within the scope of the present invention.

Claims

A parallel processing unit comprising a multiprocessor,
wherein the multiprocessor comprises:
a memory unit including a level 1 data (L1D) cache and a shared memory to which a plurality of cache lines are allocated; and
a thread group scheduler that, when a predetermined first condition is satisfied for a first thread group in an active state, isolates a second thread group by changing the cache access of the second thread group to the shared memory.

The parallel processing unit of claim 1,
wherein the first condition comprises a condition that the level of cache interference for the first thread group exceeds a first cutoff.

The parallel processing unit of claim 2,
wherein the second thread group comprises the thread group that most frequently interferes with the first thread group among the thread groups causing interference with the first thread group.

The parallel processing unit of claim 2,
wherein the thread group scheduler changes the cache access of the second thread group back to the L1D cache if a predetermined second condition is satisfied.

The parallel processing unit of claim 4,
wherein the second condition comprises a condition that the level of cache interference for the first thread group is lower than a second cutoff that is lower than the first cutoff.

The parallel processing unit of claim 4,
wherein the second condition comprises a condition that execution of the first thread group is completed.

The parallel processing unit of claim 4,
wherein, in a cycle consisting of a first period and a second period, the thread group scheduler evaluates the first condition at the end of the first period and the second condition at the end of the second period.

The parallel processing unit of claim 1,
wherein the thread group scheduler stalls the isolated second thread group if a predetermined second condition is satisfied.

The parallel processing unit of claim 8,
wherein the second condition comprises a condition that the level of cache interference for a third thread group isolated to the shared memory exceeds a first cutoff.

The parallel processing unit of claim 9,
wherein the thread group scheduler reactivates the stalled second thread group if a predetermined third condition is satisfied.

The parallel processing unit of claim 10,
wherein the third condition comprises a condition that the level of cache interference for the third thread group is lower than a second cutoff that is lower than the first cutoff.

The parallel processing unit of claim 10,
wherein the second condition comprises a condition that execution of the third thread group is completed.

The parallel processing unit of claim 10,
wherein, in a cycle consisting of a first period and a second period, the thread group scheduler evaluates the second condition at the end of the first period and the third condition at the end of the second period.

A computing device comprising:
the parallel processing unit according to claim 1;
a central processing unit (CPU);
a system memory; and
a memory bridge connecting the parallel processing unit, the CPU, and the system memory.

A parallel processing unit comprising a multiprocessor,
wherein the multiprocessor comprises:
a memory unit including a level 1 data (L1D) cache and a shared memory; and
a thread group scheduler,
wherein the shared memory comprises a plurality of shared memory banks each having a plurality of rows,
wherein the plurality of shared memory banks are grouped into a plurality of bank groups,
wherein a plurality of cache lines are allocated to the plurality of shared memory banks belonging to each bank group, and
wherein the thread group scheduler changes the cache access of some thread groups to the cache lines of the shared memory.

The parallel processing unit of claim 16,
wherein a tag for a cache line allocated to a certain bank group, together with the number of the corresponding thread group, is allocated in another bank group.

A thread group scheduling method of a parallel processing unit comprising a memory unit including a level 1 data (L1D) cache and a shared memory, the method comprising:
detecting a level of cache interference for a first thread group; and
when the level of cache interference exceeds a first cutoff, isolating a second thread group, which satisfies a predetermined condition among the thread groups causing interference with the first thread group, by changing the cache access of the second thread group to the shared memory.

The method of claim 17, further comprising:
detecting the level of cache interference for the first thread group again; and
changing the cache access of the second thread group back to the L1D cache if the re-detected level of cache interference is lower than a second cutoff that is lower than the first cutoff.

The method of claim 17, further comprising:
detecting a level of cache interference for a third thread group isolated to the shared memory; and
stalling the isolated second thread group if the level of cache interference for the third thread group exceeds a second cutoff.

The method of claim 19, further comprising:
detecting the level of cache interference for the third thread group again; and
reactivating the stalled second thread group if the re-detected level of cache interference for the third thread group is lower than a third cutoff that is lower than the second cutoff.