KR101940523B1

KR101940523B1 - Apparatus and method for warp scheduling

Info

Publication number: KR101940523B1
Application number: KR1020180050278A
Authority: KR
Inventors: 김철홍; 김광복; 콩 튜안 두
Original assignee: 전남대학교산학협력단
Priority date: 2018-04-30
Filing date: 2018-04-30
Publication date: 2019-01-21

Abstract

The present invention relates to warp scheduling capable of increasing parallelism in instruction processing. A method of operating an electronic device comprises the steps of: determining an instruction that causes a latency greater than or equal to a threshold value; assigning a priority to a warp corresponding to the determined instruction; and executing the instruction according to the priority of the warp.

Description

APPARATUS AND METHOD FOR WARP SCHEDULING

The present invention relates to warp scheduling and, more particularly, to warp scheduling based on the latency of instructions for a warp.

Graphics processing units (GPUs) are widely used to process general-purpose workloads as well as graphics workloads. For parallel applications, a GPU can deliver much higher performance than a central processing unit (CPU). In particular, the computing power of a GPU is achieved by using a large number of streaming multiprocessors (SMs) together with high memory bandwidth in the gigabyte range. As a result, the GPU has become a computing platform for general-purpose workloads and has received increasing research attention.

Because GPU architectures for general-purpose computing are designed with a focus on high throughput, a non-speculative, on-demand processor pipeline is adopted as a trade-off in favor of a large number of compute units. To overcome one important problem, the overhead of long-latency operations such as global loads, the GPU supports fast context switching and can schedule as many concurrent warps as possible to hide this latency. That is, whenever the execution of a warp stalls, the warp can be swapped out and another warp swapped in immediately, which raises resource utilization without penalty. To keep the pipeline active during such long-latency operations, the warp scheduler selects a ready warp from the warp pool in each cycle to execute its next instruction. In theory, a GPU can achieve high thread-level parallelism (TLP) in this way, but in practice the pipeline still suffers from long periods of inactivity that lower the utilization of hardware resources.

The warp scheduler can play a key role in increasing the utilization of hardware resources in the GPU. The commonly used round-robin (RR) warp scheduling policy assigns the same priority to all warps. However, previous studies have shown that traditional RR scheduling fails to hide the long memory fetch latency caused by limited off-chip DRAM bandwidth, so that the utilization of the GPU SMs drops significantly. In addition, the RR scheme can waste warp/thread parallelism on hiding short-latency operations. Moreover, if all warps execute long-latency operations, RR scheduling cannot hide that latency, which can lead to serious performance degradation.

[Patent Document 1] Korean Patent No. 10-1236562

The present invention has been made to solve the above-mentioned problems, and an object of the present invention is to provide an apparatus and method for warp scheduling.

Another object of the present invention is to provide an apparatus and method for warp scheduling based on the latency of instructions for a warp.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood from the following description.

To achieve the above objects, a method of operating an electronic device according to an embodiment of the present invention may include: determining an instruction that causes a latency greater than or equal to a threshold value; assigning a priority to a warp corresponding to the determined instruction; and executing the instruction according to the priority of the warp.

In an embodiment, determining the instruction may include: determining a first instruction that causes a latency greater than or equal to the threshold value; and determining a second instruction that causes a latency less than the threshold value.

In an embodiment, a first priority of the warp corresponding to the first instruction may be higher than a second priority of the warp corresponding to the second instruction.

In an embodiment, executing the instruction may include: executing the first instruction according to the first priority and the second priority; and executing the second instruction after the first instruction has been executed.

In an embodiment, the warp corresponding to the first instruction may be included in a guiding pool, and the warp corresponding to the second instruction may be included in a filling pool.

In an embodiment, an electronic device may include a warp scheduler that determines an instruction that causes a latency greater than or equal to a threshold value, assigns a priority to a warp corresponding to the determined instruction, and executes the instruction according to the priority of the warp.

In an embodiment, the warp scheduler may determine a first instruction that causes a latency greater than or equal to the threshold value and a second instruction that causes a latency less than the threshold value.

In an embodiment, the warp scheduler may execute the first instruction according to the first priority and the second priority, and may execute the second instruction after the first instruction has been executed.

Specific details for achieving the above objects will become clear with reference to the embodiments described in detail below together with the accompanying drawings.

The present invention, however, is not limited to the embodiments disclosed below and may be configured in various different forms. The embodiments are provided so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to a person having ordinary skill in the art to which the present invention pertains (hereinafter, "a person skilled in the art").

According to an embodiment of the present invention, the parallelism of instruction processing can be improved by ordering instructions on a per-warp basis so that instructions causing a long latency are executed preferentially.

The effects of the present invention are not limited to those described above, and potential effects expected from the technical features of the present invention will be clearly understood from the following description.

FIG. 1 is a diagram illustrating a functional configuration of a basic GPU architecture according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of warp scheduling according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a functional configuration of an electronic device according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a functional configuration of a warp issue unit according to an embodiment of the present invention.
FIG. 5 is a flowchart of a method of operating a warp scheduler according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating pseudo code of a warp scheduler according to an embodiment of the present invention.
FIG. 7 is a performance graph of a warp scheduler according to an embodiment of the present invention.
FIG. 8 is a graph of the L1 data cache miss rate of a warp scheduler according to an embodiment of the present invention.
FIG. 9 is a graph of the L2 cache miss rate of a warp scheduler according to an embodiment of the present invention.
FIG. 10 is a cache access graph of a warp scheduler according to an embodiment of the present invention.
FIG. 11 is a stall cycle graph of a warp scheduler according to an embodiment of the present invention.
FIG. 12 is another stall cycle graph according to an embodiment of the present invention.
FIG. 13 is another performance graph of a warp scheduler according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

The various features of the invention disclosed in the claims may be better understood in view of the drawings and the detailed description. The devices, methods, processes, and various embodiments disclosed in the specification are provided for illustration. The disclosed structural and functional features are intended to enable a person skilled in the art to practice the various embodiments and are not intended to limit the scope of the invention. The disclosed terms and sentences are intended to explain the various features of the disclosed invention in an easily understandable way and are not intended to limit the scope of the invention.

In describing the present invention, a detailed description of related known technology is omitted where it is judged that it would unnecessarily obscure the gist of the invention.

Hereinafter, warp scheduling based on the latency of instructions for a warp according to an embodiment of the present invention is described. Here, a warp is a set of 32 threads and may refer to the basic unit in which a graphics processing unit (GPU) performs operations.

FIG. 1 is a diagram illustrating a functional configuration of a basic GPU architecture according to an embodiment of the present invention.

Referring to FIG. 1, the GPU 100 may include a streaming multiprocessor (SM) 110, an interconnection network 120, a memory controller 130, and a GDDR RAM 140.

In one embodiment, the GPU 100 may include a plurality of SMs 110. Each SM 110 may typically include 8 to 32 single instruction, multiple thread (SIMT) lanes. For example, the GPU 100 may include 15 SMs 110, each with a SIMT width of 32, that is, 32 instructions per cycle (IPC). One instruction from each of 32 threads may be issued collectively. Every SIMT lane executes a separate thread and operates on scalar registers, with all lanes proceeding in lockstep. In addition, the SM 110 may include an L1 data cache, a texture cache, and a constant cache.

The interconnection network 120 may be connected to the SMs 110. The SM 110 is multithreaded and pipelined, and consists mainly of a register file (that is, an operand collector), execution units (arithmetic logic unit (ALU), floating-point unit (FPU), and special function unit (SFU)), and a load/store (LD/ST) unit. The GPU 100 may include a multi-level memory hierarchy. To support parallel computing, the GPU 100 may provide a very large on-chip register file per SM 110 so that it can accommodate many threads. On-chip storage may be partitioned into special-purpose caches such as the L1 data cache, the read-only texture cache, the constant cache, and low-latency shared memory. Depending on the destination of the data, a memory request is sent to the corresponding cache memory. The dedicated L1 data cache of each SM 110 has a short latency and can be accessed by all threads in a cooperative thread array (CTA). Each memory controller is associated with a slice of the shared L2 cache banks, and the controller together with its slice may be defined as a memory partition. Global memory (also called device memory) is located off-chip, has a long latency, and can be accessed by all thread blocks in the grid.

With newer parallel programming models such as CUDA and OpenCL, general-purpose computing on graphics processing units (GPGPU) applications can be implemented as a hierarchy. At the top level, a kernel implements a particular module of the application. The threads of a kernel are grouped into CTAs, also called thread blocks (TBs). The CTA is an abstraction that confines all synchronization and barrier primitives to the threads within it, which lets the underlying hardware execute CTAs in any order to maximize throughput. Each CTA is in turn subdivided into warps, each of which typically contains 32 sibling threads. Parallel kernel launches may be permitted on the same GPU or across several GPUs. If an application contains multiple CUDA streams, several kernels can run concurrently. Because the SM architecture is based on the SIMT execution model, in which threads execute in warps, the instruction fetch and decode overhead can be amortized.

For efficient execution, thread scheduling may be performed as a multi-step process. First, a kernel of the GPGPU application is launched on the GPU, and the global CTA scheduler assigns the CTAs of the launched kernel to all available SMs. In this step, CTA assignment may be performed in a round-robin (RR) manner. The GPU 100 can use its compute resources more efficiently through fine-grained multithreading (FGMT), and one or several CTAs may be assigned to an SM 110. However, the number of CTAs that can run concurrently on an SM 110 is limited by hardware resources. That is, as long as enough CTAs remain and the SM 110 can run several of them, CTA assignment continues until the maximum number is reached. After CTA assignment, the warps belonging to the launched CTAs are scheduled onto the SIMT lanes of the corresponding SM 110. A number of warp schedulers may be used, such as loose round-robin (LRR), two-level scheduling, and greedy-then-oldest (GTO).
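The round-robin CTA assignment described above can be sketched as follows. This is only an illustrative model, not the patent's implementation: the function name `assign_ctas_round_robin` and the parameter `max_ctas_per_sm` (standing in for the hardware resource limit) are assumptions introduced for the example.

```python
def assign_ctas_round_robin(num_ctas, num_sms, max_ctas_per_sm):
    """Assign CTA ids to SMs in round-robin order until every SM is full.

    Returns (assignment, pending). `assignment` maps an SM index to the list
    of CTA ids resident on it; `pending` holds CTAs that could not be placed
    because every SM already reached `max_ctas_per_sm` (a hypothetical limit
    standing in for register, shared-memory, and thread-slot constraints).
    """
    assignment = {sm: [] for sm in range(num_sms)}
    pending = []
    sm = 0
    for cta in range(num_ctas):
        # Visit SMs in round-robin order, looking for one that still has room.
        for _ in range(num_sms):
            has_room = len(assignment[sm]) < max_ctas_per_sm
            if has_room:
                assignment[sm].append(cta)
            sm = (sm + 1) % num_sms
            if has_room:
                break
        else:
            # All SMs are full; this CTA waits until a resident CTA finishes.
            pending.append(cta)
    return assignment, pending
```

For example, seven CTAs on three SMs with room for two CTAs each fill every SM and leave one CTA waiting.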

In one embodiment, if the desired data is not present in the register file, local memory is accessed to request the data needed to execute the instruction. If the data request (or memory request) misses in local memory, the request is forwarded so that it can access the GDDR RAM 140. In this case, the data request to the GDDR RAM 140 is handled by the memory controller 130.

FIG. 2 is a diagram illustrating an example of warp scheduling according to an embodiment of the present invention.

Referring to FIG. 2, six warps (W1, ..., W6) run concurrently on an SM. Each warp executes instructions, and the instructions are classified into two types of operation: memory and computation. W1, W3, W5, and W6 execute instructions in the order mem->mem->compt->mem, whereas W2 and W4 execute instructions in the order compt->compt->mem->mem. Computation operations generally have a short latency, while a memory operation has a long latency on a cache miss and a short latency on a cache hit. It is therefore assumed that the latency of a memory operation that hits in the cache is slightly longer than the latency of a computation operation. FIG. 2 shows the latency of a computation operation and the latency of a memory operation that hits in the cache. When a memory operation misses in the cache, the requested data must be fetched from lower-level memory. As a result, this type of operation requires a much longer latency than a computation operation.

In one embodiment, the GPU can perform fast context switching. That is, the pipeline is kept active by immediately moving the access priority for hardware resources from one warp to another.

With the conventional RR algorithm, warps are scheduled in a round-robin manner. That is, when the currently executing warp stalls, a predefined warp is selected to execute the next instruction regardless of that warp's state. As a result, idle time grows and performance can degrade severely.

With the conventional loose round-robin (LRR) algorithm, the scheduling order of warps is more flexible, so if the predefined warp is not ready, it can be replaced immediately by a ready warp. This largely solves the problem of the RR scheme and can significantly increase the use of hardware resources.
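The difference between the RR and LRR selection rules described above can be sketched as follows. This is a simplified model under stated assumptions: warps are indexed 0 to n-1, `ready[i]` says whether warp i can issue, and an idle cycle is represented by `None`. The function names are hypothetical.

```python
def pick_strict_rr(ready, last_issued):
    """Strict RR: the next warp in fixed order is chosen regardless of state.

    Returns that warp's index if it is ready, or None to model an idle cycle
    when the predefined warp is stalled.
    """
    nxt = (last_issued + 1) % len(ready)
    return nxt if ready[nxt] else None


def pick_loose_rr(ready, last_issued):
    """LRR: start from the next warp in order, but skip over stalled warps.

    Returns the first ready warp found, or None if every warp is stalled.
    """
    n = len(ready)
    for offset in range(1, n + 1):
        candidate = (last_issued + offset) % n
        if ready[candidate]:
            return candidate
    return None
```

With `ready = [True, False, True, False]` and warp 0 issued last, the strict rule idles because warp 1 is stalled, while the loose rule moves on to warp 2.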

With the conventional GTO algorithm, hardware resources can be used efficiently because data locality (intra-warp data locality) is preserved.

However, the methods described above do not preserve inter-warp locality, and they do not leave enough parallelism to overlap long latencies. In particular, the computation instructions in each computation region are all consumed, so stall cycles occur whenever the scheduler cannot find a ready instruction. This leaves the SM idle, because no non-stalled warp remains to be assigned to it.

Various embodiments of the present invention provide a solution to the above problem and improve the ability of existing GPU warp schedulers to hide long latencies. When there is a sufficient number of concurrent threads, the scheduling policies above can hide long latencies, depending on the workload characteristics and on how the parallelism is used. Warp scheduling according to various embodiments of the present invention provides a way to conserve the available TLP and to hide multiple long-latency operations. In a GPGPU application, each warp may execute a small number of global loads, separated by a few simple computation operations and/or shared memory stores, before it reaches the actual compute region that contains many short-latency operations. When these global loads are close together, switching priority at the stall of every long-latency load produces several small computation regions that are too short to overlap the subsequent global loads. A different approach to conserving TLP is therefore to schedule the warps that execute long-latency operations consecutively. When long latencies lie close together, they overlap one another before other warps are used to overlap the remaining part of the long latency. This approach makes heavy use of the GPU's parallel capability and avoids the problem of having too little TLP left to overlap later long latencies.

Because W1, W3, W5, and W6 execute long-latency operations, they are given priority over W2 and W4 and may be scheduled among themselves in an RR manner. The long latency caused by an earlier warp can be overlapped by a later warp. If these warps cannot hide one another's latency, W2 and W4, which execute short-latency operations, are scheduled to overlap the remaining part of the preceding long latency. Stall periods still occur with this strategy, but they can be shorter than with the warp scheduling techniques described above. That is, a solution according to various embodiments of the present invention can provide the best throughput. It can be an effective way to hide long latencies because it not only conserves parallelism for the time it is needed but also reduces SM idle time. Because hardware resource utilization rises, the overall performance of the GPU can improve.

FIG. 3 is a diagram illustrating an electronic device 300 according to an embodiment of the present invention. In one embodiment, the electronic device 300 may include a shader core.

Referring to FIG. 3, the electronic device 300 may include a fetch unit 310, an instruction cache (I-cache) unit 320, a decoding unit 330, an I-buffer unit 340, a scoreboard 350, a warp issue unit 360, a register unit 370, an ALU 380, and a MEM 390.

In every cycle, the fetch unit 310 may select a program counter (PC) and an empty slot in the I-buffer 340 for a warp. The corresponding instruction is fetched from the I-cache 320 and decoded by the decoding unit 330 before it is placed in the empty slot of the I-buffer 340. The instruction waits in the I-buffer 340 until its ready bit is set by the scoreboard 350, that is, until the previous instruction from this warp has completed.

The warp issue unit 360 may select a warp according to a particular scheduling algorithm and issue that warp's ready instruction from the I-buffer 340 for execution. The slot in the I-buffer 340 that held the issued instruction is then marked invalid to signal the fetch unit 310 to fetch the next instruction for this warp. The issued instruction is executed in the ALU 380 or the MEM 390, depending on its type, using operands from the register unit 370. The warp scheduling algorithm according to various embodiments of the present invention may therefore be implemented in the warp issue unit 360. That is, the warp issue unit 360 may operate as the warp scheduler of the electronic device 300.

FIG. 4 is a diagram illustrating a functional configuration of a warp issue unit 460 according to an embodiment of the present invention. In one embodiment, the warp issue unit 460 may be a detailed structure of the warp issue unit 360 of FIG. 3.

Referring to FIG. 4, warps may be stored in a warp pool before scheduling. The warp issue unit 460 may perform a classification process that sorts warps according to the characteristics of their next instruction. In the classification process, the entry of the I-buffer 340 corresponding to each warp is examined to identify the type of instruction that the warp will issue. The scoreboard 350, meanwhile, may be used to predict whether the next instruction to be issued will incur a long latency. After classification, every warp except the last issued warp is stored in either a guiding pool or a filling pool. The guiding pool contains warps that will issue a long-latency load operation at the next scheduling time. The filling pool contains warps that will issue a short-latency load operation. When the type of the next instruction of a warp changes, the warp may be moved from one pool to the other. In general, warps in the guiding pool have higher priority than warps in the filling pool. If no warp from either the guiding pool or the filling pool can be issued, the last issued warp may be scheduled.
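The classification into a guiding pool and a filling pool, and the resulting selection order, can be sketched as follows. This is a minimal software model under stated assumptions, not the hardware logic of the patent: the `Warp` record, its `next_kind` and `ready` fields, and the function names are hypothetical, and the latency class of the next instruction is taken as given rather than predicted from the I-buffer and scoreboard. Requiring the last issued warp to be ready in the fallback step is also an assumption.

```python
from dataclasses import dataclass

LONG, SHORT = "long", "short"


@dataclass
class Warp:
    wid: int
    next_kind: str      # LONG or SHORT: latency class of the next instruction
    ready: bool = True  # whether the next instruction can issue this cycle


def classify(warps, last_issued_id):
    """Split all warps except the last issued one into the two pools."""
    guiding, filling = [], []
    for w in warps:
        if w.wid == last_issued_id:
            continue
        (guiding if w.next_kind == LONG else filling).append(w)
    return guiding, filling


def select_warp(warps, last_issued_id):
    """Pick the next warp: guiding pool first, then filling pool, then the
    last issued warp as a fallback. Returns a warp id, or None on a stall."""
    # Reclassifying on every call models a warp moving between pools when
    # the type of its next instruction changes.
    guiding, filling = classify(warps, last_issued_id)
    for pool in (guiding, filling):
        for w in pool:
            if w.ready:
                return w.wid
    for w in warps:
        if w.wid == last_issued_id and w.ready:
            return w.wid
    return None
```

In this sketch a ready warp with a long-latency next instruction always wins over any warp in the filling pool, which mirrors the priority relation between the two pools described above.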

A warp scheduler according to various embodiments of the present invention may be implemented with very simple hardware logic. The main computational overhead may arise from the classification process, including accesses to the I-buffer and the scoreboard. Classifying a warp and pushing it into the corresponding warp pool may introduce a delay, but because this is performed as a simple parallel operation, it may have little effect on the system. In terms of storage overhead, additional storage may be required to implement the separate warp pools.

FIG. 5 is a flowchart of a method of operating an electronic device 500 according to an embodiment of the present invention. In one embodiment, the electronic device 500 may include a shader core.

Referring to FIG. 5, in step S501, the electronic device 500 may determine an instruction that causes a latency equal to or greater than a threshold value. In one embodiment, the electronic device 500 may determine a first instruction that causes a latency equal to or greater than the threshold value. The electronic device 500 may also determine a second instruction that causes a latency less than the threshold value.

In step S503, the electronic device 500 may assign a priority to the warp corresponding to the determined instruction. In one embodiment, a first priority of the warp corresponding to the first instruction may be higher than a second priority of the warp corresponding to the second instruction.

In step S505, the electronic device 500 may execute instructions according to the priority of the warps. In one embodiment, the electronic device 500 may execute the first instruction according to the first priority and the second priority. The electronic device 500 may then execute the second instruction after executing the first instruction.
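Steps S501 to S505 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the claimed implementation: the function `schedule_by_latency`, the per-warp latency estimates, and the threshold of 100 cycles are all hypothetical values chosen for the example.

```python
from typing import Dict, List, Tuple


def schedule_by_latency(
    estimated_latency: Dict[int, int], threshold: int
) -> Tuple[Dict[int, int], List[int]]:
    """Return (priority per warp, execution order of warp IDs).

    estimated_latency maps a warp ID to the estimated latency (in cycles)
    of that warp's next instruction; these estimates are assumed inputs.
    """
    # S501 + S503: an instruction at or above the threshold gives its warp
    # the higher priority (1); otherwise the warp gets the lower priority (0).
    priority = {
        warp_id: 1 if latency >= threshold else 0
        for warp_id, latency in estimated_latency.items()
    }
    # S505: execute high-priority warps first. sorted() is stable, so warps
    # with equal priority keep their original relative order.
    order = sorted(priority, key=lambda warp_id: -priority[warp_id])
    return priority, order


priority, order = schedule_by_latency({0: 4, 1: 400, 2: 8, 3: 350}, threshold=100)
```

Here warps 1 and 3 correspond to the first instruction of the description (latency at or above the threshold) and warps 0 and 2 to the second instruction.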

FIG. 6 is a diagram illustrating pseudo code of a warp scheduler according to an embodiment of the present invention.

Referring to FIG. 6, the pseudo code of the warp scheduler according to various embodiments of the present invention may include a sorting step and a scheduling step.

In the sorting step, every warp is examined and pushed into either the guiding pool or the filling pool. A warp that will execute a long-latency load at the next scheduling time is pushed into the guiding pool; otherwise it is pushed into the filling pool. This operation may be implemented using warp IDs.

In the scheduling step, because warps from the guiding pool cause long latencies, the warp scheduler according to various embodiments of the present invention may give warps from the guiding pool a higher priority than warps from the filling pool. Guiding warps are scheduled in LRR order and may keep their priority until the system can no longer issue any warp of the guiding type. If no guiding warp is ready, for example because its next instruction has not been buffered, a warp from the filling pool may be swapped in immediately to overlap the long latency caused by the previous guiding warp. Filling warps may likewise be scheduled in LRR order until some guiding warp becomes ready to be scheduled. If no warp is available for any reason, the last issued warp may be issued. Within the filling pool, a warp that has a ready instruction in the I-buffer may have a higher priority than the other filling warps. A warp in one warp pool may move to the other pool. The I-buffer may be accessed using a warp ID. The returned result may be the type of the instruction (opcode) and the operands required to execute that instruction for the given warp ID. The scoreboard may likewise be accessed by warp ID. The results of accessing the I-buffer and the scoreboard may be used as information for predicting whether the next instruction of a warp is a long-latency load operation.
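The selection order of the scheduling step can be sketched as follows. This is a hedged Python illustration, not the pseudo code of FIG. 6: the helper names `lrr_pick` and `select_warp` are hypothetical, the `ready` set stands in for "has a buffered, issuable instruction", and LRR is approximated by searching warp IDs in circular order starting just after the last issued warp.

```python
from typing import Iterable, Optional, Set


def lrr_pick(pool: Iterable[int], ready: Set[int], last_id: Optional[int]) -> Optional[int]:
    """Return the first ready warp ID in circular order after last_id, or None."""
    ordered = sorted(pool)
    if not ordered:
        return None
    # Start the search at the first warp ID greater than last_id, wrapping to
    # the beginning when no such ID exists (or when there is no last_id).
    start = next(
        (i for i, warp_id in enumerate(ordered) if last_id is not None and warp_id > last_id),
        0,
    )
    for offset in range(len(ordered)):
        candidate = ordered[(start + offset) % len(ordered)]
        if candidate in ready:
            return candidate
    return None


def select_warp(
    guiding: Iterable[int], filling: Iterable[int], ready: Set[int], last_issued: Optional[int]
) -> Optional[int]:
    """Prefer a ready guiding warp, then a ready filling warp, then the last issued warp."""
    for pool in (guiding, filling):
        choice = lrr_pick(pool, ready, last_issued)
        if choice is not None:
            return choice
    return last_issued  # fallback when neither pool can issue


guiding_pool, filling_pool = [0, 3], [1, 4]
first = select_warp(guiding_pool, filling_pool, ready={0, 3}, last_issued=2)
second = select_warp(guiding_pool, filling_pool, ready={1, 4}, last_issued=2)
third = select_warp(guiding_pool, filling_pool, ready=set(), last_issued=2)
```

In this sketch a filling warp is chosen only when no guiding warp is ready, which is how the short-latency work overlaps the latency of earlier guiding warps.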

FIG. 7 is a performance graph of a warp scheduler according to an embodiment of the present invention.

Referring to FIG. 7, the effect of warp scheduling on the GPU can be seen. All results for the various warp scheduling techniques are normalized to a baseline GPU that uses the LRR policy. Across all applications used in this evaluation, an overall performance improvement of 19.6% is achieved, and no application is slowed down. Memory-intensive applications such as MUM, EPP, and SAOP show performance improvements of 23.1%, 61.1%, and 52.6%, respectively. Notably, SP is less memory-intensive than the other applications, yet its performance improves by 101.3%.

When the warp scheduler according to various embodiments of the present invention is compared with warp schedulers such as two-level and GTO, the two-level scheduler, which classifies warps into an active set and a pending set according to latency, improves GPU performance only slightly, by less than 0.5% on average.

The GTO scheduler, on the other hand, achieves a strong result of 15.8% on average. Overall, these earlier schedulers outperform the RR warp scheduling policy when an application exhibits a high degree of intra-warp data locality. When the results are normalized to two-level and GTO, the warp scheduler according to various embodiments of the present invention shows average performance improvements of 18.5% and 5.6%, respectively.

FIG. 8 is a graph of the L1 data cache miss rate of a warp scheduler according to an embodiment of the present invention.

Referring to FIG. 8, the miss rate of the L1 data cache when the warp scheduler is applied can be seen. Compared with the baseline, warp scheduling according to various embodiments of the present invention reduces the cache miss rate by 8.2% on average. This means that the warp scheduling can preserve data locality, while having almost no effect on the number of accesses to the L1 data cache. The reduction in the L1 data cache miss rate may be an important factor contributing to the overall performance improvement. For example, the L1 data cache miss rates of MUM, EPP, and SAOP improve substantially, by 14.8%, 38.3%, and 37.3%, respectively, which explains why these applications benefit so much in performance from the warp scheduling. However, the cache miss rate is not the only factor that affects performance: HW and MS show only small performance gains even though their cache miss rates are lower. SP, a cache-unfriendly application, shows no reduction in its L1 data cache miss rate yet achieves very good performance.

FIG. 9 is a graph of the L2 cache miss rate of a warp scheduler according to an embodiment of the present invention.

Referring to FIG. 9, for the L2 cache, the warp scheduler according to various embodiments of the present invention affects the applications in varying ways. The number of accesses to the L2 cache for these applications decreases considerably.

FIG. 10 is a cache access graph of a warp scheduler according to an embodiment of the present invention.

Referring to FIG. 10, most applications have an L2 cache miss rate lower than or equal to the baseline, but some applications have a higher L2 cache miss rate. On average, the warp scheduler according to various embodiments of the present invention causes 2.8% more cache misses in the L2 cache. In a GPU architecture, the L2 cache is generally large, so its miss rate is lower than that of the L1 data cache. In addition, because the L2 cache is off-chip memory, it may have a weaker effect on performance than the L1 data cache. Even though the L2 cache miss rates of MUM, EPP, and SAOP increase considerably, there is no performance degradation.

FIG. 11 is a stall cycle graph of a warp scheduler according to an embodiment of the present invention.

Referring to FIG. 11, when warps that execute long-latency operations are scheduled close together, more pressure may be placed on the network-on-chip (NoC) and the memory system. Because the memory system of the GPU provides high bandwidth, it is not negatively affected by the warp scheduler according to various embodiments of the present invention. The effect of the warp scheduler on NoC bandwidth can be seen in the figure. Because the NoC is responsible for servicing the memory requests of global loads with data fetched from lower-level memory such as the L2 cache and main memory, GPU performance may degrade if the NoC stalls.

FIG. 12 is another stall cycle graph according to an embodiment of the present invention.

Referring to FIG. 12, the number of stall cycles in the memory channel caused by NoC congestion when the warp scheduler according to various embodiments of the present invention is applied can be seen. Compared with the baseline, most applications show a lower or equal number of stall cycles. The warp scheduler slightly increases NoC stall cycles only for SP and NW, but because it markedly reduces pipeline stalls in the memory stage for those applications, no performance penalty is observed for them.

FIG. 13 is another performance graph of a warp scheduler according to an embodiment of the present invention.

Referring to FIG. 13, the performance of warp scheduling according to various embodiments of the present invention and of conventional techniques, normalized to the baseline, can be seen. The warp scheduling may be compared with cache management techniques for GPUs.

One of these cache management techniques consists of two separate components: request reordering and cache bypassing. Because implementing request reordering requires additional storage (e.g., 3 KB of SRAM), the warp scheduling according to various embodiments of the present invention is compared with the cache bypassing component only. The bypass-on-associativity-stall scheme is selected as the cache bypassing scheme. The MRPB bypassing technique outperforms the baseline by 7.6% on average.

The decoupled LID technique aims to bypass requests with low reuse frequency and long reuse distance. Its bypassing performs well only on cache-unfriendly applications such as SP and NW and has a small performance impact on the other applications. This technique provides an average performance improvement of 9.8%. Compared with these two techniques, the warp scheduler according to various embodiments of the present invention performs better on more applications and achieves a higher average performance improvement.

As described above, the warp scheduling policy according to various embodiments of the present invention can mitigate the shortcomings of the basic scheduling policy (referred to as L3WS). A method of using warp parallelism effectively has been described in order to improve the overall performance of the GPU. The warp scheduler according to various embodiments of the present invention classifies warps into separate warp pools according to the characteristics of the instruction each warp will issue, so that TLP can be exploited more precisely. Warps that execute long-latency operations are given a high priority and may be classified as guiding warps. Warps that execute short-latency operations have a low priority and may be scheduled to overlap the latency caused by the guiding warps. In this way, the warp scheduler can mitigate the problem of insufficient TLP and improve system performance.

To classify memory access instructions that cause long latency separately, instructions that will cause long latency may be predicted from the scoreboard. Such instructions may be stored in a separate queue structure and executed preferentially. By ordering instructions on a per-warp basis so that instructions causing long latency are executed first, parallelism in instruction processing can be improved.

The foregoing description merely illustrates the technical idea of the present invention, and those skilled in the art may make various changes and modifications without departing from the essential characteristics of the present invention.

Accordingly, the embodiments disclosed herein are intended to describe, not to limit, the technical idea of the present invention, and the scope of the present invention is not limited by these embodiments.

The scope of protection of the present invention should be construed according to the claims, and all technical ideas within a scope equivalent thereto should be understood as falling within the scope of the present invention.

100: GPU
110: SM
120: interconnection network
130: memory controller
140: GDDR RAM
300: electronic device
310: fetch unit
320: I-cache unit
330: decoding unit
340: I-buffer unit
350: scoreboard
360: warp issue unit
370: register unit
380: ALU
390: MEM
460: warp issue unit

Claims

1. A method of operating an electronic device, the method comprising:
determining an instruction that causes a latency equal to or greater than a threshold value, based on operand-related information of a scoreboard and an instruction type;
assigning a priority to a warp corresponding to the determined instruction; and
executing the instruction according to the priority of the warp.

2. The method of claim 1,
wherein determining the instruction comprises:
determining a first instruction that causes a latency equal to or greater than the threshold value; and
determining a second instruction that causes a latency less than the threshold value.

3. The method of claim 2,
wherein a first priority of the warp corresponding to the first instruction is higher than a second priority of the warp corresponding to the second instruction.

4. The method of claim 3,
wherein executing the instruction comprises:
executing the first instruction according to the first priority and the second priority; and
executing the second instruction after executing the first instruction.

5. The method of claim 3,
wherein the warp corresponding to the first instruction is included in a guiding pool, and
the warp corresponding to the second instruction is included in a filling pool.

6. An electronic device comprising a warp scheduler that:
determines an instruction that causes a latency equal to or greater than a threshold value, based on operand-related information of a scoreboard and an instruction type,
assigns a priority to a warp corresponding to the determined instruction, and
executes the instruction according to the priority of the warp.

7. The electronic device of claim 6,
wherein the warp scheduler:
determines a first instruction that causes a latency equal to or greater than the threshold value, and
determines a second instruction that causes a latency less than the threshold value.

8. The electronic device of claim 7,
wherein a first priority of the warp corresponding to the first instruction is higher than a second priority of the warp corresponding to the second instruction.

9. The electronic device of claim 8,
wherein the warp scheduler:
executes the first instruction according to the first priority and the second priority, and
executes the second instruction after executing the first instruction.

10. The electronic device of claim 8,
wherein the warp corresponding to the first instruction is included in a guiding pool, and
the warp corresponding to the second instruction is included in a filling pool.