KR20220067289A

KR20220067289A - GPGPU thread block scheduling method and apparatus

Info

Publication number: KR20220067289A
Application number: KR1020200153825A
Authority: KR
Inventors: 반효경; 조경운
Original assignee: 이화여자대학교 산학협력단
Priority date: 2020-11-17
Filing date: 2020-11-17
Publication date: 2022-05-24
Also published as: KR102530348B1

Abstract

Disclosed are a GPGPU thread block scheduling method and apparatus. The thread block scheduling method, performed by a thread block scheduler, comprises: identifying, for each of a plurality of stream multiprocessors (SMs) constituting a GPU, a remaining resource amount of computing resources for processing a plurality of different function units; identifying, for each of a plurality of thread blocks (TBs) to be processed through the GPU, a resource usage of the computing resources according to a workload type; and allocating each of the TBs to one of the SMs based on the identified remaining resource amount of the computing resources for each of the SMs and the resource usage of the computing resources for each of the TBs. Accordingly, GPU task placement and multitasking can be achieved efficiently.

Description

GPGPU thread block scheduling method and apparatus

The present invention relates to a GPGPU thread block scheduling method and apparatus, and more particularly to a method and apparatus for scheduling thread blocks based on the resource usage characteristics of workloads executed on a GPU.

With the advent of the Fourth Industrial Revolution, GPUs are being used as parallel computing devices in diverse fields such as deep learning, blockchain, and genetic analysis. GPU management, however, has focused on maximizing the parallelism of individual workloads through a simple control structure rather than on reflecting the diversity of workloads.

Recently, as GPUs have come to support the concurrent execution of workloads from diverse fields, improving the efficiency gained from multitasking, rather than the parallel execution of individual workloads, has emerged as an important issue in GPU management.

Accordingly, a GPU thread block scheduling method that improves multitasking efficiency is required.

The present invention relates to a GPGPU thread block scheduling method and apparatus, and more particularly to a method and apparatus that analyze the resource usage characteristics of workloads executed on a GPU and place thread blocks corresponding to workloads with different resource usage characteristics on the same stream multiprocessor (hereinafter, SM).

A thread block scheduling method performed by a thread block scheduler according to an embodiment of the present invention may include: identifying, for each of a plurality of stream multiprocessors (hereinafter, SMs) constituting a GPU, a remaining resource amount of computing resources for processing a plurality of different function units; identifying, for each of a plurality of thread blocks (hereinafter, TBs) to be processed through the GPU, a resource usage of the computing resources according to a workload type; and allocating each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amount of the computing resources for each of the SMs and the resource usage of the computing resources for each of the TBs.

The computing resources may include at least one of a single-precision arithmetic unit, a double-precision arithmetic unit, a control-flow unit, a load/store unit, and a special-function arithmetic unit.

In the allocating, when two or more TBs to be allocated to a specific SM use mutually different computing resources, the TBs may be allocated to the same SM to the extent that the remaining resource amount of the computing resources of the specific SM is not exceeded.

In the allocating, each of the plurality of TBs may be allocated to the plurality of SMs sequentially, one at a time, in a round-robin manner.

A thread block scheduler according to an embodiment of the present invention includes a processor. The processor may identify, for each of a plurality of stream multiprocessors (hereinafter, SMs) constituting a GPU, a remaining resource amount of computing resources for processing a plurality of different function units; identify, for each of a plurality of thread blocks (hereinafter, TBs) to be processed through the GPU, a resource usage of the computing resources according to a workload type; and allocate each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amount of the computing resources for each of the SMs and the resource usage of the computing resources for each of the TBs.

When two or more TBs to be allocated to a specific SM use mutually different computing resources, the processor may allocate the TBs to the same SM to the extent that the remaining resource amount of the computing resources of the specific SM is not exceeded.

The processor may allocate each of the plurality of TBs to the plurality of SMs sequentially, one at a time, in a round-robin manner.

A thread block scheduling method performed by a thread block scheduler according to an embodiment of the present invention may include: identifying a remaining resource amount of computing resources and memory resources for each of a plurality of stream multiprocessors (hereinafter, SMs) constituting a GPU; identifying, for each of a plurality of thread blocks (hereinafter, TBs) to be processed through the GPU, a resource usage of the GPU according to a workload type; and allocating each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amount for each of the SMs and the resource usage of the GPU for each of the TBs, wherein the computing resources consist of units for processing a plurality of different function units, and the memory resources consist of a register file, a shared memory, and a unified cache.

In the allocating, when the computing resources and memory resources required by two or more TBs to be allocated to a specific SM do not exceed the remaining resource amount of the specific SM, the TBs may be allocated to the specific SM; when they exceed the remaining resource amount of the specific SM, the specific SM may be skipped.

A thread block scheduler according to an embodiment of the present invention includes a processor. The processor may identify a remaining resource amount of computing resources and memory resources for each of a plurality of stream multiprocessors (hereinafter, SMs) constituting a GPU; identify the resource usage of the GPU required by each of a plurality of thread blocks (hereinafter, TBs) to be processed through the GPU, based on the workload type of the TB; and allocate each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amount of each of the SMs and the GPU resource usage of each of the TBs, wherein the computing resources consist of units for processing a plurality of different function units, and the memory resources consist of a register file, a shared memory, and a unified cache.

When the computing resources and memory resources required by two or more TBs to be allocated to a specific SM do not exceed the remaining resource amount of the specific SM, the processor may allocate the TBs to the specific SM; when they exceed the remaining resource amount of the specific SM, the processor may skip the specific SM.

The present invention relates to a GPGPU thread block scheduling method and apparatus. More particularly, by analyzing the resource usage characteristics of workloads executed on a GPU and placing thread blocks corresponding to workloads with different resource usage characteristics on the same SM, GPU task placement and multitasking can be achieved efficiently.

FIG. 1 is a conceptual diagram of thread block scheduling according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the SM resources required to process a thread block according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a first method by which a TBS allocates a specific TB to an SM according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a second method by which a TBS allocates a specific TB to an SM according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a GPGPU thread block scheduling method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a conceptual diagram of thread block scheduling according to an embodiment of the present invention.

The computing hardware of a GPU is a set of computing processors (typically stream multiprocessors, hereinafter SMs) made up of hundreds or thousands of computational cores. The number of SMs varies with the type of GPU device. A single SM can typically execute tens to hundreds of instances of the same computing instruction simultaneously in hardware, each on different data.

The programming model mainly used for GPUs is one in which a single-threaded program written for one core is executed in parallel by the hardware. However, a computing problem to be solved on a GPU may consist of hundreds of thousands to millions of threads, so executing all of them simultaneously on the GPU is impossible in hardware.

To address this, the GPU provides a way to group the hundreds of thousands to millions of threads into work units of a fixed size suited to the problem and to execute them on the SMs. Such a work unit is called a thread block (hereinafter, TB), and the manager that allocates TBs to the multiple SMs in the GPU device is called a thread block scheduler (hereinafter, TBS).

Referring to FIG. 1, the TBS 100 may receive TBs included in a plurality of different sub-kernels and allocate them to the SMs, which are hardware within the GPU. Here, a sub-kernel is code executed on the GPU and can be understood as the counterpart of a program on a CPU. Sub-kernels may include TBs of different sizes depending on the kind of kernel, while the TBs included within any one sub-kernel may be the same size. That is, sub-kernel A (Kernel A) and sub-kernel B (Kernel B) may include TBs of different sizes, whereas the TBs within sub-kernel A, and likewise those within sub-kernel B, may be the same size.

A TB may consist of anywhere from 1 thread to, typically, 1024 threads, and an SM may concurrently execute the threads of a TB allocated to it by the TBS 100. A warp is the unit of threads that the GPU hardware can physically execute at the same time, and one warp typically consists of 32 threads. For example, a TB of 64 threads consists of two warps. When that TB is allocated to an SM by the TBS 100, the SM executes it 32 threads at a time, that is, in units of warps.
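By way of a non-limiting illustration, the relationship between the thread count of a TB and its warp count can be sketched as follows. The function name `warps_per_tb` is hypothetical, and the warp size of 32 is the typical value stated above rather than a fixed property of every GPU.

```python
WARP_SIZE = 32  # typical threads per warp, as stated in the description


def warps_per_tb(threads_per_tb: int, warp_size: int = WARP_SIZE) -> int:
    """Return how many warps are needed to hold the threads of one TB.

    A partially filled warp still counts as a whole warp, so the
    division is rounded up.
    """
    if threads_per_tb < 1:
        raise ValueError("a TB must contain at least one thread")
    return (threads_per_tb + warp_size - 1) // warp_size


# A TB of 64 threads is made up of two warps.
print(warps_per_tb(64))  # 2
```

Rounding up is the reason TB sizes are normally chosen as multiples of the warp size: any remainder still occupies a full warp.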

If a TB consists of a single thread, it still occupies one warp. Because single-thread TBs are not used for anything other than testing, the number of threads in a TB is usually a multiple of 32.

Thus, the throughput of the work (threads) executed on the SMs, the hardware within the GPU, may be strongly affected by the scheduling policy of the TBS 100. The TBS 100 may allocate the TBs included in a plurality of different sub-kernels to the plurality of SMs sequentially, one at a time, in a round-robin manner.
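As a minimal sketch of the round-robin policy described above, and assuming for simplicity that every SM always has room for the next TB, the allocation order can be expressed as follows. The function name `round_robin_assign` and the TB labels are hypothetical.

```python
from itertools import cycle


def round_robin_assign(tbs, num_sms):
    """Hand out TBs to SMs one at a time in a fixed circular order.

    Returns a dict mapping each SM index to the list of TBs it received.
    Resource limits are deliberately ignored in this baseline sketch.
    """
    placement = {sm: [] for sm in range(num_sms)}
    for tb, sm in zip(tbs, cycle(range(num_sms))):
        placement[sm].append(tb)
    return placement


# TBs from two sub-kernels (A and B) spread over two SMs.
print(round_robin_assign(["A0", "A1", "B0", "B1", "B2"], 2))
```

This baseline considers only the order of the SMs, not what each TB needs, which is the gap that the resource-aware allocation of the present invention addresses.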

Here, the TBS 100 provided by the present invention analyzes the resource usage characteristics of the workloads executed on the GPU and places thread blocks corresponding to workloads with different resource usage characteristics on the same SM, thereby achieving efficient GPU task placement and multitasking.

FIG. 2 is a diagram illustrating the SM resources required to process a thread block according to an embodiment of the present invention.

When a TB is allocated to an SM of the GPU, a certain amount of resources is required to process it. The required resources may include at least one of a number of threads, shared memory, and a number of registers.

Because TBs included in the same kernel may be the same size, they may require the same amount of SM resources, and this amount may be determined when the TBs are compiled.

When allocating a specific TB to one of the plurality of SMs, the TBS 100 may first identify the remaining resource amount of each SM. The TBS 100 may then, following a predetermined allocation order, compare the remaining resource amount of a first SM with the SM resource usage of the specific TB, and allocate the specific TB to the first SM if the TB's SM resource usage is smaller than the first SM's remaining resource amount.

Conversely, if the SM resource usage of the specific TB is larger than the remaining resource amount of the first SM, the TBS 100 may skip the first SM, compare the remaining resource amount of a second SM, which is next in the allocation order, with the SM resource usage of the specific TB, and perform the allocation according to the result of the comparison.

Here, if the remaining amount of at least one of the resources constituting the first SM, namely the number of threads, the shared memory, and the number of registers, is smaller than the SM resource usage of the specific TB, the TBS 100 may skip the first SM and determine whether to allocate the specific TB to the second SM, which is next in the allocation order.
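The compare-and-skip procedure above can be sketched as follows. This is an illustrative sketch only: the names `Resources` and `place_tb` and all numeric values are hypothetical, and a TB whose requirement exactly equals the remaining amount is assumed to fit, a case the description does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Resources:
    threads: int
    shared_mem: int  # bytes of shared memory
    registers: int

    def fits_in(self, remaining: "Resources") -> bool:
        # Every resource must fit; a shortage in any one causes a skip.
        return (self.threads <= remaining.threads
                and self.shared_mem <= remaining.shared_mem
                and self.registers <= remaining.registers)


def place_tb(need: Resources, sms: List[Resources]) -> Optional[int]:
    """Walk the SMs in allocation order and place the TB on the first fit.

    Returns the index of the chosen SM, or None if every SM was skipped.
    The chosen SM's remaining resources are reduced in place.
    """
    for index, remaining in enumerate(sms):
        if not need.fits_in(remaining):
            continue  # skip this SM and try the next one in order
        remaining.threads -= need.threads
        remaining.shared_mem -= need.shared_mem
        remaining.registers -= need.registers
        return index
    return None


sms = [Resources(64, 1024, 4096), Resources(256, 16384, 32768)]
tb = Resources(threads=128, shared_mem=2048, registers=8192)
print(place_tb(tb, sms))  # the first SM is skipped, so this prints 1
```

In this sketch the first SM is skipped because it has only 64 free thread slots, even though checking it first follows the allocation order.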

FIG. 3 is a diagram illustrating a first method by which a TBS allocates a specific TB to an SM according to an embodiment of the present invention.

Because the SMs that process TBs have limited resources, it may be important for the TBS 100 to place TBs appropriately within those limits. The TBs processed by the SMs of a GPU may require different SM resources depending on the workload type. For example, referring to FIG. 3, a TB corresponding to Hotspot mainly uses the computing resources of the SM, whereas a TB corresponding to Stream Cluster mainly uses the memory resources of the SM.

Accordingly, the TBS 100 may determine, based on the workload type, whether each TB to be allocated is computing-resource intensive or memory-resource intensive, and may maximize the resource utilization of an SM by allocating TBs that intensively use different kinds of resources together to a single SM.

To determine whether the TBs to be allocated are computing-resource intensive or memory-resource intensive, a preliminary usage test may be performed after the kernel is built. The test results may then be reflected in the plurality of TBs included in the kernel.
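The co-location idea of the first method can be illustrated with the following sketch. The usage percentages are invented purely for illustration and are not measurements from the preliminary usage test; the names `dominant_resource` and `can_share_sm` are likewise hypothetical.

```python
CAPACITY = 100  # each resource class of one SM, expressed in percent

# Hypothetical per-TB usage profiles (percent of one SM), standing in
# for the results of the preliminary usage test.
HOTSPOT = {"compute": 70, "memory": 20}
STREAM_CLUSTER = {"compute": 20, "memory": 70}


def dominant_resource(profile):
    """Classify a TB as compute-intensive or memory-intensive."""
    return "compute" if profile["compute"] >= profile["memory"] else "memory"


def can_share_sm(profiles, capacity=CAPACITY):
    """True if the combined usage stays within the SM for every resource."""
    return all(
        sum(p[kind] for p in profiles) <= capacity
        for kind in ("compute", "memory")
    )


print(dominant_resource(HOTSPOT), dominant_resource(STREAM_CLUSTER))
print(can_share_sm([HOTSPOT, STREAM_CLUSTER]))  # complementary TBs
print(can_share_sm([HOTSPOT, HOTSPOT]))         # both compute-intensive
```

Under these assumed numbers, two TBs with complementary profiles fit on one SM, whereas two TBs that stress the same resource class do not.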

FIG. 4 is a diagram illustrating a second method by which a TBS allocates a specific TB to an SM according to an embodiment of the present invention.

FIG. 3 presented a method of maximizing the resource utilization of an SM by determining whether each TB to be allocated is computing-resource intensive or memory-resource intensive and, according to the result, allocating TBs that intensively use different kinds of resources together to a single SM.

Meanwhile, computing-resource-intensive workloads may require different SM resources depending on the type of function unit. Depending on the type of function unit, the computing resources may include at least one of a single-precision (Single Floating Point, SFP) arithmetic unit, a double-precision (Double Floating Point, DFP) arithmetic unit, a control-flow (Control-Flow, CF) unit, a load/store (Load/Store, LSDT) unit, and a special-function (Special Function Unit, SFU) arithmetic unit.

For example, referring to FIG. 4, a TB corresponding to Back Propagation mainly uses the double-precision arithmetic unit among the computing resources of the SM, whereas a TB corresponding to Nbody mainly uses the single-precision arithmetic unit.

Accordingly, the TBS 100 may determine, based on the workload type, which of the plurality of computing resources each TB to be allocated uses intensively, and may maximize the resource utilization of an SM by allocating TBs that intensively use different kinds of computing resources together to a single SM.

To determine which of the computing resources the TBs to be allocated use intensively, a preliminary usage test may be performed after the kernel is built. The test results may then be reflected in the plurality of TBs included in the kernel.
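The second method can be sketched by tracking the remaining amount of each function unit separately. The unit abbreviations follow the description (SFP, DFP, CF, LSDT, SFU), but the function name `try_allocate` and every usage figure below are hypothetical values chosen only to illustrate the idea.

```python
UNITS = ("SFP", "DFP", "CF", "LSDT", "SFU")


def try_allocate(need, remaining):
    """Place a TB on an SM if no function unit would be over-committed.

    `need` maps unit names to the amount the TB uses (missing means 0).
    `remaining` maps every unit name to what the SM still has free and
    is reduced in place on success. Returns True if the TB was placed.
    """
    if any(need.get(unit, 0) > remaining[unit] for unit in UNITS):
        return False
    for unit in UNITS:
        remaining[unit] -= need.get(unit, 0)
    return True


# Assumed profiles: Back Propagation leans on DFP, Nbody on SFP.
BACK_PROPAGATION = {"DFP": 60, "LSDT": 20}
NBODY = {"SFP": 60, "LSDT": 20}

sm = {unit: 100 for unit in UNITS}
print(try_allocate(BACK_PROPAGATION, sm))  # placed
print(try_allocate(NBODY, sm))             # placed alongside, different unit
print(try_allocate(BACK_PROPAGATION, sm))  # rejected, DFP is exhausted
```

With these assumed figures, the DFP-heavy and SFP-heavy TBs share the SM, while a second DFP-heavy TB would exceed the remaining double-precision capacity.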

FIG. 5 is a flowchart illustrating a GPGPU thread block scheduling method according to an embodiment of the present invention.

In step 510, the TBS 100 may identify the remaining resource amount of the computing resources and memory resources of each of the plurality of SMs constituting the GPU. The computing resources are units for processing a plurality of different function units and may include at least one of a single-precision arithmetic unit, a double-precision arithmetic unit, a control-flow unit, a load/store unit, and a special-function arithmetic unit. The memory resources may include at least one of a register file, a shared memory, and a unified cache.

In step 520, the TBS 100 may identify, for each of the plurality of TBs to be processed through the GPU, the resource usage of the GPU according to the workload type.

In step 530, the TBS 100 may allocate each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amount of each of the SMs and the GPU resource usage of each of the TBs.

Here, when the computing resources and memory resources required by two or more TBs to be allocated to a specific SM do not exceed the remaining resource amount of the specific SM, the TBS 100 may allocate the TBs to the specific SM; when they exceed the remaining resource amount of the specific SM, the TBS 100 may skip the specific SM and determine whether to allocate to the SM that is next in the allocation order.
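Steps 510 to 530 can be combined into one sketch that visits SMs in round-robin order and skips any SM whose remaining computing or memory resources are insufficient. The function name `schedule`, the resource keys, and the numeric values are hypothetical, and returning None for a TB that fits nowhere is an assumption of this sketch rather than behavior stated in the description.

```python
def schedule(tbs, sms):
    """Allocate TBs to SMs in round-robin order, skipping SMs that are full.

    tbs: list of (name, need) pairs, where need maps resource names to
         the amount the TB uses (step 520).
    sms: list of dicts holding each SM's remaining resources (step 510);
         they are reduced in place as TBs are placed (step 530).
    Returns a dict mapping each TB name to an SM index, or None if unplaced.
    """
    placement = {}
    cursor = 0  # SM at which the next search starts
    for name, need in tbs:
        placement[name] = None
        for step in range(len(sms)):
            index = (cursor + step) % len(sms)
            remaining = sms[index]
            if all(need.get(key, 0) <= remaining[key] for key in remaining):
                for key in remaining:
                    remaining[key] -= need.get(key, 0)
                placement[name] = index
                cursor = (index + 1) % len(sms)
                break  # placed; otherwise the loop skips to the next SM
    return placement


def fresh_sm():
    # Computing resources (SFP, DFP) and memory resources, in percent.
    return {"SFP": 100, "DFP": 100, "regfile": 100, "shared": 100, "cache": 100}


tbs = [
    ("bp0", {"DFP": 60, "regfile": 30}),
    ("bp1", {"DFP": 60, "regfile": 30}),
    ("nbody0", {"SFP": 60, "regfile": 30}),
    ("bp2", {"DFP": 60, "regfile": 30}),
]
print(schedule(tbs, [fresh_sm(), fresh_sm()]))
```

In this run the SFP-heavy TB joins a DFP-heavy TB on the first SM, while the third DFP-heavy TB finds both SMs short of double-precision capacity and remains unplaced.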

Meanwhile, the method according to the present invention may be written as a computer-executable program and implemented on various recording media such as magnetic storage media, optical reading media, and digital storage media.

Implementations of the various techniques described herein may be realized in digital electronic circuitry, or in computer hardware, firmware, software, or combinations thereof. Implementations may be realized as a computer program product, that is, a computer program tangibly embodied in an information carrier, for example a machine-readable storage device (computer-readable medium) or a propagated signal, for processing by, or to control the operation of, a data processing apparatus, for example a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be processed on one computer or on multiple computers at one site, or distributed across multiple sites and interconnected by a communication network.

Processors suitable for processing a computer program include, by way of example, both general-purpose and special-purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor receives instructions and data from a read-only memory, a random access memory, or both. The elements of a computer may include at least one processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer may also include, or be coupled to receive data from, transmit data to, or both, one or more mass storage devices for storing data, for example magnetic disks, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include, by way of example, semiconductor memory devices; magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), and the like. The processor and the memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

In addition, a computer-readable medium may be any available medium that can be accessed by a computer, and may include both computer storage media and transmission media.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Likewise, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Further, the separation of the various device components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and devices may generally be integrated together into a single software product or packaged into multiple software products.

Meanwhile, the embodiments of the present invention disclosed in this specification and the drawings merely present specific examples to aid understanding and are not intended to limit the scope of the present invention. It will be apparent to those of ordinary skill in the art to which the present invention pertains that other modifications based on the technical spirit of the present invention can be implemented in addition to the embodiments disclosed herein.

100: Thread Block Scheduler (TBS)

Claims

1. A thread block scheduling method performed by a thread block scheduler, the method comprising:
identifying, for each of a plurality of stream multiprocessors (hereinafter referred to as SMs) constituting a GPU, a remaining resource amount of computing resources for processing a plurality of different function units;
identifying, for each of a plurality of thread blocks (hereinafter referred to as TBs) to be processed through the GPU, resource usage of the computing resources according to a workload type; and
allocating each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amount of the computing resources for each of the SMs and the resource usage of the computing resources for each of the plurality of TBs.

2. The method of claim 1,
wherein the computing resources comprise at least one of a single-precision arithmetic unit, a double-precision arithmetic unit, a control flow unit, a load/store unit, and a special function arithmetic unit.

3. The method of claim 1,
wherein the allocating comprises, when two or more TBs to be allocated to a specific SM use different computing resources, allocating the two or more TBs to the same SM within a limit that does not exceed the remaining resource amount of the computing resources of the specific SM.

4. The method of claim 1,
wherein the allocating comprises sequentially allocating the plurality of TBs, one by one, to the plurality of SMs in a round-robin manner.
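The allocation recited in the four method claims above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation disclosed in the specification: the unit names (`sp`, `dp`, `cf`, `ldst`, `sfu`), the integer capacity scale, the `SM` class, and the `schedule` function are all hypothetical, and the handling of a TB that fits on no SM (it is simply returned as unplaced) is an assumption, since the claims do not specify it.

```python
from dataclasses import dataclass, field

# Hypothetical short names for the five function-unit types named in the claims:
# single-precision, double-precision, control flow, load/store, special function.
UNITS = ("sp", "dp", "cf", "ldst", "sfu")


@dataclass
class SM:
    remaining: dict                      # unit name -> remaining capacity (assumed scale)
    assigned: list = field(default_factory=list)


def new_sm(capacity=100):
    """Create an SM whose units all start with the same assumed capacity."""
    return SM(remaining={u: capacity for u in UNITS})


def fits(sm, usage):
    """True if every unit the TB uses still has enough remaining capacity."""
    return all(sm.remaining[u] >= amount for u, amount in usage.items())


def schedule(sms, tbs):
    """Assign (name, usage) TBs to SMs round-robin; return names left unplaced."""
    unplaced = []
    cursor = 0  # index of the SM to try first for the next TB
    for name, usage in tbs:
        for step in range(len(sms)):
            idx = (cursor + step) % len(sms)
            if fits(sms[idx], usage):
                for unit, amount in usage.items():
                    sms[idx].remaining[unit] -= amount
                sms[idx].assigned.append(name)
                cursor = (idx + 1) % len(sms)
                break
        else:
            unplaced.append(name)  # assumption: no SM had enough capacity left
    return unplaced
```

In this sketch, two TBs that stress different units (for example one using `sp` and one using `dp`) can land on the same SM, whereas a second TB that needs the same nearly exhausted unit is passed on to the next SM in the rotation.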

5. A thread block scheduler comprising a processor,
wherein the processor is configured to:
identify, for each of a plurality of stream multiprocessors (hereinafter referred to as SMs) constituting a GPU, a remaining resource amount of computing resources for processing a plurality of different function units;
identify, for each of a plurality of thread blocks (hereinafter referred to as TBs) to be processed through the GPU, resource usage of the computing resources according to a workload type; and
allocate each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amount of the computing resources for each of the SMs and the resource usage of the computing resources for each of the plurality of TBs.

6. The thread block scheduler of claim 5,
wherein the computing resources comprise at least one of a single-precision arithmetic unit, a double-precision arithmetic unit, a control flow unit, a load/store unit, and a special function arithmetic unit.

7. The thread block scheduler of claim 5,
wherein the processor is configured to, when two or more TBs to be allocated to a specific SM use different computing resources, allocate the two or more TBs to the same SM within a limit that does not exceed the remaining resource amount of the computing resources of the specific SM.

8. The thread block scheduler of claim 5,
wherein the processor is configured to sequentially allocate the plurality of TBs, one by one, to the plurality of SMs in a round-robin manner.

9. A thread block scheduling method performed by a thread block scheduler, the method comprising:
identifying, for each of a plurality of stream multiprocessors (hereinafter referred to as SMs) constituting a GPU, remaining resource amounts of computing resources and memory resources;
identifying, for each of a plurality of thread blocks (hereinafter referred to as TBs) to be processed through the GPU, resource usage of the GPU according to a workload type; and
allocating each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amounts for each of the SMs and the resource usage of the GPU for each of the plurality of TBs,
wherein the computing resources consist of devices for processing a plurality of different function units, and
wherein the memory resources consist of a register file, a shared memory, and a unified cache.

10. The method of claim 9,
wherein the computing resources comprise at least one of a single-precision arithmetic unit, a double-precision arithmetic unit, a control flow unit, a load/store unit, and a special function arithmetic unit.

11. The method of claim 9,
wherein the allocating comprises allocating two or more TBs to a specific SM when the computing resources and memory resources required by the two or more TBs do not exceed the remaining resource amounts of the specific SM, and skipping the specific SM when they exceed the remaining resource amounts of the specific SM.

12. The method of claim 9,
wherein the allocating comprises sequentially allocating the plurality of TBs, one by one, to the plurality of SMs in a round-robin manner.
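The variant recited in the preceding four method claims, which tracks memory resources alongside computing resources and skips an SM that cannot hold a TB, can be sketched as follows. Again this is only a hedged illustration: the resource keys (`regfile`, `shared`, `cache` and the compute unit names), the capacity values, and the functions `make_sm` and `place_round_robin` are hypothetical names chosen for the example, and recording `None` for a TB that no SM can hold is an assumption not stated in the claims.

```python
# Hypothetical resource keys: five compute units plus the three memory resources
# named in the claims (register file, shared memory, unified cache).
COMPUTE = ("sp", "dp", "cf", "ldst", "sfu")
MEMORY = ("regfile", "shared", "cache")


def make_sm(compute_cap, memory_cap):
    """Return a dict of remaining amounts for one SM (assumed uniform capacities)."""
    sm = {key: compute_cap for key in COMPUTE}
    sm.update({key: memory_cap for key in MEMORY})
    return sm


def place_round_robin(sms, tbs):
    """Map each TB id to an SM index, skipping SMs whose remaining resources
    are insufficient; a TB that fits nowhere maps to None (assumption)."""
    placement = {}
    nxt = 0  # SM index at which the next round-robin probe starts
    for tb_id, demand in tbs.items():
        placement[tb_id] = None
        for offset in range(len(sms)):
            i = (nxt + offset) % len(sms)
            # Check compute and memory demands against the same remaining table.
            if all(sms[i][res] >= need for res, need in demand.items()):
                for res, need in demand.items():
                    sms[i][res] -= need
                placement[tb_id] = i
                nxt = (i + 1) % len(sms)
                break
            # Otherwise this SM is skipped and the next one is probed.
    return placement
```

Keeping compute and memory amounts in one table per SM means a single comparison loop covers both kinds of resource; a real scheduler would more likely track them in separate hardware counters.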

13. A thread block scheduler comprising a processor,
wherein the processor is configured to:
identify, for each of a plurality of stream multiprocessors (hereinafter referred to as SMs) constituting a GPU, remaining resource amounts of computing resources and memory resources;
identify, for each of a plurality of thread blocks (hereinafter referred to as TBs) to be processed through the GPU, resource usage of the GPU required according to a workload type; and
allocate each of the plurality of TBs to any one of the plurality of SMs based on the identified remaining resource amounts of each of the plurality of SMs and the resource usage of the GPU for each of the plurality of TBs,
wherein the computing resources consist of devices for processing a plurality of different function units, and
wherein the memory resources consist of a register file, a shared memory, and a unified cache.

14. The thread block scheduler of claim 13,
wherein the computing resources comprise at least one of a single-precision arithmetic unit, a double-precision arithmetic unit, a control flow unit, a load/store unit, and a special function arithmetic unit.

15. The thread block scheduler of claim 13,
wherein the processor is configured to allocate two or more TBs to a specific SM when the computing resources and memory resources required by the two or more TBs do not exceed the remaining resource amounts of the specific SM, and to skip the specific SM when they exceed the remaining resource amounts of the specific SM.

16. The thread block scheduler of claim 13,
wherein the processor is configured to sequentially allocate the plurality of TBs, one by one, to the plurality of SMs in a round-robin manner.