KR20220128903A

KR20220128903A - Method and apparatus for multi-tasking efficiency

Info

Publication number: KR20220128903A
Application number: KR1020210033642A
Authority: KR
Inventors: 김윤희; 김세진
Original assignee: 숙명여자대학교산학협력단
Priority date: 2021-03-15
Filing date: 2021-03-15
Publication date: 2022-09-22
Also published as: KR102484563B1

Abstract

The present invention relates to a method and an apparatus for multitasking efficiency. A method for multitasking efficiency according to an embodiment of the present invention comprises the steps of: receiving a plurality of kernels as an input; sorting the plurality of kernels using category information including information on categories of the kernels; generating a concurrent execution group list including a plurality of groups obtained by grouping at least one kernel to be concurrently executed among the plurality of kernels, by using a K-Scheduler algorithm based on the sorted plurality of kernels and kernel information including the category information, kernel execution characteristic information, and profiling information; and for each group included in the concurrent execution group list, transmitting at least one grouped kernel to a GPU so as to perform multitasking. The present invention can optimize GPU performance when the kernels are concurrently executed.

Description

Multi-task efficiency method and apparatus {METHOD AND APPARATUS FOR MULTI-TASKING EFFICIENCY}

The present specification relates to a method and apparatus for improving multitasking efficiency.

GPU-based infrastructure services have recently begun to be offered in data center and cloud environments. Real-world GPGPU (General-Purpose GPU) applications, unlike GPU-friendly applications, show low GPU utilization. To raise the resource utilization of expensive GPUs, there is therefore growing demand for multitasking, in which different applications run concurrently while sharing the resources inside a streaming multiprocessor (SM).

However, when GPU resources are shared, contention for the limited resources can occur, which makes performance under multitasking hard to predict. In addition, each kernel, the unit of execution of an application run on the GPU, has its own GPU resource usage and its own runtime behavior. Running kernels concurrently can therefore yield lower performance than running each kernel alone, so selecting the optimal combination of kernels is critical to improving performance.

As the number of applications grows, however, selecting such a kernel combination becomes harder. Because of this difficulty, multitasking performance is not optimized and the resources inside the GPU are not used efficiently.

Moreover, conventional resource allocation methods predict the performance of concurrent execution solely from the results of running each application alone. They do not account for the degree of resource contention that can arise when applications run concurrently, so theoretical performance differs from actual performance.

An object of the present specification is to provide a multitasking efficiency method and apparatus that can optimize GPU performance during concurrent kernel execution by classifying kernels according to their category information.

Another object of the present specification is to provide a multitasking efficiency method and apparatus that can minimize resource contention by generating a concurrent execution group list using a K-Scheduler algorithm.

Another object of the present specification is to provide a multitasking efficiency method and apparatus that achieves high utilization of GPU resources through scheduling rules, improving not only the performance of the GPU as a whole but also the expected performance of each individual kernel.

Another object of the present specification is to provide a multitasking efficiency method and apparatus that can maximize actual performance by predicting the degree of resource contention that can arise during concurrent kernel execution.

The objects of the present specification are not limited to those mentioned above. Other objects and advantages not mentioned here can be understood from the following description and will be understood more clearly from the embodiments of the present specification. It will also be readily apparent that the objects and advantages of the present specification can be realized by the means indicated in the claims and combinations thereof.

A multitasking efficiency method according to an embodiment of the present specification includes: receiving a plurality of kernels as input; sorting the plurality of kernels using category information, which includes information on the categories of the kernels; generating a concurrent execution group list, which includes a plurality of groups each grouping at least one kernel to be executed concurrently among the plurality of kernels, using a K-Scheduler algorithm based on the sorted plurality of kernels and on kernel information including the category information, kernel execution characteristic information, and profiling information; and, for each group in the concurrent execution group list, transmitting the at least one grouped kernel to a GPU to perform multitasking.

In an embodiment of the present specification, the category information of a kernel includes a compute-intensive kernel, a memory-intensive kernel, and an L1 cache-intensive kernel.

In an embodiment of the present specification, sorting the plurality of kernels using the kernel information includes sorting the plurality of kernels in the order of L1 cache-intensive kernels, memory-intensive kernels, and compute-intensive kernels and, among kernels with the same category information, placing kernels with longer execution times first.
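The sort order just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Kernel` class, its field names, the category labels, and the sample runtimes are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass

# Category priority from the text: L1 cache-intensive first, then
# memory-intensive, then compute-intensive. The string labels are assumptions.
CATEGORY_ORDER = {"l1_cache": 0, "memory": 1, "compute": 2}


@dataclass
class Kernel:
    name: str          # hypothetical label
    category: str      # one of the keys of CATEGORY_ORDER
    runtime_ms: float  # standalone execution time taken from profiling


def sort_kernels(kernels):
    # Primary key: category priority. Tie-break: longer runtime first.
    return sorted(kernels, key=lambda k: (CATEGORY_ORDER[k.category], -k.runtime_ms))


kernels = [
    Kernel("A", "compute", 120.0),
    Kernel("B", "l1_cache", 40.0),
    Kernel("C", "memory", 75.0),
    Kernel("D", "memory", 90.0),
]
ordered = [k.name for k in sort_kernels(kernels)]
print(ordered)  # ['B', 'D', 'C', 'A']
```

Negating the runtime inside the key keeps the whole ordering in a single `sorted` call, which is stable for kernels that tie on both keys.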

In an embodiment of the present specification, the execution characteristic information includes: a first characteristic, that for a kernel whose EPC (Eligible Warp Per Cycle) is less than 1, allocating a large amount of resources to the kernel does not improve the processing speed of the GPU; a second characteristic, that the processing speed of the GPU does not improve when the kernels' cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) exceed what the GPU supplies; a third characteristic, that the processing speed of the GPU does not improve when an L1 cache-intensive kernel is executed concurrently with another kernel that exceeds a preset number of L1 cache transactions; a fourth characteristic, that the processing speed of the GPU does not improve when two or more memory-intensive kernels are executed concurrently; and a fifth characteristic, that the processing speed of the GPU does not improve when two or more compute-intensive kernels whose EPC is greater than a preset threshold (base threshold) are executed concurrently.

In an embodiment of the present specification, the profiling information includes static profiling information, which is information on the amount of GPU resources a kernel uses, and dynamic profiling information, which covers the kernel's runtime information and the causes of its execution stalls.

In an embodiment of the present specification, the K-Scheduler algorithm includes scheduling rules based on the execution characteristic information.

In an embodiment of the present specification, the scheduling rules include at least one of static scheduling rules and dynamic scheduling rules. The static scheduling rules include a first rule, that a kernel cannot use more than the amount of resources the GPU provides, and a second rule, that the kernels' cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) cannot exceed what the GPU supplies. The dynamic scheduling rules include a third rule, that an L1 cache-intensive kernel cannot be multitasked with another kernel that exceeds a preset number of L1 cache transactions; a fourth rule, that two or more memory-intensive kernels cannot be multitasked together; and a fifth rule, that two or more compute-intensive kernels whose EPC (Eligible Warp Per Cycle) is greater than a preset threshold (base threshold) cannot be multitasked together.
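One way to read the second to fifth rules is as a single admission check on a candidate group. The Python sketch below is an interpretation, not the specification's algorithm: `KernelInfo`, `Limits`, `can_co_run`, and all field names are hypothetical; the L1 transaction limit and EPC threshold are made-up values; and the first rule is omitted because it depends on per-SM resource accounting that this passage does not detail. The two GPU capacities reuse the 547.6 GB/s and 379.7 GFLOPS figures quoted in the FIG. 3 discussion.

```python
from dataclasses import dataclass


@dataclass
class KernelInfo:
    category: str           # "l1_cache", "memory", or "compute" (assumed labels)
    bandwidth_gbs: float    # memory bandwidth the kernel uses
    gflops: float           # computation rate the kernel uses
    l1_transactions: float  # L1 cache transaction count from profiling
    epc: float              # eligible warps per cycle


@dataclass
class Limits:
    gpu_bandwidth_gbs: float
    gpu_gflops: float
    l1_transaction_limit: float  # hypothetical preset value
    epc_base_threshold: float    # hypothetical preset value


def can_co_run(group, candidate, limits):
    members = group + [candidate]
    # Rule 2 (static): cumulative bandwidth and FLOPS must fit the GPU.
    if sum(k.bandwidth_gbs for k in members) > limits.gpu_bandwidth_gbs:
        return False
    if sum(k.gflops for k in members) > limits.gpu_gflops:
        return False
    # Rule 3 (dynamic): an L1 cache-intensive kernel may not share a group
    # with another kernel whose L1 transactions exceed the preset limit.
    for i, a in enumerate(members):
        if a.category != "l1_cache":
            continue
        for j, b in enumerate(members):
            if i != j and b.l1_transactions > limits.l1_transaction_limit:
                return False
    # Rule 4 (dynamic): at most one memory-intensive kernel per group.
    if sum(k.category == "memory" for k in members) >= 2:
        return False
    # Rule 5 (dynamic): at most one compute-intensive kernel above the EPC threshold.
    high_epc = sum(
        k.category == "compute" and k.epc > limits.epc_base_threshold for k in members
    )
    return high_epc < 2


limits = Limits(547.6, 379.7, 1000.0, 2.0)
mem_a = KernelInfo("memory", 266.0, 10.0, 50.0, 0.5)
mem_b = KernelInfo("memory", 200.0, 10.0, 50.0, 0.6)
cmp_a = KernelInfo("compute", 10.0, 100.0, 20.0, 3.0)
cmp_b = KernelInfo("compute", 10.0, 100.0, 20.0, 1.5)

print(can_co_run([mem_a], mem_b, limits))  # False: rule 4
print(can_co_run([cmp_a], cmp_b, limits))  # True: only one kernel exceeds the EPC threshold
```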

A multitasking efficiency apparatus according to an embodiment of the present specification includes one or more processors that execute instructions. The processor receives a plurality of kernels as input; sorts the plurality of kernels using category information, which includes information on the categories of the kernels; generates a concurrent execution group list, which includes a plurality of groups each grouping at least one kernel to be executed concurrently among the plurality of kernels, using a K-Scheduler algorithm based on the sorted plurality of kernels and on kernel information including the category information, kernel execution characteristic information, and profiling information; and, for each group in the concurrent execution group list, transmits the at least one grouped kernel to a GPU to perform multitasking.

In an embodiment of the present specification, the processor sorts the plurality of kernels in the order of L1 cache-intensive kernels, memory-intensive kernels, and compute-intensive kernels and, among kernels with the same category information, places kernels with longer execution times first.

In an embodiment of the present specification, the execution characteristic information includes: a first characteristic, that for a kernel whose EPC (Eligible Warp Per Cycle) is less than 1, allocating a large amount of resources to the kernel does not improve the processing speed of the GPU; a second characteristic, that the processing speed of the GPU does not improve when the kernels' cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) exceed what the GPU supplies; a third characteristic, that the processing speed of the GPU does not improve when an L1 cache-intensive kernel is executed concurrently with another kernel that exceeds a preset number of L1 cache transactions; a fourth characteristic, that the processing speed of the GPU does not improve when two or more memory-intensive kernels are executed concurrently; and a fifth characteristic, that the processing speed of the GPU does not improve when two or more compute-intensive kernels whose EPC is greater than a preset threshold (base threshold) are executed concurrently.

In an embodiment of the present specification, the profiling information includes static profiling information, which is information on the amount of GPU resources a kernel uses, and dynamic profiling information, which covers the kernel's runtime information and the causes of its execution stalls.

In an embodiment of the present specification, the scheduling rules include at least one of static scheduling rules and dynamic scheduling rules. The static scheduling rules include a first rule, that a kernel cannot use more than the amount of resources the GPU provides, and a second rule, that the kernels' cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) cannot exceed what the GPU supplies. The dynamic scheduling rules include a third rule, that an L1 cache-intensive kernel cannot be multitasked with another kernel that exceeds a preset number of L1 cache transactions; a fourth rule, that two or more memory-intensive kernels cannot be multitasked together; and a fifth rule, that two or more compute-intensive kernels whose EPC (Eligible Warp Per Cycle) is greater than a preset threshold (base threshold) cannot be multitasked together.

The multitasking efficiency method and apparatus according to an embodiment of the present specification can optimize GPU performance during concurrent kernel execution by classifying kernels according to their category information.

In addition, the multitasking efficiency method and apparatus according to an embodiment of the present specification can minimize resource contention by generating a concurrent execution group list using the K-Scheduler algorithm.

In addition, the multitasking efficiency apparatus according to an embodiment of the present specification can achieve high utilization of GPU resources through the scheduling rules, improving not only the performance of the GPU as a whole but also the expected performance of each individual kernel.

In addition, the multitasking efficiency method and apparatus according to an embodiment of the present specification can maximize actual performance by predicting the degree of resource contention that can arise during concurrent kernel execution.

FIG. 1 is a block diagram of a K-Scheduler-based system (1) including a multitasking efficiency apparatus according to an embodiment of the present specification.
FIGS. 2 to 6 are diagrams illustrating the process of deriving kernel execution characteristic information in an embodiment of the present specification.
FIG. 7 is a diagram illustrating the K-Scheduler algorithm in an embodiment of the present specification.
FIG. 8 is a diagram illustrating an algorithm for evaluating the third to fifth rules.
FIG. 9 is a diagram illustrating how the multitasking efficiency apparatus generates the concurrent execution group list in an embodiment of the present specification.
FIG. 10 is a flowchart of a multitasking efficiency method according to an embodiment of the present specification.

The objects, features, and advantages described above are explained in detail below with reference to the accompanying drawings, so that a person of ordinary skill in the art to which this specification belongs can readily practice its technical idea. Where a detailed description of known technology related to this specification would unnecessarily obscure its subject matter, that description is omitted. Preferred embodiments according to the present specification are described in detail below with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar components.


FIG. 1 is a block diagram of a K-Scheduler-based system (1) including a multitasking efficiency apparatus according to an embodiment of the present specification. In an embodiment of the present specification, the K-Scheduler-based system includes a profiler (100), a kernel classifier (200), a multitasking efficiency apparatus (300), and a GPU (400).

When a plurality of kernels (10) are input to the system (1), the profiler (100) analyzes the input kernels and generates profiling information. The profiler (100) is a tool for analyzing the performance of a kernel or application; its analysis reveals where in a kernel performance degradation occurs.

Specifically, the profiler performs both static and dynamic profiling to analyze where in a kernel execution stalls occur. The resulting profiling information may include static profiling information obtained from a static profiler (e.g., the NVIDIA CUDA Compiler) and dynamic profiling information obtained from a dynamic profiler (e.g., NVProf). The static profiling information, which concerns the amount of GPU resources, may include the size of the registers in an SM, the size of shared memory (L1 cache), and the number of thread blocks (TBs). The dynamic profiling information may include the kernel's runtime information and information on the causes of its execution stalls. This profiling information therefore makes it possible to determine where in a kernel execution stalls occur.

The kernel classifier (200) obtains the profiling information generated by the profiler (100) and, based on it, classifies each input kernel (for example, K1) according to the category information (210).

Here, the category information (210) may include a compute-intensive kernel, a memory-intensive kernel, and an L1 cache-intensive kernel, and these categories may be defined by the cause of a kernel's execution stalls. That is, a kernel's category is determined by where its execution stalls originate.

If the execution stall information in the profiling information of the input kernel (K1) points to the cache, the kernel classifier (200) classifies K1 as an L1 cache-intensive kernel according to the category information (210). If it points to memory, K1 is classified as a memory-intensive kernel, and if it points to computation, K1 is classified as a compute-intensive kernel. As described later, this classification lets the multitasking efficiency apparatus (300) sort the plurality of kernels easily, which makes efficient multitasking possible.
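The classification step can be illustrated with a small Python sketch. It assumes the profiler reports a per-cause stall breakdown and that the category follows the dominant cause; both the breakdown format and the "dominant cause" reading are assumptions, as are all names and numbers below.

```python
# Mapping from stall cause to category, following the text.
# The string labels are assumptions made for this sketch.
STALL_TO_CATEGORY = {
    "cache": "l1_cache_intensive",
    "memory": "memory_intensive",
    "compute": "compute_intensive",
}


def classify_kernel(stall_counts):
    """Return a category from a dict of stall cause -> hypothetical stall count."""
    dominant = max(stall_counts, key=stall_counts.get)
    return STALL_TO_CATEGORY[dominant]


# Hypothetical stall breakdown for kernel K1.
k1_stalls = {"cache": 120, "memory": 640, "compute": 90}
category = classify_kernel(k1_stalls)
print(category)  # memory_intensive
```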

The multitasking efficiency apparatus (300) sorts the plurality of kernels classified by the kernel classifier (200) and uses the K-Scheduler algorithm to generate a concurrent execution group list for multitasking. The generated concurrent execution group list is then transmitted to the GPU (400), where multitasking is performed.

Specifically, the multitasking efficiency apparatus (300) generates the concurrent execution group list based on the category information (210), the kernel execution characteristic information (220), and the scheduling rules (230) included in the K-Scheduler algorithm. How the multitasking efficiency apparatus (300) generates the concurrent execution group list from the input kernels is described in detail later.
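To make the idea of a concurrent execution group list concrete, the sketch below builds one with a simple first-fit pass over already-sorted kernels. This is not the K-Scheduler algorithm of FIG. 7, which this passage does not reproduce; it is a generic greedy grouping chosen for illustration. The function names, the tuple representation of a kernel, and the example predicate (which enforces only the "no two memory-intensive kernels" rule) are all assumptions.

```python
def build_groups(sorted_kernels, can_join):
    """First-fit grouping: place each kernel in the first group that accepts it."""
    groups = []
    for kernel in sorted_kernels:
        for group in groups:
            if can_join(group, kernel):
                group.append(kernel)
                break
        else:
            # No existing group accepts the kernel, so it starts a new one.
            groups.append([kernel])
    return groups


def no_two_memory_kernels(group, kernel):
    # Illustrative predicate only: reject a second memory-intensive kernel.
    _, category = kernel
    return not (category == "memory" and any(c == "memory" for _, c in group))


# Hypothetical kernels as (name, category), already in the sorted order.
sorted_kernels = [
    ("L1", "l1_cache"),
    ("M1", "memory"),
    ("M2", "memory"),
    ("C1", "compute"),
]
groups = build_groups(sorted_kernels, no_two_memory_kernels)
print([[name for name, _ in g] for g in groups])  # [['L1', 'M1', 'C1'], ['M2']]
```

Passing the admission test as a function keeps the grouping loop independent of which scheduling rules are enforced.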

FIGS. 2 to 6 are diagrams illustrating the process of deriving kernel execution characteristic information in an embodiment of the present specification. The kernel execution characteristic information derived with reference to FIGS. 2 to 6, and the scheduling rules derived from it, are described below.

The kernel execution characteristic information may be derived from the characteristics observed when each kernel runs individually and when multiple kernels run concurrently, and may comprise five characteristics.

Specifically, the kernel execution characteristics may include: a first characteristic, that for a kernel whose EPC (Eligible Warp Per Cycle), the number of warps that execute an instruction per cycle, is less than 1, allocating a large amount of resources to the kernel does not improve the processing speed of the GPU; a second characteristic, that the processing speed of the GPU does not improve when the kernels' cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) exceed what the GPU supplies; a third characteristic, that the processing speed of the GPU does not improve when an L1 cache-intensive kernel is executed concurrently with another kernel that exceeds a preset number of L1 cache transactions; a fourth characteristic, that the processing speed of the GPU does not improve when two or more memory-intensive kernels are executed concurrently; and a fifth characteristic, that the processing speed of the GPU does not improve when two or more compute-intensive kernels whose EPC is greater than a preset threshold (base threshold) are executed concurrently.

Regarding the first characteristic, referring to FIG. 2, (a) and (b) of FIG. 2 are both compute-intensive kernels. However, the kernel in (a) has an EPC of 0.17, which is less than 1, while the kernel in (b) has an EPC of 5.67, which is greater than 1.

Accordingly, in (a), when resources of 30SM-1TB, 30SM-2TB, and 30SM-4TB are allocated, the kernel execution times are 64 ms, 58 ms, and 58 ms, respectively. In (b), by contrast, the same allocations give kernel execution times of 237 ms, 123.974 ms, and 69 ms, respectively, showing that performance improves.

That is, a kernel with a small EPC gains no performance even when the GPU allocates it many static resources (TBs), because the smaller the EPC, the more execution stalls occur. The first characteristic is derived from this result.
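The contrast between the two kernels becomes clearer when the quoted execution times are converted to speedups. The short Python calculation below uses only the figures given for FIG. 2; the helper name and the dictionary layout are illustrative.

```python
# Execution times (ms) quoted for allocations of 1, 2, and 4 TBs on 30 SMs.
times_ms = {
    "a (EPC 0.17)": [64.0, 58.0, 58.0],
    "b (EPC 5.67)": [237.0, 123.974, 69.0],
}


def speedup(times):
    # Speedup of the largest allocation relative to the smallest one.
    return times[0] / times[-1]


for label, times in times_ms.items():
    print(f"{label}: {speedup(times):.2f}x from 1 TB to 4 TBs")
```

Going from 1 TB to 4 TBs speeds up the low-EPC kernel by roughly 1.10x, against roughly 3.43x for the high-EPC kernel.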

Regarding the second characteristic, referring to FIG. 3, each GPU has a fixed available memory bandwidth and computation capacity (FLOPS; FLoating point Operations Per Second). For example, in (c) of FIG. 3 the memory bandwidth of the GPU is 547.6 GB/s. A single kernel uses 266 GB/s, so running two kernels stays within 547.6 GB/s and kernel execution time is not delayed, whereas running three kernels exceeds 547.6 GB/s and kernel execution time increases sharply (by 50 ms or more).

Likewise, in (d) of FIG. 3 the available computation capacity of the GPU is 379.7 GFLOPS and each kernel uses 184.4948 GFLOPS, so running three kernels concurrently, which exceeds 379.7 GFLOPS, sharply increases kernel execution time (by 100 ms or more). The second characteristic is derived from this result.
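The arithmetic behind both FIG. 3 examples is a simple budget check, sketched below in Python. The function name is hypothetical, and the sketch assumes every concurrent instance uses the same amount as a single kernel, as in the examples.

```python
def max_concurrent(per_kernel_usage, gpu_supply):
    """Largest number of identical kernels whose summed usage fits the GPU supply."""
    return int(gpu_supply // per_kernel_usage)


# Memory bandwidth example from FIG. 3 (c): 266 GB/s per kernel, 547.6 GB/s available.
bandwidth_limit = max_concurrent(266.0, 547.6)
# Computation example from FIG. 3 (d): 184.4948 GFLOPS per kernel, 379.7 GFLOPS available.
flops_limit = max_concurrent(184.4948, 379.7)

print(bandwidth_limit, flops_limit)  # 2 2
```

In both cases two kernels fit (532 GB/s and about 369 GFLOPS) and a third exceeds the supply, which matches the point at which the text reports the sharp increase in execution time.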

Regarding the third characteristic, referring to FIG. 4, when the L1 cache-intensive kernel LBM is paired with other kernels, most pairings perform worse than LBM executed alone (black line). The exception is QS, whose L1 cache transaction count is close to 0. From this result the third characteristic is derived: a gain from concurrent execution can be expected only when the kernel is paired with one that makes little use of the L1 cache.

Regarding the fourth characteristic, referring to FIG. 5, HS, RD, NW, SY, STENCIL, SPMV, and NW shown in FIG. 5 are all memory-intensive kernels. Concurrent execution of memory-intensive kernels performed worse than executing a memory-intensive kernel alone (red line). From this result the fourth characteristic is derived: even when the bandwidth provided by the GPU is not exceeded, concurrent execution of memory-intensive kernels yields no performance improvement.

Regarding the fifth characteristic, FIG. 6 shows the kernel execution characteristics observed when computation-intensive kernels are executed concurrently. FIG. 6(e) shows the performance results for the computation-intensive kernel FDTD executed concurrently with other computation-intensive kernels, and FIG. 6(f) shows the results for the computation-intensive kernel CUTCP executed concurrently with other computation-intensive kernels.

For FDTD in (e), the EPC is less than 1, so every kernel executed concurrently with FDTD shows a performance improvement. For CUTCP in (f), by contrast, the EPC is much greater than 1, so no performance improvement is seen when it is executed concurrently with any kernel other than LavaMD and CUTCP.

If all concurrently executed kernels have a large EPC, there are few execution stalls for concurrent execution to hide, so no performance improvement appears. From this the fifth characteristic is derived: the processing speed of the GPU does not improve when two or more computation-intensive kernels whose EPC is greater than a preset threshold are executed concurrently. The preset EPC threshold may be, for example, 1.

Meanwhile, the scheduling rule 230 is the rule applied when the K-Scheduler algorithm groups a plurality of kernels to generate the concurrent execution group list, and it is derived from the kernel execution characteristics described above. The scheduling rule 230 includes static scheduling rules and dynamic scheduling rules.

More specifically, the static scheduling rules include a first rule, under which kernels cannot use more resources than the GPU provides, and a second rule, under which the kernels' cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) cannot exceed what the hardware supplies. The dynamic scheduling rules include a third rule, under which an L1 cache-intensive kernel cannot be multitasked with another kernel whose L1 cache transaction count exceeds a preset number; a fourth rule, under which two or more memory-intensive kernels cannot be multitasked together; and a fifth rule, under which two or more computation-intensive kernels whose EPC (Eligible Warp Per Cycle) is greater than a preset threshold (base threshold) cannot be multitasked together.

Because the scheduling rule 230 is derived from the kernel execution characteristics, the kernels placed in the concurrent execution group list generated under the rule compete minimally for resources, which allows efficient multitasking.

FIG. 7 shows the K-Scheduler algorithm according to an embodiment of the present specification, FIG. 8 shows an algorithm for evaluating the third to fifth rules, and FIG. 9 shows how the multi-tasking efficiency apparatus generates the concurrent execution group list according to an embodiment of the present specification. The description below refers to FIGS. 7 to 9.

Referring first to FIG. 7, the K-Scheduler algorithm operates on kernel information, which includes category information, kernel execution characteristic information, and profiling information.

That is, the category of each kernel is determined from the profiling information of the plurality of kernels 10, and the scheduling rules used in the K-Scheduler algorithm are based on the kernel execution characteristic information, so the K-Scheduler algorithm as a whole operates on the kernel information.

When a plurality of kernels (K¹, K², ..., K^k) that have been classified by the kernel classifier 200 arrive as input (line 1), the multi-tasking efficiency apparatus 300 sorts them using the category information 210, which contains information on the category of each kernel (line 2).

Specifically, the multi-tasking efficiency apparatus 300 sorts the plurality of kernels (K¹, K², ..., K^k) in the order of L1 cache-intensive kernels, memory-intensive kernels, and computation-intensive kernels. When kernels have the same category information 210, the kernel with the longer execution time is given priority.
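The sort order described above can be sketched as a two-part key: category rank first, then execution time in descending order. This is a minimal illustration, not the patent's implementation; the `Kernel` class, the category labels, and the sample kernels are hypothetical.

```python
from dataclasses import dataclass

# Category rank: L1 cache-intensive, then memory-intensive, then
# computation-intensive. The string labels are assumptions for this sketch.
CATEGORY_ORDER = {"L1": 0, "MEM": 1, "COMP": 2}


@dataclass(frozen=True)
class Kernel:
    name: str
    category: str        # one of the keys of CATEGORY_ORDER
    exec_time_ms: float  # execution time taken from profiling


def sort_kernels(kernels):
    """Sort by category rank, breaking ties by longer execution time first."""
    return sorted(
        kernels,
        key=lambda k: (CATEGORY_ORDER[k.category], -k.exec_time_ms),
    )


if __name__ == "__main__":
    workload = [
        Kernel("c1", "COMP", 10.0),
        Kernel("m1", "MEM", 5.0),
        Kernel("l1", "L1", 1.0),
        Kernel("m2", "MEM", 9.0),
    ]
    print([k.name for k in sort_kernels(workload)])
```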

The sorted kernels (SK¹, SK², ..., SK^k) are stored in a sorted list (SK_List), and the multi-tasking efficiency apparatus 300 repeatedly searches for combinations of kernels to execute concurrently until no kernel remains in the sorted list (line 3).

That is, the multi-tasking efficiency apparatus 300 groups the kernels to be executed concurrently by applying the scheduling rules, which are based on the kernel execution characteristic information. In other words, it checks whether the kernels in the sorted list (SK¹, SK², ..., SK^k) satisfy the scheduling rules and groups the combinations of kernels that do (lines 5-9).

Each group (CK) formed in this way contains at least one kernel to be executed concurrently. The groups are appended to the concurrent execution group list (CK_List) so that they are executed sequentially, which produces the concurrent execution group list (line 10).
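The loop described above (repeat until the sorted list is empty, collect kernels that satisfy the rules into a group, append the group to CK_List) can be sketched as a greedy pass. FIG. 7 is not reproduced in this text, so the details are assumptions: the function name `build_concurrent_groups`, the `can_join` predicate standing in for the scheduling rules, and the choice to always seed a group with the first remaining kernel so that the loop is guaranteed to terminate.

```python
from typing import Callable, List, Sequence, TypeVar

K = TypeVar("K")


def build_concurrent_groups(
    sk_list: Sequence[K],
    can_join: Callable[[List[K], K], bool],
) -> List[List[K]]:
    """Greedily partition an already sorted kernel list into groups."""
    remaining = list(sk_list)
    ck_list: List[List[K]] = []
    while remaining:                  # until the sorted list is empty
        group = [remaining[0]]        # assumption: first kernel seeds the group
        leftover = []
        for kernel in remaining[1:]:
            if can_join(group, kernel):   # scheduling-rule check
                group.append(kernel)
            else:
                leftover.append(kernel)
        ck_list.append(group)         # groups run in the order they are added
        remaining = leftover
    return ck_list


if __name__ == "__main__":
    # Hypothetical workload of (name, category) pairs, already sorted.
    workload = [("A", "MEM"), ("B", "MEM"), ("C", "COMP")]

    def no_two_memory_kernels(group, kernel):
        # Stand-in for the fourth rule only.
        return not (kernel[1] == "MEM" and any(g[1] == "MEM" for g in group))

    print(build_concurrent_groups(workload, no_two_memory_kernels))
```

Passing the rule check in as a predicate keeps the grouping loop separate from the rules themselves, so static and dynamic checks can be combined without changing the loop.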

Whether a kernel in the sorted list satisfies the scheduling rules is determined using the following equations.

First, with respect to the first rule:

<Equation 1>

<Equation 2>

<Equation 3>

Using Equation 1, the multi-tasking efficiency apparatus 300 checks that the sum of the number of registers requested by the kernels selected so far and the number of registers requested by K_i does not exceed MAX_REG, the maximum number of registers the current GPU can provide. Using Equation 2, it checks that the sum of the shared memory sizes of the kernels selected so far and the shared memory size used by K_i does not exceed MAX_SMEM, the size of the shared memory the current GPU can provide. Using Equation 3, it checks that the sum of the specified thread block counts of the kernels in the group and the specified thread block count of K_i is less than MAX_TB, the maximum number of thread blocks the GPU can provide. In this way it determines whether a kernel in the sorted list satisfies the scheduling rule.
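The equation images are not reproduced in this text. Based only on the verbal description above, the three conditions could be written as follows. This is a reconstruction under assumptions, not the patent's own notation: CK denotes the set of kernels selected so far, and REG(·), SMEM(·), and TB(·) are hypothetical names for a kernel's requested register count, shared memory size, and specified thread block count. Note that the third condition is strict, following the wording "less than MAX_TB".

```latex
\begin{align}
\sum_{K_j \in CK} \mathrm{REG}(K_j) + \mathrm{REG}(K_i) &\le \mathrm{MAX}_{\mathrm{REG}} \\
\sum_{K_j \in CK} \mathrm{SMEM}(K_j) + \mathrm{SMEM}(K_i) &\le \mathrm{MAX}_{\mathrm{SMEM}} \\
\sum_{K_j \in CK} \mathrm{TB}(K_j) + \mathrm{TB}(K_i) &< \mathrm{MAX}_{\mathrm{TB}}
\end{align}
```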

With respect to the second rule:

<Equation 4>

<Equation 5>

Using Equation 4, the multi-tasking efficiency apparatus 300 checks that the sum of the bandwidth demand of the kernels selected so far and the bandwidth demand required to execute K_i does not exceed MAX_BW, the maximum bandwidth the current GPU can provide. Using Equation 5, it checks that the sum of the required computing performance of the kernels selected so far does not exceed MAX_GFLOPS, the GFLOPS the current GPU can supply. In this way it determines whether a kernel in the sorted list satisfies the scheduling rule.
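The static checks of the first and second rules can be sketched together as one admission test for a candidate kernel. The class and field names (`Demand`, `GpuLimits`, `satisfies_static_rules`) are hypothetical. The bandwidth and GFLOPS figures reuse the FIG. 3 example, while the register, shared memory, and thread block numbers are made up for illustration. Including the candidate's own demand in the GFLOPS sum is an assumption by analogy with the bandwidth check.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Demand:
    regs: int            # registers requested by the kernel
    smem_bytes: int      # shared memory used by the kernel
    thread_blocks: int   # specified thread block count
    bw_gbps: float       # memory bandwidth demand
    gflops: float        # required computing performance


@dataclass(frozen=True)
class GpuLimits:
    max_reg: int
    max_smem: int
    max_tb: int
    max_bw: float
    max_gflops: float


def satisfies_static_rules(group: Iterable[Demand], cand: Demand,
                           gpu: GpuLimits) -> bool:
    """Check the first rule (resources) and second rule (bandwidth, GFLOPS)."""
    members = list(group) + [cand]
    return (
        sum(k.regs for k in members) <= gpu.max_reg
        and sum(k.smem_bytes for k in members) <= gpu.max_smem
        and sum(k.thread_blocks for k in members) < gpu.max_tb  # strict, per text
        and sum(k.bw_gbps for k in members) <= gpu.max_bw
        and sum(k.gflops for k in members) <= gpu.max_gflops
    )


if __name__ == "__main__":
    gpu = GpuLimits(max_reg=65536, max_smem=98304, max_tb=64,
                    max_bw=547.6, max_gflops=379.7)
    k = Demand(regs=1000, smem_bytes=1024, thread_blocks=4,
               bw_gbps=266.0, gflops=184.4948)
    print(satisfies_static_rules([k], k, gpu))     # two kernels in total
    print(satisfies_static_rules([k, k], k, gpu))  # three kernels in total
```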

With respect to the third to fifth rules, as shown in FIG. 8, the multi-tasking efficiency apparatus 300 uses the category information to check three conditions: whether an L1 cache-intensive kernel would be executed concurrently with another kernel exceeding the preset L1 cache transaction count, whether two or more memory-intensive kernels would be executed concurrently, and whether two or more computation-intensive kernels whose EPC is greater than the preset threshold (base threshold) would be executed concurrently. In this way it determines whether a kernel in the sorted list satisfies the scheduling rule.
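The three dynamic checks can be sketched as a pairwise test between a candidate kernel and each kernel already in the group. FIG. 8 is not reproduced here, so this is an interpretation of the prose rather than the patent's algorithm. `KernelProfile`, `L1_TXN_LIMIT`, and the sample values are hypothetical; the EPC threshold of 1 is the example value given in the text.

```python
from dataclasses import dataclass
from typing import Iterable

L1_TXN_LIMIT = 1000        # hypothetical preset L1 cache transaction count
EPC_BASE_THRESHOLD = 1.0   # example threshold given in the text


@dataclass(frozen=True)
class KernelProfile:
    name: str
    category: str          # "L1", "MEM", or "COMP" (labels are assumptions)
    l1_transactions: int
    epc: float


def satisfies_dynamic_rules(group: Iterable[KernelProfile],
                            cand: KernelProfile) -> bool:
    """Check the third, fourth, and fifth rules against every group member."""
    for member in group:
        # Third rule: an L1 cache-intensive kernel must not share the GPU
        # with a kernel whose L1 transactions exceed the preset count.
        for a, b in ((member, cand), (cand, member)):
            if a.category == "L1" and b.l1_transactions > L1_TXN_LIMIT:
                return False
        # Fourth rule: no two memory-intensive kernels together.
        if member.category == "MEM" and cand.category == "MEM":
            return False
        # Fifth rule: no two computation-intensive kernels that both have
        # an EPC above the threshold.
        if (member.category == "COMP" and cand.category == "COMP"
                and member.epc > EPC_BASE_THRESHOLD
                and cand.epc > EPC_BASE_THRESHOLD):
            return False
    return True


if __name__ == "__main__":
    l1_heavy = KernelProfile("l1_heavy", "L1", 50_000, 0.8)
    l1_light = KernelProfile("l1_light", "COMP", 10, 0.5)
    print(satisfies_dynamic_rules([l1_heavy], l1_light))
```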

In this way, the multi-tasking efficiency apparatus according to an embodiment of the present specification achieves high utilization of GPU resources through the scheduling rules, improving not only the performance of the GPU as a whole but also the expected performance of each individual kernel.

Referring to FIG. 9, the plurality of kernels (K⁰, K¹, K², K³, K⁴, K⁵) that enter the multi-tasking efficiency apparatus 300 as input (Workload) are grouped by the K-Scheduler algorithm into the groups CK₁, CK₂, and CK₃.

The resulting concurrent execution group list is CK_List = (CK₁, CK₂, CK₃), where CK₁, CK₂, and CK₃ each contain at least one kernel: {K⁰, K³, K⁵}, {K², K⁴}, and {K¹}, respectively. All kernels within a group are executed concurrently, and the groups are executed sequentially in the order CK₁, CK₂, CK₃.

That is, for each group in the concurrent execution group list, the multi-tasking efficiency apparatus 300 transmits the grouped kernel or kernels to the GPU, and the GPU 400 multitasks the transmitted kernels.
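The dispatch order (groups one after another, kernels within a group together) can be illustrated with a host-side simulation. This sketch does not launch anything on a GPU. The `launch` callable is a hypothetical stand-in for submitting a kernel, and a thread pool merely models "all kernels of a group in flight at once". The group contents follow the FIG. 9 example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

K = TypeVar("K")
R = TypeVar("R")


def run_groups(ck_list: Sequence[Sequence[K]],
               launch: Callable[[K], R]) -> List[List[R]]:
    """Run groups sequentially; run the kernels inside each group together."""
    results: List[List[R]] = []
    for group in ck_list:
        # Each group holds at least one kernel, so max_workers >= 1.
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            # Leaving the `with` block waits for the whole group to finish
            # before the next group starts.
            results.append(list(pool.map(launch, group)))
    return results


if __name__ == "__main__":
    ck_list = [["K0", "K3", "K5"], ["K2", "K4"], ["K1"]]  # CK1, CK2, CK3
    print(run_groups(ck_list, lambda name: f"{name} done"))
```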

FIG. 10 is a flowchart of a multi-tasking efficiency method according to an embodiment of the present specification.

Referring to the drawing, in the multi-tasking efficiency method, when a plurality of kernels are received as input (S110), the plurality of kernels are sorted using category information that includes information on the categories of the kernels (S120).

The method then generates a concurrent execution group list including a plurality of groups, each grouping at least one kernel to be executed concurrently among the plurality of kernels, by using the K-Scheduler algorithm based on the sorted plurality of kernels and on kernel information that includes the category information, kernel execution characteristic information, and profiling information (S130).

Thereafter, for each group included in the concurrent execution group list, the method transmits the grouped kernel or kernels to the GPU to perform multi-tasking.

The present invention has been described above with reference to the illustrated drawings, but it is not limited by the embodiments and drawings disclosed in this specification, and it is evident that a person skilled in the art may make various modifications within the scope of the technical idea of the present invention. In addition, even where an effect of a configuration of the present invention was not explicitly stated while describing the embodiments, any effect predictable from that configuration should also be recognized.

Claims

A method for multi-tasking efficiency, comprising:
receiving a plurality of kernels as input;
sorting the plurality of kernels by using category information including information on categories of the kernels;
generating a concurrent execution group list including a plurality of groups, each grouping at least one kernel to be executed concurrently among the plurality of kernels, by using a K-Scheduler algorithm based on the sorted plurality of kernels and on kernel information including the category information, kernel execution characteristic information, and profiling information; and
for each group included in the concurrent execution group list, transmitting the grouped at least one kernel to a GPU to perform multi-tasking.

The method of claim 1,
wherein the category information of the kernels
includes a computation-intensive kernel, a memory-intensive kernel, and an L1 cache-intensive kernel.

The method of claim 2,
wherein the sorting of the plurality of kernels comprises:
sorting the plurality of kernels in the order of L1 cache-intensive kernels, memory-intensive kernels, and computation-intensive kernels and, when the category information is the same, giving priority to the kernel with the longer execution time.

The method of claim 1,
wherein the execution characteristic information comprises:
a first characteristic that, for a kernel whose Eligible Warp Per Cycle (EPC) is less than 1, the processing speed of the GPU is not improved even when a large amount of resources is allocated to the kernel;
a second characteristic that the processing speed of the GPU is not improved when the cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) of the kernels exceed the supply of the GPU;
a third characteristic that, for an L1 cache-intensive kernel, the processing speed of the GPU is not improved when the kernel is executed concurrently with another kernel exceeding a preset number of L1 cache transactions;
a fourth characteristic that the processing speed of the GPU is not improved when two or more memory-intensive kernels are executed concurrently; and
a fifth characteristic that the processing speed of the GPU is not improved when two or more computation-intensive kernels whose EPC is greater than a preset threshold (base threshold) are executed concurrently.

The method of claim 1,
wherein the profiling information includes
static profiling information, which is information on the amount of GPU resources used by the kernel, runtime information of the kernel, and dynamic profiling information on the causes of execution stalls.

The method of claim 1,
wherein the K-Scheduler algorithm
includes a scheduling rule based on the execution characteristic information.

The method of claim 6,
wherein the scheduling rule includes
at least one of a static scheduling rule and a dynamic scheduling rule;
the static scheduling rule includes a first rule under which kernels cannot use more than the resources provided by the GPU, and a second rule under which the cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) of the kernels cannot exceed the supply of the GPU; and
the dynamic scheduling rule includes a third rule under which an L1 cache-intensive kernel cannot be multitasked with another kernel exceeding a preset number of L1 cache transactions, a fourth rule under which two or more memory-intensive kernels cannot be multitasked, and a fifth rule under which two or more computation-intensive kernels whose Eligible Warp Per Cycle (EPC) is greater than a preset threshold (base threshold) cannot be multitasked.

An apparatus for multi-tasking efficiency, comprising
one or more processors that execute instructions,
wherein the one or more processors:
receive a plurality of kernels as input;
sort the plurality of kernels by using category information including information on categories of the kernels;
generate a concurrent execution group list including a plurality of groups, each grouping at least one kernel to be executed concurrently among the plurality of kernels, by using a K-Scheduler algorithm based on the sorted plurality of kernels and on kernel information including the category information, kernel execution characteristic information, and profiling information; and
for each group included in the concurrent execution group list, transmit the grouped at least one kernel to a GPU to perform multi-tasking.

The apparatus of claim 8,
wherein the category information of the kernels
includes a computation-intensive kernel, a memory-intensive kernel, and an L1 cache-intensive kernel.

The apparatus of claim 9,
wherein the one or more processors
sort the plurality of kernels in the order of L1 cache-intensive kernels, memory-intensive kernels, and computation-intensive kernels and, when the category information is the same, give priority to the kernel with the longer execution time.

The apparatus of claim 8,
wherein the execution characteristic information comprises:
a first characteristic that, for a kernel whose Eligible Warp Per Cycle (EPC) is less than 1, the processing speed of the GPU is not improved even when a large amount of resources is allocated to the kernel;
a second characteristic that the processing speed of the GPU is not improved when the cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) of the kernels exceed the supply of the GPU;
a third characteristic that, for an L1 cache-intensive kernel, the processing speed of the GPU is not improved when the kernel is executed concurrently with another kernel exceeding a preset number of L1 cache transactions;
a fourth characteristic that the processing speed of the GPU is not improved when two or more memory-intensive kernels are executed concurrently; and
a fifth characteristic that the processing speed of the GPU is not improved when two or more computation-intensive kernels whose EPC is greater than a preset threshold (base threshold) are executed concurrently.

The apparatus of claim 8,
wherein the profiling information includes
static profiling information, which is information on the amount of GPU resources used by the kernel, runtime information of the kernel, and dynamic profiling information on the causes of execution stalls.

The apparatus of claim 8,
wherein the K-Scheduler algorithm
includes a scheduling rule based on the execution characteristic information.

The apparatus of claim 13,
wherein the scheduling rule includes
at least one of a static scheduling rule and a dynamic scheduling rule;
the static scheduling rule includes a first rule under which kernels cannot use more than the resources provided by the GPU, and a second rule under which the cumulative memory bandwidth usage and computation amount (FLOPS; FLoating point Operations Per Second) of the kernels cannot exceed the supply of the GPU; and
the dynamic scheduling rule includes a third rule under which an L1 cache-intensive kernel cannot be multitasked with another kernel exceeding a preset number of L1 cache transactions, a fourth rule under which two or more memory-intensive kernels cannot be multitasked, and a fifth rule under which two or more computation-intensive kernels whose Eligible Warp Per Cycle (EPC) is greater than a preset threshold (base threshold) cannot be multitasked.