KR102142498B1 - GPU memory controller for GPU prefetching through static analysis and method of control - Google Patents

GPU memory controller for GPU prefetching through static analysis and method of control

Publication number
KR102142498B1
Authority
KR
South Korea
Prior art keywords
thread
threadidx
gpu
threads
access pattern
Prior art date
Application number
KR1020180118835A
Other languages
Korean (ko)
Other versions
KR20200039202A (en)
Inventor
한환수
김현준
홍성인
Original Assignee
성균관대학교산학협력단
재단법인 초고성능 컴퓨팅 연구단
Priority date
Filing date
Publication date
Application filed by 성균관대학교산학협력단, 재단법인 초고성능 컴퓨팅 연구단
Priority to KR1020180118835A
Publication of KR20200039202A
Application granted
Publication of KR102142498B1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/3004 - Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 - Prefetch instructions; cache control instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/60 - Memory management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45 - Caching of specific data in cache memory
    • G06F2212/455 - Image or video data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60 - Details of cache memory
    • G06F2212/6022 - Using a prefetch buffer or dedicated prefetch cache

Abstract

The present invention relates to a GPU memory control apparatus and control method that perform GPU prefetching through static analysis of the GPU kernel.
A GPU memory control apparatus according to an embodiment of the present invention comprises:
an index definition module that collects the threads existing in the GPU kernel and defines the index of a global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination module that calculates the distance between threads from the coefficient of the thread ID for each data dimension and determines, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination module that determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

Description

GPU memory controller for GPU prefetching through static analysis and method of control

The present invention relates to a GPU memory control apparatus and control method, and more particularly, to a GPU memory control apparatus and control method that perform GPU prefetching through static analysis of the GPU kernel.

A graphics processing unit (GPU) has a multi-level cache memory structure like a central processing unit (CPU).

FIG. 1 is a block diagram showing a general GPU architecture. Referring to FIG. 1, the GPU includes n SIMD units. SIMD (Single Instruction Multiple Data) is a type of parallel processor that computes several values simultaneously with one instruction. Each GPU manufacturer uses its own name for this unit: it is called a Streaming Multiprocessor by NVIDIA® and a Compute Unit by AMD®. NVIDIA®'s GTX980®, for example, has a total of 16 SIMD units, and inside each SIMD unit there are 4 warp schedulers, each with 32 CUDA cores, 8 load/store units, and special function units. However, since these details are not directly related to the content described in this specification, a more detailed description is omitted.

The SIMD unit may include registers, a scratchpad memory (shared memory), an L1 cache, and a read-only memory (ROM). The n SIMD units may share an L2 cache and a global memory (DRAM). For example, the L1 cache may be 48 KB per SIMD unit, and the L2 cache may be 2 MB. The main memory, shown outside the dotted line on the right, is the memory attached to the CPU and is connected to the SIMD units.

Depending on where they reside, these memories are classified as on-chip memory or off-chip memory. The registers, L1 cache, L2 cache, read-only memory (ROM), and scratchpad memory are on-chip memory; the global memory is off-chip memory.

Meanwhile, GPUs are capable of large-scale parallel processing, but their computational performance is not fully utilized because of slow memory read/write performance. As described above, GPU memory is divided into off-chip memory and on-chip memory, and since the access latencies of the two differ by a factor of 10 to 100 or more, an inefficient memory access pattern has a large impact on performance.

Accordingly, optimization studies have sought to minimize accesses to off-chip memory by exploiting on-chip memory. One such line of work maximizes the bandwidth utilization between the two memories by prefetching data from off-chip memory.

In general, research has prefetched specific memory access patterns into the GPU's L1 cache, but the GPU L1 cache is not large enough to be suitable as a prefetching space.

GPU scratchpad memory is one of the on-chip memories; it has read/write performance similar to the L1 cache, and the user can allocate its space explicitly. Because data prefetched by the user is guaranteed to remain in this space, prefetching techniques that use it are being studied.
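
For orientation only, the following is a minimal CUDA sketch of this idea (the kernel, array names, and tile size are assumptions for illustration, not the mechanism claimed below): each thread block stages a tile of a global array into shared (scratchpad) memory before computing on it.

    #include <cstdio>

    #define TILE 256

    // One block of TILE threads walks the array tile by tile, prefetching each
    // tile from global (off-chip) memory into shared (on-chip) memory first.
    __global__ void scaled_copy(const float *in, float *out, int n, float s) {
        __shared__ float tile[TILE];                  // user-allocated scratchpad space
        for (int base = 0; base < n; base += TILE) {
            int idx = base + threadIdx.x;
            if (idx < n)
                tile[threadIdx.x] = in[idx];          // prefetch into shared memory
            __syncthreads();                          // tile is now resident on-chip
            if (idx < n)
                out[idx] = s * tile[threadIdx.x];     // compute from on-chip memory
            __syncthreads();                          // buffer is reused next tile
        }
    }

    int main() {
        const int n = 1024;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;
        scaled_copy<<<1, TILE>>>(in, out, n, 2.0f);
        cudaDeviceSynchronize();
        printf("%f\n", out[0]);                       // expect 2.0
        cudaFree(in);
        cudaFree(out);
        return 0;
    }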

Korean Patent Publication No. 10-2015-0092440

An object of the present invention is to provide a GPU memory control device and a control method for performing GPU prefetch through static analysis of a GPU kernel.

A GPU memory control apparatus according to an embodiment of the present invention comprises:

an index definition module that collects the threads existing in the GPU kernel and defines the index of a global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination module that calculates the distance between threads from the coefficient of the thread ID for each data dimension and determines, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination module that determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

In the GPU memory control apparatus according to an embodiment of the present invention, the index of the global memory variable may be defined by Equation (1) below.

Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]

(Here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant.)

In the GPU memory control apparatus according to an embodiment of the present invention, the thread-dense memory access pattern determination module may determine a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.

In the GPU memory control apparatus according to an embodiment of the present invention, the prefetch target determination module may determine a thread having the thread-dense memory access pattern as a thread for which prefetch target data exists when the coefficient of the loop variable is not 0.

A GPU memory control method according to an embodiment of the present invention comprises:

an index definition step of collecting the threads existing in the GPU kernel and defining the index of a global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination step of calculating the distance between threads from the coefficient of the thread ID for each data dimension and determining, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination step of determining, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

In the GPU memory control method according to an embodiment of the present invention, the index of the global memory variable may be defined by Equation (1) below.

Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]

(Here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant.)

In the GPU memory control method according to an embodiment of the present invention, the thread-dense memory access pattern determination step may determine a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.

In the GPU memory control method according to an embodiment of the present invention, the prefetch target determination step may determine a thread having the thread-dense memory access pattern as a thread for which prefetch target data exists when the coefficient of the loop variable is not 0.

A computer program stored in a recording medium according to an embodiment of the present invention causes a computing means to execute:

an index definition step of collecting the threads existing in the GPU kernel and defining the index of a memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination step of calculating the distance between threads from the coefficient of the thread ID for each data dimension and determining, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination step of determining, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

Other specific details of implementations according to various aspects of the present invention are included in the detailed description below.

Conventional studies have relied on dynamic analysis (e.g., on-line profiling). Dynamic analysis permits accurate program analysis even for irregular patterns, but it has the drawback that a new analysis must be performed, by executing the program, for every set of program inputs. In contrast, the GPU memory control apparatus and control method of the present invention use a static analysis method that works at the source code level; a program can be analyzed without being executed, which saves analysis time, analysis cost, and analysis manpower.

FIG. 1 is a block diagram showing a general GPU architecture.
FIG. 2 is a block diagram illustrating a GPU memory control apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a GPU memory control method according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the GPU memory control method of FIG. 3 in more detail.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings so that those skilled in the art to which the present invention pertains can easily practice them. As will be readily understood by those of ordinary skill in the art, the embodiments described below may be modified in various forms without departing from the concept and scope of the present invention. Where possible, the same or similar parts are indicated by the same reference numerals in the drawings.

The terminology used herein refers only to specific embodiments and is not intended to limit the invention. Singular forms used herein also include plural forms unless the phrase clearly indicates otherwise.

As used herein, the meaning of "comprising" specifies particular properties, regions, integers, steps, operations, elements and/or components, and does not exclude the presence or addition of other properties, regions, integers, steps, operations, elements, components and/or groups thereof.

All terms used in this specification, including technical and scientific terms, have the same meanings as those generally understood by those skilled in the art to which the present invention pertains. Terms defined in dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless expressly so defined. Hereinafter, a GPU memory control apparatus and control method according to an embodiment of the present invention will be described with reference to the drawings.

FIG. 2 is a block diagram illustrating a GPU memory control apparatus according to an embodiment of the present invention. The static analysis method of the present invention works at the source code level and determines prefetch targets without executing the program. The GPU memory control apparatus that performs it includes an index definition module 100, a thread-dense memory access pattern determination module 200, and a prefetch target determination module 300, as shown in FIG. 2.

The index definition module 100 collects the threads existing inside the GPU kernel. The index definition module 100 may collect threads using ANTLR (ANother Tool for Language Recognition), which can parse a GPU CUDA program and generate an abstract syntax tree (AST). ANTLR is a parser generator that uses LL(*) parsing for computer-based language recognition.

For static analysis, the index definition module 100 defines the index of the global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable.

Data dimension refers to how the threads are arrayed. For example, when 100 threads are arrayed in one dimension, they may be arrayed as T1, T2, ... T100. Likewise, when 100 threads are arrayed in two dimensions, they may be arrayed as T(1,1), T(1,2), ... T(10,10).
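
For illustration (a hypothetical CUDA launch configuration, not taken from the patent text), the same 100 threads can be laid out in one dimension or in two dimensions through the block configuration:

    #include <cstdio>

    __global__ void show_1d() {                  // 100 threads arrayed in one dimension
        printf("T%d\n", threadIdx.x + 1);        // prints T1 ... T100
    }

    __global__ void show_2d() {                  // 100 threads arrayed in two dimensions
        printf("T(%d,%d)\n", threadIdx.y + 1, threadIdx.x + 1);  // prints T(1,1) ... T(10,10)
    }

    int main() {
        show_1d<<<1, dim3(100)>>>();
        cudaDeviceSynchronize();
        show_2d<<<1, dim3(10, 10)>>>();
        cudaDeviceSynchronize();
        return 0;
    }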

The index of the memory variable defined in the index definition module 100 is as shown in Equation (1) below.

Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]

Here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant.

CUDA, a programming language that enables programmer-written code to run on NVIDIA GPUs, gives each thread a unique thread ID (threadIdx). The index definition module 100 uses the thread ID, by which each thread requests data at an independent memory address. The thread ID (threadIdx) is expressed as threadIdx.x, threadIdx.y, and threadIdx.z according to the data dimension.
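
As a concrete example (a hypothetical one-dimensional kernel, assumed for illustration and not taken from the patent), the global memory index below has exactly the shape of Equation (1), with Tx = 1, S = N, and C = 0:

    // Each thread sums one column of a row-major N x N matrix. The index
    // a[1*threadIdx.x + N*i + 0] matches Equation (1): Tx = 1, S = N, C = 0.
    __global__ void column_sum(const float *a, float *out, int N) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += a[1 * threadIdx.x + N * i + 0];
        out[threadIdx.x] = acc;
    }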

The thread-dense memory access pattern determination module 200 calculates the distance between threads from the coefficient of the thread ID for each data dimension and determines, based on that distance, whether the access is a thread-dense memory access pattern.

Specifically, the thread-dense memory access pattern determination module 200 makes the determination using Equation (2) below; the coefficient of the thread ID for each data dimension is the distance between the addresses accessed by neighboring threads in that dimension.

Equation (2): (Tx = 0) ∨ (Ty = 0) ∨ (Tz = 0)

Here, ∨ denotes the logical operator "or". If at least one of the thread ID coefficients (Tx, Ty, Tz) for each data dimension is 0, the result of Equation (2) is "True".

For example, when the data dimension is one-dimensional, the basic index of a global memory instruction has the form [Tx × threadIdx.x + S × i + C]. When the coefficient Tx of threadIdx.x is 0, that is, when Equation (2) is true, the memory access pattern becomes [S × i + C], and all threads access the same address S × i + C in the same loop iteration. This is a thread-dense pattern in which all threads within the same dimension request data at the same address.

Likewise, when the data dimension is two-dimensional and the coefficient of threadIdx.x or threadIdx.y is 0, or when the data dimension is three-dimensional and the coefficient of at least one of threadIdx.x, threadIdx.y, and threadIdx.z is 0, the corresponding data has a thread-dense memory access pattern.

That is, the thread-dense memory access pattern determination module 200 determines the corresponding data as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.
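
To make the one-dimensional case concrete (hypothetical kernels, assumed for illustration), compare an index whose Tx is 0 with one whose Tx is 1:

    // Tx = 0: the index [0*threadIdx.x + 1*i + 0] does not depend on the thread
    // ID, so in loop iteration i every thread reads the same address b[i].
    // By Equation (2) this is a thread-dense memory access pattern.
    __global__ void dense_reads(const float *b, float *out, int N) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += b[i];
        out[threadIdx.x] = acc;
    }

    // Tx = 1: the index [1*threadIdx.x + 1*i + 0] differs per thread, so threads
    // read different addresses and the pattern is not thread-dense.
    __global__ void non_dense_reads(const float *b, float *out, int N) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += b[threadIdx.x + i];
        out[threadIdx.x] = acc;
    }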

The prefetch target determination module 300 determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

The prefetch target determination module 300 determines whether prefetch target data exists using Equation (3) below.

Equation (3): S ≠ 0

Here, S is the coefficient of the loop variable (i) described in Equation (1), and it is also the distance between the addresses a thread accesses in successive loop iterations.

According to Equation (3), if the coefficient S of the loop variable is not 0, that is, if the distance between addresses is 1 or more, a thread accesses a different address in each loop iteration, so it can be judged that data to prefetch exists.

In summary, for static analysis of the index of a global memory instruction in CUDA code, the index of the global memory variable is defined as in Equation (1); the distance between threads is then examined using Equation (2) to determine whether the access is a thread-dense memory access pattern; and finally, for the thread-dense memory access pattern, Equation (3) determines whether data to prefetch exists.
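
The following host-side sketch condenses this decision procedure (the struct and function names are illustrative assumptions, not the patent's implementation):

    // Index of Equation (1): Tx*threadIdx.x + Ty*threadIdx.y + Tz*threadIdx.z + S*i + C
    struct IndexExpr {
        int Tx, Ty, Tz;   // thread ID coefficients per data dimension
        int S;            // loop variable coefficient
        int C;            // constant term
        int dims;         // data dimensions actually used: 1, 2, or 3
    };

    // Equation (2): thread-dense if at least one used thread ID coefficient is 0.
    bool isThreadDense(const IndexExpr &e) {
        if (e.dims >= 1 && e.Tx == 0) return true;
        if (e.dims >= 2 && e.Ty == 0) return true;
        if (e.dims >= 3 && e.Tz == 0) return true;
        return false;
    }

    // Equation (3): among thread-dense accesses, prefetch if S != 0, i.e. the
    // thread touches a different address on every loop iteration.
    bool isPrefetchTarget(const IndexExpr &e) {
        return isThreadDense(e) && e.S != 0;
    }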

Next, a GPU memory control method according to an embodiment of the present invention will be described with reference to FIGS. 3 and 4. FIG. 3 is a flowchart illustrating the GPU memory control method according to an embodiment of the present invention, and FIG. 4 is a flowchart illustrating the same method in more detail.

Referring to FIGS. 3 and 4, the GPU memory control method according to an embodiment of the present invention includes an index definition step (S100), a thread-dense memory access pattern determination step (S200), and a prefetch target determination step (S300).

In the index definition step (S100, S100'), the threads existing inside the GPU kernel are collected, and the index of the global memory variable is defined using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable. The index of the memory variable defined in the index definition step (S100) is as in Equation (1) described above.

In the thread-dense memory access pattern determination step (S200, S200'), the distance between threads is calculated from the coefficient of the thread ID for each data dimension, and whether the access is a thread-dense memory access pattern is determined from that distance. In the thread-dense memory access pattern determination step (S200), the distance between threads is calculated using Equation (2) described above; if at least one of the thread ID coefficients for each data dimension is 0, the data is determined as having a thread-dense memory access pattern.

In the prefetch target determination step (S300, S300'), whether prefetch target data exists among the data having a thread-dense memory access pattern is determined from the coefficient of the loop variable, using Equation (3) described above. In the prefetch target determination step (S300), if the coefficient S of the loop variable is not 0, that is, if the distance between addresses is 1 or more, a thread accesses a different address in each loop iteration, so it can be determined that data for prefetching exists.

When the data to be prefetched is determined through the above steps, it is added to the prefetch list (S400, S400').

Conventional studies that minimize accesses to off-chip memory by using on-chip memory rely on dynamic analysis (e.g., on-line profiling). Dynamic analysis permits accurate program analysis of irregular patterns, but has the disadvantage that a new analysis must be performed, by executing the program, for every set of program inputs. In contrast, since the static analysis method proposed in the present invention works at the source code level, a program can be analyzed without being executed. Because GPU programs generally exhibit regular memory access patterns, static analysis alone provides sufficiently accurate program analysis.

Although one embodiment of the present invention has been described above, those skilled in the art may variously modify and change the present invention by adding, changing, or deleting elements within the scope of the present invention as set forth in the claims, and such modifications are also included within the scope of the present invention.

100: index definition module
200: thread-dense memory access pattern determination module
300: Prefetch target determination module

Claims (9)

  1. A GPU memory control apparatus comprising: an index definition module that collects threads existing in the GPU kernel and defines an index of a global memory variable using the coefficient of the thread ID for each data dimension of the unique thread ID of the collected threads and the coefficient of the loop variable;
    a thread-dense memory access pattern determination module that calculates a distance between threads from the coefficient of the thread ID for each data dimension and determines, based on the distance between the threads, whether the access is a thread-dense memory access pattern; and
    a prefetch target determination module that determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern,
    wherein the index of the global memory variable defined by the index definition module satisfies Equation (1) below:
    Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]
    (here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant), and
    wherein the thread-dense memory access pattern determination module determines a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.
  2. delete
  3. delete
  4. The GPU memory control apparatus according to claim 1, wherein the prefetch target determination module determines a thread having the thread-dense memory access pattern as a thread for which prefetch target data exists when the coefficient of the loop variable is not 0.
  5. delete
  6. delete
  7. delete
  8. delete
  9. A computer program stored in a recording medium for causing a computing means to execute:
    an index definition step of collecting threads existing in the GPU kernel and defining an index of a global memory variable using the coefficient of the thread ID for each data dimension of the unique thread ID of the collected threads and the coefficient of the loop variable;
    a thread-dense memory access pattern determination step of calculating a distance between threads from the coefficient of the thread ID for each data dimension and determining, based on the distance between the threads, whether the access is a thread-dense memory access pattern; and
    a prefetch target determination step of determining, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern,
    wherein the index of the global memory variable defined in the index definition step satisfies Equation (1) below:
    Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]
    (here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant), and
    wherein the thread-dense memory access pattern determination step determines a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.
KR1020180118835A 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control KR102142498B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020180118835A KR102142498B1 (en) 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020180118835A KR102142498B1 (en) 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control

Publications (2)

Publication Number Publication Date
KR20200039202A KR20200039202A (en) 2020-04-16
KR102142498B1 2020-08-10

Family

ID=70454886

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020180118835A KR102142498B1 (en) 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control

Country Status (1)

Country Link
KR (1) KR102142498B1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050022196A1 (en) 2000-04-04 2005-01-27 International Business Machines Corporation Controller for multiple instruction thread processors
WO2006038991A2 (en) 2004-08-17 2006-04-13 Nvidia Corporation System, apparatus and method for managing predictions of various access types to a memory associated with cache
WO2014178683A1 (en) * 2013-05-03 2014-11-06 삼성전자 주식회사 Cache control device for prefetching and prefetching method using cache control device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102100161B1 (en) 2014-02-04 2020-04-14 삼성전자주식회사 Method for caching GPU data and data processing system therefore

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050022196A1 (en) 2000-04-04 2005-01-27 International Business Machines Corporation Controller for multiple instruction thread processors
WO2006038991A2 (en) 2004-08-17 2006-04-13 Nvidia Corporation System, apparatus and method for managing predictions of various access types to a memory associated with cache
WO2014178683A1 (en) * 2013-05-03 2014-11-06 삼성전자 주식회사 Cache control device for prefetching and prefetching method using cache control device

Also Published As

Publication number Publication date
KR20200039202A (en) 2020-04-16


Legal Events

Date Code Title Description
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant