KR20180058332A

KR20180058332A - Memory control apparatus for optimizing GPU memory access through data prefetched into scratchpad memory, and control method thereof

Info

Publication number: KR20180058332A
Application number: KR1020160157130A
Authority: KR
Inventors: 한환수; 김현준; 홍성인
Original assignee: 성균관대학교산학협력단
Priority date: 2016-11-24
Filing date: 2016-11-24
Publication date: 2018-06-01
Also published as: KR101957855B1

Abstract

The present invention relates to a memory control apparatus, and a method thereof, that prefetches into scratchpad memory the data that frequently causes compulsory misses in the L1 cache of a GPU, thereby reducing the overhead added by those compulsory misses. According to the memory control apparatus and method of the present invention, data is cached so that it can be accessed and used quickly. When a large amount of data is read in advance at once, the memory bandwidth of the GPU is fully utilized, which minimizes the overhead caused by memory latency. The memory control apparatus includes a GPU kernel analysis module, a prefetch target determination module, and a prefetch execution module.

Description

The present invention relates to a memory control apparatus and a memory control method thereof, and more particularly, to a memory control apparatus that optimizes GPU memory access by prefetching data into a scratchpad, and a memory control method thereof.

Like a central processing unit (CPU), a graphics processing unit (GPU) has a multi-level cache memory structure.

FIG. 1 shows the architecture of a typical GPU.

Referring to FIG. 1, the GPU includes n SIMDs. SIMD (Single Instruction Multiple Data) is a type of parallel processor that computes several values simultaneously with a single instruction. Each GPU vendor has its own name for the SIMD: NVIDIA® calls it a Streaming Multiprocessor, and AMD® calls it a Compute Unit. The GTX 980®, a recent NVIDIA® product, has 16 such SIMDs in total; each SIMD contains 4 warp schedulers with 32 CUDA cores per scheduler, along with 8 load/store units and Special Function Units. Because these details are not closely related to the subject of this specification, a more detailed explanation is omitted. Accordingly, FIG. 1 shows only the components that are closely related to the subject of this specification.

Each SIMD may include registers, a scratchpad memory (also called shared memory), an L1 cache, and a read-only memory (ROM). The n SIMDs may share an L2 cache and a global memory (DRAM). For example, the L1 cache may have a size of 48 KB per SIMD, and the L2 cache may have a size of 2 MB. The main memory shown outside the dotted line on the right is the memory attached to the CPU, and it is connected to the SIMDs.

The memories may be classified as on-chip or off-chip memory according to where they are located. The registers, L1 cache, L2 cache, ROM, and scratchpad memory are on-chip memory. The global memory is off-chip memory.

Meanwhile, as on a CPU, the L1 cache can incur three types of misses: compulsory misses, capacity misses, and conflict misses. When a cache miss occurs, the L1 cache is designed to request the required data from the L2 cache and the global memory. The GPU stores the data needed for computation in the off-chip global memory. However, the latency of fetching data from the off-chip global memory is 10 to 100 times longer than that of on-chip memory.
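The three miss types can be told apart mechanically from an access trace. The following Python sketch uses the textbook classification, which is an assumption here and not something the specification defines: a miss is compulsory if the line has never been referenced before, a capacity miss if a fully associative LRU cache of the same size would also miss, and a conflict miss otherwise. The function name `classify_misses` and the cache geometry in the example are hypothetical.

```python
from collections import OrderedDict


def classify_misses(line_addrs, num_sets, ways):
    """Count hits and the three miss types for a set-associative LRU cache.

    line_addrs: cache-line numbers in access order (hypothetical trace).
    num_sets, ways: assumed cache geometry; capacity is num_sets * ways lines.
    """
    capacity = num_sets * ways
    sets = [OrderedDict() for _ in range(num_sets)]  # the modeled cache
    full = OrderedDict()  # fully associative LRU cache of equal capacity
    seen = set()          # every line referenced so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for line in line_addrs:
        # Update the fully associative reference cache first.
        full_hit = line in full
        if full_hit:
            full.move_to_end(line)
        else:
            if len(full) >= capacity:
                full.popitem(last=False)  # evict least recently used
            full[line] = True

        # Then look the line up in the modeled set-associative cache.
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)
            kind = "hit"
        else:
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)
            cache_set[line] = True
            if line not in seen:
                kind = "compulsory"   # first reference ever
            elif not full_hit:
                kind = "capacity"     # even full associativity would miss
            else:
                kind = "conflict"     # caused only by set mapping
        seen.add(line)
        counts[kind] += 1
    return counts


# Direct-mapped cache with 2 sets: lines 0 and 2 collide in set 0.
conflict_example = classify_misses([0, 2, 0], num_sets=2, ways=1)
# One set with 2 ways: three distinct lines overflow the whole cache.
capacity_example = classify_misses([0, 1, 2, 0], num_sets=1, ways=2)
```

Prefetching into scratchpad memory, as described in this specification, targets only the first category: data whose first reference would otherwise stall on the L2 cache or global memory.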

To mitigate this, GPUs use multiple cache levels (the L1 cache and the L2 cache) to benefit from data reuse patterns that exhibit spatial and temporal locality. However, in application patterns where compulsory misses occur frequently, the overhead of requesting data from the L2 cache and the global memory and waiting for it every time still remains.

Korean Patent Laid-Open Publication No. 10-2015-0092440

The memory control apparatus and method according to the present specification aim to reduce the overhead added by compulsory misses in the cache memory.

The problems addressed by the present specification are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the above object, a memory control apparatus according to the present specification may include: a GPU kernel analysis module that collects the memory addresses requested by a GPU kernel; a prefetch target determination module that determines, as prefetch target data, data for which compulsory misses occur in the L1 cache at or above a preset threshold; and a prefetch execution module that copies the data determined as prefetch target data into scratchpad memory.

According to an embodiment of the present specification, the prefetch target determination module may determine whether compulsory misses occur from a memory access pattern that loads data from global memory without reusing data stored in the L1 cache.

According to an embodiment of the present specification, the prefetch target determination module may determine, as prefetch target data, data for which compulsory misses occur two or more times in the L1 cache.

According to an embodiment of the present specification, the memory control apparatus may further include a prefetch data amount calculation module that calculates whether the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data') exceeds the total size of the scratchpad memory. In this case, the prefetch execution module may copy into the scratchpad memory the prefetch candidate data up to the point before its amount exceeds the total size of the scratchpad memory.

According to another embodiment of the present specification, the memory control apparatus may further include a prefetch data amount calculation module that calculates whether the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data') exceeds the value of the following formula.

(Size of scratchpad memory available per SM) / (Number of thread blocks scheduled per SM)

In this case, the prefetch execution module may copy into the scratchpad memory the prefetch candidate data up to the point before the value of the formula is exceeded.

To achieve the above object, a memory control method according to the present specification may include: (a) collecting, by a GPU kernel analysis module, the memory addresses requested by a GPU kernel; (b) determining, by a prefetch target determination module, data for which compulsory misses occur in the L1 cache at or above a preset threshold as prefetch target data; and (c) copying, by a prefetch execution module, the data determined as prefetch target data into scratchpad memory.

The memory control method according to the present specification may also take the form of a computer program that is written to perform each of the steps on a computer and is recorded on a computer-readable recording medium.

According to the memory control apparatus and method of the present specification, data is cached so that it can be accessed and used quickly, and reading a large amount of data in advance at once makes full use of the GPU's memory bandwidth, which minimizes the overhead caused by frequently occurring memory latency.

The effects described in the present specification are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 shows the architecture of a typical GPU.
FIG. 2 is a block diagram schematically showing the configuration of a memory control apparatus according to an embodiment of the present specification.
FIG. 3 is a block diagram schematically showing the configuration of a memory control apparatus according to another embodiment of the present specification.
FIG. 4 is a flowchart of a memory control method according to an embodiment of the present specification.

Hereinafter, embodiments of the present invention are described with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. As those of ordinary skill in the art will readily understand, the embodiments described below may be modified in various forms without departing from the concept and scope of the present invention. Wherever possible, identical or similar parts are denoted by the same reference numerals in the drawings.

The terminology used herein is only for describing particular embodiments and is not intended to limit the invention. The singular forms used herein include the plural forms unless the wording clearly indicates otherwise.

The term "comprising" as used herein specifies particular features, regions, integers, steps, operations, elements, and/or components, and does not exclude the presence or addition of other features, regions, integers, steps, operations, elements, components, and/or groups.

All terms used herein, including technical and scientific terms, have the same meaning as commonly understood by those of ordinary skill in the art to which this invention pertains. Terms defined in dictionaries are further interpreted as having meanings consistent with the relevant technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless so defined.

Meanwhile, the memory control apparatus and memory control method according to the present specification are directed at optimizing GPU memory access. Accordingly, the components of a GPU may be mentioned in describing them. For each GPU component, the detailed description given in the background section with reference to FIG. 1 is incorporated here, and repeated description is omitted.

The memory control apparatus and the memory control method according to the present specification are described below with reference to the drawings.

FIG. 2 is a block diagram schematically showing the configuration of a memory control apparatus according to an embodiment of the present specification.

Referring to FIG. 2, the memory control apparatus 100 according to an embodiment of the present specification includes a GPU kernel analysis module 110, a prefetch target determination module 130, and a prefetch execution module 150.

The GPU kernel analysis module 110 may collect the memory addresses requested by a GPU kernel. The GPU kernel analysis module 110 may collect these addresses using 'GPUOcelot', which was developed and released by Georgia Tech.

The prefetch target determination module 130 may determine, as prefetch target data, data for which compulsory misses occur in the L1 cache at or above a preset threshold.

According to an embodiment of the present specification, the prefetch target determination module 130 may determine whether compulsory misses occur from a memory access pattern that loads data from global memory without reusing data stored in the L1 cache. That is, the prefetch target determination module 130 determines whether a pattern arises in which no cache hits occur because the data is never reused.

According to an embodiment of the present specification, the prefetch target determination module 130 may determine, as prefetch target data, data for which compulsory misses occur two or more times in the L1 cache. However, this count is merely an example, and it will be apparent that the count may be set in various ways.
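The selection rule above can be sketched in Python as a pass over a collected address trace. This is a minimal illustration under stated assumptions, not the specification's implementation: misses are grouped per load instruction, the cache line is assumed to be 128 bytes, and the names `find_prefetch_targets`, `ld_a`, and `ld_b` are hypothetical.

```python
from collections import Counter

LINE_BYTES = 128  # assumed L1 cache-line size in bytes


def find_prefetch_targets(trace, threshold=2, line_bytes=LINE_BYTES):
    """Return the load instructions whose compulsory-miss count meets the threshold.

    trace: iterable of (instruction_id, byte_address) pairs in program order.
    A compulsory miss is counted the first time any access touches a cache line.
    """
    seen_lines = set()
    misses = Counter()
    for instr, addr in trace:
        line = addr // line_bytes
        if line not in seen_lines:
            seen_lines.add(line)
            misses[instr] += 1
    return sorted(instr for instr, n in misses.items() if n >= threshold)


# Hypothetical trace: "ld_a" streams through new lines, "ld_b" re-reads one line.
trace = [("ld_a", 0), ("ld_b", 4096), ("ld_a", 128), ("ld_b", 4100), ("ld_a", 256)]
targets = find_prefetch_targets(trace)  # threshold of 2, as in the embodiment
```

In this trace `ld_a` touches three previously unseen lines while `ld_b` touches only one, so only `ld_a` reaches the threshold of two.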

The prefetch execution module 150 may copy the data determined as prefetch target data into scratchpad memory.

Meanwhile, the control logic that copies the prefetch target data into scratchpad memory may require modifying the program at the code level, so that the data is copied into scratchpad memory and the copied data is then used in the computation. This 'modification at the source code level' refers to source-to-source transformation, a technique that rewrites source code according to specific rules at the source level. More specifically, for existing source code that is written to load from global memory (and therefore incurs compulsory misses), code that loads the data into scratchpad memory in advance is inserted ahead of it, and the code that loaded from global memory is changed to load from the preloaded scratchpad memory instead.
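The shape of the code before and after such a transformation can be modeled in plain Python. This is only a host-side model under stated assumptions, not GPU code and not the specification's transformer: `GlobalMemory`, `kernel_original`, `kernel_transformed`, and the tile size are hypothetical, and a bulk range copy stands in for the inserted prefetch into scratchpad memory.

```python
class GlobalMemory:
    """Toy stand-in for off-chip global memory that counts how it is accessed."""

    def __init__(self, data):
        self._data = list(data)
        self.element_loads = 0  # individual loads issued by the kernel body
        self.bulk_copies = 0    # prefetch-style range copies

    def load(self, index):
        self.element_loads += 1
        return self._data[index]

    def copy_range(self, start, stop):
        self.bulk_copies += 1
        return self._data[start:stop]


def kernel_original(gmem, n):
    """Original form: every element is loaded from global memory on first use."""
    total = 0
    for i in range(n):
        total += gmem.load(i)
    return total


def kernel_transformed(gmem, n, tile):
    """Transformed form: stage a tile in the scratchpad, then compute from it."""
    total = 0
    for start in range(0, n, tile):
        # Inserted code: prefetch the tile into (modeled) scratchpad memory.
        scratchpad = gmem.copy_range(start, min(start + tile, n))
        # Rewritten code: the former global loads now read the scratchpad.
        for value in scratchpad:
            total += value
    return total


g_before = GlobalMemory(range(10))
g_after = GlobalMemory(range(10))
sum_before = kernel_original(g_before, 10)
sum_after = kernel_transformed(g_after, 10, tile=4)
```

Both versions compute the same result; the difference is that the transformed version replaces ten separate first-touch loads with three bulk copies, which is the access pattern the prefetch is meant to produce.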

FIG. 3 is a block diagram schematically showing the configuration of a memory control apparatus according to another embodiment of the present specification.

According to another embodiment of the present specification, the memory control apparatus 100 may further include a prefetch data amount calculation module 140.

The prefetch data amount calculation module 140 may calculate the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data'); that is, it may sum the prefetch candidate data to obtain the total data size.

According to an embodiment of the present specification, the prefetch data amount calculation module 140 may calculate whether the amount of the prefetch candidate data exceeds the total size of the scratchpad memory. In this case, the prefetch execution module 150 may copy into the scratchpad memory the prefetch candidate data up to the point before its amount exceeds the total size of the scratchpad memory.
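The capacity check in this embodiment amounts to accumulating candidates in order and stopping just before the total would exceed the scratchpad size. A minimal Python sketch follows; the function name `select_within_capacity`, the 16 KB candidate sizes, and the 48 KB scratchpad size are assumptions chosen for illustration.

```python
def select_within_capacity(candidate_sizes, capacity_bytes):
    """Keep candidates in order until adding the next one would exceed capacity.

    candidate_sizes: sizes in bytes of the prefetch candidate data, in order.
    Returns the sizes selected for copying and the number of bytes they use.
    """
    selected, used = [], 0
    for size in candidate_sizes:
        if used + size > capacity_bytes:
            break  # stop before the scratchpad size is exceeded
        selected.append(size)
        used += size
    return selected, used


# Four hypothetical 16 KB candidates against an assumed 48 KB scratchpad.
selected, used = select_within_capacity([16384] * 4, 48 * 1024)
```

Here the first three candidates fill the scratchpad exactly, so the fourth is left out.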

According to another embodiment of the present specification, the prefetch data amount calculation module 140 may calculate whether the amount of the prefetch candidate data exceeds the value of the following formula.

<Equation 1>

(Size of scratchpad memory available per SIMD) / (Number of thread blocks scheduled per SIMD)

In this case, the prefetch execution module 150 may copy into the scratchpad memory the prefetch candidate data up to the point before the value of the formula is exceeded.
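Equation 1 is a single division, sketched below in Python. The function name `per_block_budget`, the use of integer division, and the example figures (48 KB of scratchpad memory and 8 scheduled thread blocks per SIMD) are assumptions for illustration only.

```python
def per_block_budget(scratchpad_bytes_per_simd, blocks_per_simd):
    """Equation 1: scratchpad bytes available to each scheduled thread block."""
    if blocks_per_simd <= 0:
        raise ValueError("blocks_per_simd must be positive")
    # Integer division is an assumption: a block cannot use a fractional byte.
    return scratchpad_bytes_per_simd // blocks_per_simd


# Assumed figures: 48 KB of scratchpad per SIMD shared by 8 thread blocks.
budget = per_block_budget(48 * 1024, 8)
```

With these figures each thread block can prefetch up to 6144 bytes without reducing the number of thread blocks that fit on the SIMD.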

The prefetch data amount calculation module 140 serves to calculate an appropriate amount of data to prefetch into scratchpad memory. The size of the prefetched data is preferably kept within the total size of the scratchpad memory. For example, for the instructions that cause compulsory misses, the amount of data requested by one instruction is added to the amount requested by the next instruction, and the prefetch amount is set so that the sum does not exceed the size of the scratchpad memory.

In addition, the amount of data to prefetch into scratchpad memory is preferably set, taking into account the number of warps assigned to the SIMD, so that it does not harm parallelism. While traversing the subsequent instructions, prefetch candidate data can keep being accumulated until the number of thread blocks that can be actively assigned would become limited. On a GPU, when contention arises over a resource shared by threads, scheduling can be limited to fewer threads than the number specified in the code, which harms parallelism. Scratchpad memory and registers are typical examples of such shared resources. Expressed as a formula, this gives Equation 1 above. Therefore, if the data to be prefetched exceeds the value of Equation 1 while traversing the instructions, it is preferable to prefetch the data accumulated up to that point. From then on, the data to be prefetched can be calculated anew.
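The traversal described above can be sketched in Python as grouping consecutive requests into batches that each stay within the Equation 1 value. The function name `batch_prefetches`, the instruction labels, and the 6144-byte budget are hypothetical, and rejecting a single request that is larger than the budget is a choice made for this sketch; the specification does not say how that case is handled.

```python
def batch_prefetches(requests, budget_bytes):
    """Group consecutive requests into prefetch batches that fit the budget.

    requests: list of (instruction_id, size_in_bytes) in program order.
    budget_bytes: the per-thread-block limit, e.g. the value of Equation 1.
    """
    batches, current, used = [], [], 0
    for instr, size in requests:
        if size > budget_bytes:
            # Assumption of this sketch: oversized requests are rejected.
            raise ValueError(f"{instr} alone exceeds the budget")
        if used + size > budget_bytes:
            batches.append(current)   # prefetch what was accumulated so far
            current, used = [], 0     # then start calculating anew
        current.append(instr)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical per-instruction request sizes against a 6144-byte budget.
requests = [("ld0", 2048), ("ld1", 2048), ("ld2", 4096), ("ld3", 1024)]
batches = batch_prefetches(requests, budget_bytes=6144)
```

Adding `ld2` to the first batch would bring it to 8192 bytes, so the first batch closes with `ld0` and `ld1`, and accumulation restarts from `ld2`.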

The memory control method according to the present specification is described below. The method is described with reference to the memory control apparatus described above, so repeated description of each component of the memory control apparatus is omitted.

FIG. 4 is a flowchart of a memory control method according to an embodiment of the present specification.

In step S100, the GPU kernel analysis module 110 may collect the memory addresses requested by a GPU kernel.

In step S110, the prefetch target determination module 130 may determine, as prefetch target data, data for which compulsory misses occur in the L1 cache at or above a preset threshold. Here, the prefetch target determination module 130 may determine whether compulsory misses occur from a memory access pattern that loads data from global memory without reusing data stored in the L1 cache. The preset threshold may be two or more occurrences.

The memory control method according to the present specification may further include step S120.

In step S120, the prefetch data amount calculation module 140 may calculate the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data').

According to an embodiment of the present specification, the prefetch data amount calculation module 140 may calculate whether the amount of the prefetch candidate data exceeds the total size of the scratchpad memory.

According to another embodiment of the present specification, the prefetch data amount calculation module 140 may calculate whether the amount of the prefetch candidate data exceeds the value of Equation 1.

In step S130, the prefetch execution module 150 may copy the data determined as prefetch target data into scratchpad memory.

According to an embodiment of the present specification, the prefetch execution module 150 may copy into the scratchpad memory the prefetch candidate data up to the point before its amount exceeds the total size of the scratchpad memory.

According to another embodiment of the present specification, the prefetch execution module 150 may copy into the scratchpad memory the prefetch candidate data up to the point before the value of Equation 1 is exceeded.

To execute the control logic described above, the GPU kernel analysis module, the prefetch target determination module, and the prefetch execution module may include a processor, an application-specific integrated circuit (ASIC), another chipset, a logic circuit, a register, a communication modem, a data processing device, or the like known in the art to which the invention pertains. When the control logic is implemented in software, the GPU kernel analysis module, the prefetch target determination module, and the prefetch execution module may be implemented as a set of program modules. In that case, the program modules may be stored in a memory device and executed by a processor. The GPU kernel analysis module, the prefetch target determination module, and the prefetch execution module may then be function-level modules at the source code level of the program.

Meanwhile, for a pattern in which compulsory misses occur repeatedly inside a loop, the generally known performance gain of loop unrolling can be expected, so loop unrolling may be applied and the code modified to copy the data into scratchpad memory and then read and use it from there. Loop unrolling is a technique that expands the code executed inside a loop in order to obtain optimization gains at the compiler level. In this specification, loop unrolling means that when the source code that accesses the data to be written to scratchpad memory is located inside a loop, the loop is unrolled and the source-level code transformation is applied to the unrolled loop.

The embodiments described in this specification and the accompanying drawings merely illustrate part of the technical idea included in the present invention. The embodiments disclosed herein are therefore intended to explain, not to limit, the technical idea of the present invention, and it is apparent that the scope of that technical idea is not limited by these embodiments. All modifications and specific embodiments that those skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings of the present invention should be construed as falling within the scope of the present invention.

100: Memory control apparatus
110: GPU kernel analysis module
130: Prefetch target determination module
150: Prefetch execution module

Claims

A memory control apparatus comprising:
a GPU kernel analysis module that collects the memory addresses requested by a GPU kernel;
a prefetch target determination module that determines, as prefetch target data, data for which compulsory misses occur in the L1 cache at or above a preset threshold; and
a prefetch execution module that copies the data determined as prefetch target data into scratchpad memory.

The memory control apparatus according to claim 1,
wherein the prefetch target determination module determines whether compulsory misses occur from a memory access pattern that loads data from global memory without reusing data stored in the L1 cache.

The memory control apparatus according to claim 1,
wherein the prefetch target determination module determines, as prefetch target data, data for which compulsory misses occur two or more times in the L1 cache.

The memory control apparatus according to claim 1,
further comprising a prefetch data amount calculation module that calculates whether the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data') exceeds the total size of the scratchpad memory,
wherein the prefetch execution module copies into the scratchpad memory the prefetch candidate data up to the point before its amount exceeds the total size of the scratchpad memory.

The memory control apparatus according to claim 1,
further comprising a prefetch data amount calculation module that calculates whether the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data') exceeds the value of the following formula,
(Size of scratchpad memory available per SM) / (Number of thread blocks scheduled per SM)
wherein the prefetch execution module copies into the scratchpad memory the prefetch candidate data up to the point before the value of the formula is exceeded.

A memory control method comprising:
(a) collecting, by a GPU kernel analysis module, the memory addresses requested by a GPU kernel;
(b) determining, by a prefetch target determination module, data for which compulsory misses occur in the L1 cache at or above a preset threshold as prefetch target data; and
(c) copying, by a prefetch execution module, the data determined as prefetch target data into scratchpad memory.

The method of claim 6,
wherein the prefetch target determination module determines whether compulsory misses occur from a memory access pattern that loads data from global memory without reusing data stored in the L1 cache.

The method of claim 6,
wherein the prefetch target determination module determines, as prefetch target data, data for which compulsory misses occur two or more times in the L1 cache.

The method of claim 6,
further comprising (c-1) calculating, by a prefetch data amount calculation module, whether the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data') exceeds the total size of the scratchpad memory,
wherein step (d) is a step of copying into the scratchpad memory the prefetch candidate data up to the point before its amount exceeds the total size of the scratchpad memory.

The method of claim 6,
further comprising (c-1) calculating, by a prefetch data amount calculation module, whether the amount of data determined as the prefetch target data (hereinafter, 'prefetch candidate data') exceeds the value of the following formula,
(Size of scratchpad memory available per SM) / (Number of thread blocks scheduled per SM)
wherein in step (d) the prefetch execution module copies into the scratchpad memory the prefetch candidate data up to the point before the value of the formula is exceeded.

A computer program recorded on a computer-readable recording medium and written to cause a computer to perform each step of the memory control method according to any one of claims 6 to 10.