KR101639943B1

KR101639943B1 - Shared memory control method for operating the shared memory of a general-purpose graphics processor as a cache, and general-purpose graphics processor using the same

Info

Publication number: KR101639943B1
Application number: KR1020150034583A
Authority: KR
Inventors: 한환수; 고요한; 김현준
Original assignee: 성균관대학교산학협력단
Priority date: 2015-03-12
Filing date: 2015-03-12
Publication date: 2016-07-15

Abstract

A general-purpose graphics processor according to embodiments of the present invention includes a plurality of multiprocessors, an L2 cache, and a global GPU memory. Each of the multiprocessors may include: a plurality of core units; a configurable memory that is divided into an L1 cache and a shared memory at a predetermined ratio; and a local memory controller that stores victim blocks evicted from the L1 cache in the shared memory.

Description

Shared memory control method for operating a shared memory of a general-purpose graphics processor as a cache, and general-purpose graphics processor using the same

The present invention relates to a graphics processor and, more particularly, to the shared memory of a graphics processor.

Graphics processors have recently been evolving into general-purpose graphics processors (General Purpose Graphic Processor Unit, GPGPU) built on architectures that cluster large numbers of stream processor cores, exploiting the fact that a large portion of known graphics-related operations can be pipelined and programmed as repetitive stream operations.

General-purpose graphics processors are of course best suited to graphics applications, but large gains in computation speed are also expected in a variety of scientific and engineering fields that can be programmed to have pipeline and stream-operation characteristics, such as physics simulation, computational biology, cryptography, image processing, particle and nuclear physics simulation, and astronomical simulation.

Recently released general-purpose graphics processors provide a shared memory that stores global variables or data commonly used by the stream processor cores and that multiple stream processor cores can access concurrently.

For example, NVIDIA graphics processors physically provide 64 kB of memory, which the programmer can use as cache or shared memory according to a predetermined ratio such as 48 kB:16 kB, 32 kB:32 kB, or 16 kB:48 kB.

However, few applications can make efficient use of shared memory. Indeed, among the examples in the software development kit provided by NVIDIA, only a small share use the 64 kB memory as shared memory.

This means that tens of kB of memory that runs at the same speed as the processor cores, with a latency incomparably shorter than that of main memory, is being wasted.

In other words, a method is needed that allows memory capacity configured in hardware as shared memory to be used like a cache at run-time when a program requests it.

An object of the present invention is to provide a shared memory control method for operating the shared memory of a general-purpose graphics processor as a cache, and a general-purpose graphics processor using the same.

Another object of the present invention is to provide a shared memory control method for performing a cache function in the shared memory of a general-purpose graphics processor at run-time when a program requests it, even if that memory was configured in advance as shared memory rather than cache, and a general-purpose graphics processor using the same.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

A general-purpose graphics processor according to one aspect of the present invention includes a plurality of multiprocessors, an L2 cache, and a global GPU memory, wherein each of the multiprocessors may include: a plurality of core units; a configurable memory that is divided and configured into an L1 cache and a shared memory at a predetermined capacity ratio; and a local memory controller that stores a victim block evicted from the L1 cache in the shared memory while a kernel application running on the general-purpose graphics processor is not using the shared memory.

According to one embodiment, the local memory controller may operate to configure the shared memory with the same mapping scheme and line size as the L1 cache.

According to one embodiment, when one of the core units issues a load/store instruction for data access, the local memory controller may operate to search the L1 cache; if a cache hit occurs in the L1 cache, to access the data in the L1 cache and complete the load/store instruction; and if a cache miss occurs in the L1 cache, to search the shared memory.

According to one embodiment, when a cache miss occurs in the L1 cache, the local memory controller may operate as follows. If a line corresponding to the set index of the memory address to be accessed is present in the L1 cache, it searches the shared memory. If no line corresponding to the set index of the memory address to be accessed is present in the L1 cache, it selects a victim block in the L1 cache according to a predetermined cache replacement policy, copies the selected victim block from the L1 cache to the shared memory, and then searches the shared memory.

If the shared memory has no free space into which the victim block can be copied, the local memory controller may operate to select a secondary victim block in the shared memory according to a predetermined cache replacement policy applied to the shared memory, copy the selected secondary victim block from the shared memory to the L2 cache or the global GPU memory, and then store the victim block evicted from the L1 cache in the space where the secondary victim block had been stored.

According to one embodiment, depending on the result of searching the shared memory, the local memory controller may operate as follows. If a cache hit occurs in the shared memory, it accesses the data in the shared memory and completes the load/store instruction. If a cache miss also occurs in the shared memory, it stores information on the outstanding cache miss in a Miss Status Holding Register (MSHR), accesses the data in the L2 cache or the global GPU memory, stores the accessed data in the L1 cache, updates the outstanding-miss information in the MSHR, and completes the load/store instruction.

According to another aspect of the present invention, a shared memory control method is provided for a general-purpose graphics processor including a plurality of multiprocessors, an L2 cache, and a global GPU memory, in which each of the multiprocessors includes a plurality of core units, a configurable memory, and a local memory controller, and the configurable memory is divided and configured into an L1 cache and a shared memory at a predetermined ratio. The method may include the following steps, performed by the local memory controller:

(a) searching the L1 cache when one of the core units issues a load/store instruction for data access;

(b) if a cache hit occurs in the L1 cache in step (a), accessing the data in the L1 cache and completing the load/store instruction;

(c) if a cache miss occurs in the L1 cache in step (a) but a line corresponding to the set index of the memory address to be accessed is present in the L1 cache, searching the shared memory;

(d) if a cache miss occurs in the L1 cache in step (a) and no line corresponding to the set index of the memory address to be accessed is present in the L1 cache, selecting a victim block in the L1 cache according to a predetermined cache replacement policy, copying the selected victim block from the L1 cache to the shared memory, and searching the shared memory;

(e) if a cache hit occurs in the shared memory in step (c) or step (d), accessing the data in the shared memory and completing the load/store instruction; and

(f) if a cache miss occurs in the shared memory in step (c) or step (d), storing information on the outstanding cache miss in the MSHR, accessing the data in the L2 cache or the global GPU memory, storing the accessed data in the L1 cache, updating the outstanding-miss information, and completing the load/store instruction.

According to one embodiment, the shared memory control method of the general-purpose graphics processor may further include, prior to step (a), the local memory controller configuring the configurable memory as an L1 cache of the maximum configurable size and a shared memory of the minimum configurable size if the kernel application does not use the shared memory.

According to one embodiment, the step of configuring the configurable memory may include the local memory controller configuring the shared memory with the same mapping scheme and line size as the L1 cache.

According to one embodiment, step (d) may include the local memory controller, if the shared memory has no free block into which the victim block can be copied, selecting a secondary victim block in the shared memory according to a predetermined cache replacement policy applied to the shared memory, copying the selected secondary victim block from the shared memory to the L2 cache or the global GPU memory, and then storing the victim block evicted from the L1 cache in the space where the secondary victim block had been stored.

According to yet another aspect of the present invention, a shared memory control method for a general-purpose graphics processor including a plurality of multiprocessors, an L2 cache, and a global GPU memory may include the following steps, performed by the local memory controller:

(a') searching the L1 cache when one of the core units issues a load/store instruction for data access;

(b') if a cache hit occurs in the L1 cache in step (a'), accessing the data in the L1 cache;

(c') if a cache miss occurs in the L1 cache in step (a'), searching the shared memory; and

(d') if a cache miss occurs in the shared memory in step (c'), accessing the data in the L2 cache or the global GPU memory.
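The lookup order of steps (a') to (d') can be sketched as a three-level probe. This is a minimal Python illustration rather than the claimed hardware: the function name `load`, the plain dictionaries keyed by address, and the level labels are all assumptions made for the example.

```python
def load(addr, l1, shared, lower):
    """Return (data, level) by probing L1, then shared memory, then L2/global.

    l1, shared, lower: hypothetical dict models mapping address -> data.
    """
    if addr in l1:                    # steps (a'), (b'): L1 hit
        return l1[addr], "L1"
    if addr in shared:                # step (c'): L1 miss, probe shared memory
        return shared[addr], "shared"
    return lower[addr], "L2/global"   # step (d'): miss in both levels


l1 = {0x100: "a"}
shared = {0x200: "b"}
lower = {0x100: "a", 0x200: "b", 0x300: "c"}
for address in (0x100, 0x200, 0x300):
    print(hex(address), load(address, l1, shared, lower))
```

The sketch only captures the order in which the levels are consulted; it does not model victim selection or the MSHR.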

According to the shared memory control method of the present invention and the general-purpose graphics processor using the same, the shared memory of a general-purpose graphics processor can be operated as a cache.

According to the shared memory control method of the present invention and the general-purpose graphics processor using the same, the shared memory of a general-purpose graphics processor can perform a cache function at run-time when a program requests it, even if that memory was configured in advance as shared memory rather than L1 cache.

According to the shared memory control method of the present invention and the general-purpose graphics processor using the same, the shared memory of a general-purpose graphics processor is used at run-time as an auxiliary cache for the L1 cache, which minimizes the time spent searching as far as the global memory connected to the general-purpose graphics processor or the CPU when a cache miss occurs in the L1 cache.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a block diagram illustrating a general-purpose graphics processor having a shared memory according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a shared memory control method according to an embodiment of the present invention.

For the embodiments of the present invention disclosed herein, specific structural and functional descriptions are given only for the purpose of describing those embodiments. The embodiments of the present invention may be practiced in various forms and should not be construed as limited to the embodiments described herein.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. The same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 1 is a block diagram schematically illustrating a general-purpose graphics processor having a shared memory according to an embodiment of the present invention.

Referring to FIG. 1, the general-purpose graphics processor 10 broadly includes a plurality of multiprocessors 110, an L2 cache 120, a global memory controller 130, a global GPU memory 140, and a thread manager 150.

When instructions and data are received from the central processing unit (CPU) of a host device (not shown), the general-purpose graphics processor 10 distributes the received instructions to the respective multiprocessors 110 and stores the received data in the global GPU memory 140. When the multiprocessors 110 have processed the instructions in parallel and produced operation results, it merges those results and sends them back to the CPU or the host memory.

To this end, the thread manager 150 distributes the instructions received from the CPU of the host device to the respective multiprocessors 110, and the global memory controller 130 may store the received data in the global GPU memory 140. The data stored in the global GPU memory 140 may be, for example, texture or polygon data.

While performing operations according to the distributed instructions, each of the multiprocessors 110 first accesses the L2 cache 120 when a load/store of data is required, and if a cache miss occurs in the L2 cache 120, it may access the global GPU memory 140 through the global memory controller 130.

Each of the multiprocessors 110 may include a plurality of core units 111, 112, and 113, a configurable memory 114, a texture cache 117, and a local memory controller 118.

The core units 111, 112, and 113 each process the instructions distributed by the thread manager 150.

The local memory controller 118 loads and stores, in the configurable memory 114 or the texture cache 117, the data that the core units 111, 112, and 113 process according to the instructions. If the data is not found in the configurable memory 114 or the texture cache 117, it may search for the data outside the multiprocessor 110, for example in the L2 cache 120 or the global GPU memory 140.

As described above, the configurable memory 114 may operate divided into an L1 cache 115 and a shared memory 116 at a predetermined ratio according to a user setting, a program setting, or a default setting. For example, the 64 kB configurable memory 114 may be configured as a 16 kB L1 cache 115 and a 48 kB shared memory 116, as a 32 kB L1 cache 115 and a 32 kB shared memory 116, or as a 48 kB L1 cache 115 and a 16 kB shared memory 116.

If the kernel application to be executed on the general-purpose graphics processor 10 does not use shared memory, the space of the shared memory 116, which is configured as at least 16 kB, is effectively wasted.

Accordingly, if the kernel application does not use the shared memory, the local memory controller 118 may configure the configurable memory 114 as an L1 cache 115 of the maximum configurable size and a shared memory 116 of the minimum configurable size.

Specifically, the local memory controller 118 identifies at run-time whether the kernel application uses the shared memory, either by analyzing the instructions distributed to the multiprocessor 110 or from the state of a variable declared regarding shared memory use. If the kernel application does not use the shared memory, it may configure the configurable memory 114 at run-time as an L1 cache 115 of the maximum configurable size and a shared memory 116 of the minimum configurable size.
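The configuration rule in the preceding paragraphs can be sketched in a few lines of Python. The split table comes from the 64 kB example in the text; the function name `choose_split`, the boolean flag, and the choice to leave the current split untouched when the kernel does use shared memory are assumptions, since the text only specifies the case where shared memory is unused.

```python
# (L1 kB, shared kB) splits of the 64 kB configurable memory named in the text.
SPLITS_KB = [(16, 48), (32, 32), (48, 16)]


def choose_split(kernel_uses_shared_memory, current_split):
    """Pick the L1/shared split for a kernel (hypothetical helper).

    If the kernel does not use shared memory, select the largest L1 cache
    and the smallest shared memory. Otherwise keep the current split; that
    branch is an assumption, as the text does not cover it.
    """
    if kernel_uses_shared_memory:
        return current_split
    return max(SPLITS_KB, key=lambda split: split[0])


print(choose_split(False, (16, 48)))  # largest-L1 split from the table
print(choose_split(True, (16, 48)))   # unchanged
```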

It should be noted that, in the shared memory control technique for the general-purpose graphics processor 10 of the present invention, the local memory controller 118 is not intended to use the configured shared memory 116 in exactly the same way as the L1 cache 115, that is, to use the entire memory space of the configurable memory 114 as the L1 cache 115.

Nor, in the shared memory control technique for the general-purpose graphics processor 10 of the present invention, is the local memory controller 118 intended to use the configured shared memory 116 as a buffer cache for bridging a difference in access times.

Rather, in the shared memory control technique for the general-purpose graphics processor 10 of the present invention, the local memory controller 118 is intended to use the memory space of the allocated shared memory 116 as an auxiliary cache for the L1 cache 115, even in the case where the running kernel application often uses shared memory but is temporarily not using it.

In particular, the local memory controller 118 stores victim blocks evicted from the L1 cache 115 in the shared memory 116.

To this end, the local memory controller 118 configures the shared memory 116 with the same mapping scheme as the L1 cache 115. If a line of the L1 cache 115 is 128 bytes, the L1 cache 115 may be configured as a 4-way, 32-set associative cache when it is set to 16 kB, and as a 6-way, 64-set associative cache when it is set to 48 kB.

If the kernel application does not use the shared memory 116, the L1 cache 115 is preferably set to its maximum size of 48 kB, so the shared memory 116, left with 16 kB, may be configured as a 4-way, 32-set associative cache with the same 128-byte line size as the L1 cache 115.
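The set counts quoted above follow from size = sets × ways × line size. The short Python check below reproduces that arithmetic and shows how an address would split into tag, set index, and offset under such a geometry. The helper names `num_sets` and `split_address` and the simple modulo indexing are assumptions for illustration; the text does not specify the actual index function.

```python
LINE_BYTES = 128  # line size given in the text


def num_sets(size_kb, ways, line_bytes=LINE_BYTES):
    """Number of sets for a set-associative cache of the given size."""
    lines = size_kb * 1024 // line_bytes
    if lines % ways != 0:
        raise ValueError("size is not a whole number of sets")
    return lines // ways


def split_address(addr, sets, line_bytes=LINE_BYTES):
    """Split a byte address into (tag, set_index, offset), modulo indexing."""
    offset = addr % line_bytes
    set_index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, set_index, offset


print(num_sets(16, 4))   # 16 kB, 4-way
print(num_sets(48, 6))   # 48 kB, 6-way
print(split_address(0x1234, num_sets(16, 4)))
```

With 128-byte lines, 16 kB holds 128 lines (4 ways × 32 sets) and 48 kB holds 384 lines (6 ways × 64 sets), matching the configurations in the text.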

When one of the core units 111, 112, and 113 issues a load/store instruction for data access, the local memory controller 118 first searches the L1 cache 115 in the configurable memory 114. If a cache hit occurs in the L1 cache 115, it accesses the data and completes the load/store instruction.

However, if a cache miss occurs in the L1 cache 115 even though a line corresponding to the set index of the memory address to be accessed is present in the L1 cache 115, the local memory controller 118 goes on to search the shared memory 116.

Specifically, even though a cache miss has occurred in the L1 cache 115, if a line corresponding to the set index of the memory address to be accessed was present in the L1 cache 115, the local memory controller 118 may additionally search the shared memory 116 instead of accessing the L2 cache 120.

This is because, as described below, the local memory controller 118 stores in the shared memory 116 only victim blocks that were previously held in the L1 cache 115 and then evicted. Even when a cache miss occurs in the L1 cache 115, it is therefore likely that the shared memory 116 still holds other lines of the same set index (data with the same set index as the memory address the core unit is looking for in the L1 cache).

Here, a block is the unit of data exchange between the L1 cache 115 and the shared memory 116, so the block size may be equal to the line size.

If, on the other hand, the cache miss occurred because no line corresponding to the set index of the memory address to be accessed is present in the L1 cache 115, the local memory controller 118 selects a victim block in the L1 cache 115 according to a predetermined cache replacement policy, for example LRU (Least Recently Used), NRU (Not Recently Used), or RRIP (Re-Reference Interval Prediction), and copies the selected victim block from the L1 cache 115 to the shared memory 116.

At this point, if the shared memory 116 has no free block into which the victim block can be copied, the local memory controller 118 may select a secondary victim block in the shared memory 116 according to a predetermined cache replacement policy applied to the shared memory 116, for example LRU, NRU, or RRIP, copy the selected secondary victim block from the shared memory 116 to the L2 cache 120, and then store the victim block evicted from the L1 cache 115 in the space where the secondary victim block had been stored.
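The victim hand-off just described can be sketched as follows. This is a simplified Python model under stated assumptions: the shared memory is treated as a single fully associative LRU store (the text describes a set-associative organization and also allows NRU or RRIP), and the names `store_victim`, `capacity`, and `write_back` are hypothetical.

```python
from collections import OrderedDict


def store_victim(shared, capacity, tag, data, write_back):
    """Place an L1 victim block into the shared-memory victim store.

    shared:     OrderedDict used as an LRU list (oldest entry first).
    capacity:   number of blocks the store can hold (assumed parameter).
    write_back: callable(tag, data) that copies a secondary victim to L2.
    """
    if tag not in shared and len(shared) >= capacity:
        # No free block: evict the least recently used entry as the
        # secondary victim and copy it to the lower level first.
        old_tag, old_data = shared.popitem(last=False)
        write_back(old_tag, old_data)
    shared[tag] = data
    shared.move_to_end(tag)  # mark as most recently used


shared = OrderedDict()
l2 = {}
for tag, data in (("a", 1), ("b", 2), ("c", 3)):
    store_victim(shared, 2, tag, data, l2.__setitem__)
print(list(shared), l2)
```

In this run the store holds two blocks, so inserting the third victim pushes the oldest one out to the `l2` dictionary before the new victim takes its place.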

Next, even though a cache miss occurred in the L1 cache 115 and no line corresponding to the set index of the memory address to be accessed was present in the L1 cache 115, a victim block evicted from the L1 cache 115 may still be stored in the shared memory 116, so the local memory controller 118 may additionally search the shared memory 116.

If the search of the shared memory 116 results in a cache hit, the local memory controller 118 accesses the data and completes the load/store instruction.

Only if a cache miss also occurs in the shared memory 116 does the local memory controller 118 store information on the outstanding miss in the MSHR (Miss Status Holding Register).

Once the outstanding-miss information is stored in the MSHR, the local memory controller 118 uses it to access the required data in the lower-level L2 cache 120 or the global GPU memory 140, both outside the multiprocessor 110, and thereby services the load/store instruction. The local memory controller 118 may store the data accessed from the lower-level memory in the L1 cache 115 and update the outstanding-miss information.
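The MSHR bookkeeping described above might look like the following Python sketch. The class layout and the method names `record_miss` and `fill` are hypothetical, and merging several requesters onto one pending line is a common MSHR behavior assumed here rather than something the text states.

```python
class MSHR:
    """Toy miss status holding register: line address -> waiting requesters."""

    def __init__(self):
        self.pending = {}

    def record_miss(self, line_addr, requester):
        """Record an outstanding miss.

        Returns True when this is the first miss on the line, meaning a
        request to L2 or global memory has to be issued.
        """
        is_new = line_addr not in self.pending
        self.pending.setdefault(line_addr, []).append(requester)
        return is_new

    def fill(self, line_addr, data, l1):
        """Install returned data in L1 and clear the outstanding entry."""
        l1[line_addr] = data
        return self.pending.pop(line_addr, [])


mshr = MSHR()
l1 = {}
print(mshr.record_miss(0x80, "warp0"))
print(mshr.record_miss(0x80, "warp1"))
print(mshr.fill(0x80, "line-data", l1), l1)
```

In the scheme of the text, `record_miss` would be reached only after both the L1 cache and the shared memory have missed.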

In a conventional cache control technique, a cache hit in the L1 cache is serviced immediately, but on a cache miss in the L1 cache the outstanding-miss information is written straight to the MSHR, and time is then spent accessing the data in the lower-level L2 cache or main memory.

In the shared memory control technique of the present invention, a cache hit in the L1 cache is serviced immediately, just as in the conventional technique. When a cache miss occurs in the L1 cache 115, however, extra time is incurred. If a line corresponding to the set index of the memory address to be accessed is present in the L1 cache 115, the added cost is the time to search the shared memory 116. If no such line is present in the L1 cache 115, the added cost is the time to search the shared memory 116 plus the time to perform the cache replacement from the L1 cache 115 to the shared memory 116.

However, if data that missed in the L1 cache 115 hits in the shared memory 116, the data does not have to be accessed in the lower-level L2 cache or the main memory, so the overall time cost of an access on a cache miss can be greatly reduced.
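The trade-off between the added search and replacement costs and the avoided lower-level access can be illustrated with a simple expected-cost model. The following is a minimal sketch, not part of the specification: the function names, the parameters, and all latency figures are hypothetical, and the model assumes that every L1 miss pays the shared memory search cost and that only a fraction of misses pays the copy cost.

```python
def miss_cost_conventional(lower_latency):
    """Cycles spent on an L1 miss when the lower level is always accessed."""
    return lower_latency


def miss_cost_with_aux(search, copy, lower_latency, shared_hit_rate, evict_fraction):
    """Expected cycles per L1 miss when the shared memory is searched first.

    search          -- cycles to search the shared memory (assumed value)
    copy            -- cycles to copy a victim block from L1 to shared memory
    lower_latency   -- cycles to access the L2 cache or global GPU memory
    shared_hit_rate -- fraction of L1 misses that hit in the shared memory
    evict_fraction  -- fraction of L1 misses that also trigger the copy
    """
    return search + evict_fraction * copy + (1.0 - shared_hit_rate) * lower_latency


def break_even_hit_rate(search, copy, lower_latency, evict_fraction):
    """Shared memory hit rate at which both approaches cost the same."""
    return (search + evict_fraction * copy) / lower_latency


# Hypothetical latencies, chosen only to show the shape of the trade-off.
print(miss_cost_conventional(400))
print(miss_cost_with_aux(20, 10, 400, shared_hit_rate=0.5, evict_fraction=1.0))
print(break_even_hit_rate(20, 10, 400, evict_fraction=1.0))
```

Under this model the auxiliary cache pays off whenever the shared memory hit rate exceeds the break-even value, which is small when the lower-level latency dominates the search and copy costs.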

In addition, according to the shared memory control technique of the present invention, the entire capacity of the configurable memory 114 is not assigned to the L1 cache 115; the shared memory 116 setting is kept as it is. Therefore, if a kernel application wants to use the shared memory 116 as memory shared by the core units 111, 112, and 113, the shared memory 116 can immediately stop serving as an auxiliary cache of the L1 cache 115 and perform its original role as a shared memory.

Even if the shared memory 116 stops serving as the auxiliary cache of the L1 cache 115 because the kernel application needs it, the local memory control unit 118 merely omits the step of searching the shared memory 116 when a cache miss occurs in the L1 cache 115. It immediately stores the unprocessed cache miss information in the MSHR and accesses the data in the external L2 cache 120 or the global GPU memory 140.

Therefore, according to the shared memory control technique of the present invention, the shared memory of the general-purpose graphics processor can perform a cache function at runtime, and the time spent searching as far as the global memory connected to the general-purpose graphics processor or the CPU on an L1 cache miss can be minimized.

Table 1 below shows the performance improvement measured when the shared memory is used as an auxiliary cache of the L1 cache, in an example in which an application that multiplies matrices of a given size runs on a general-purpose graphics processor.

Matrix size | Time required, existing technique (cycles) | Time required, technique of the present invention (cycles) | Reduction rate
96×96 | 34955 | 32734 | 6.4%
128×128 | 87495 | 67837 | 22.5%
160×160 | 104521 | 88763 | 15.1%
256×256 | 635045 | 545608 | 14.1%
320×320 | 791043 | 454586 | 42.5%
400×400 | 1658483 | 796374 | 52%
512×512 | 4198154 | 4811830 | 7.4%

Referring to Table 1, compared with the existing cache control technique, in which the shared memory goes unused and only the L1 cache is used, using the shared memory as an auxiliary cache of the L1 cache according to the shared memory control technique of the present invention shortened the time required by as little as 6.4% and as much as 52%, depending on the matrix size.
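The two extremes quoted above follow from the cycle counts in Table 1 if the reduction rate is taken as the saved cycles divided by the cycles of the existing technique. The sketch below assumes that definition; the function name `reduction_rate` is illustrative and does not appear in the specification.

```python
def reduction_rate(existing_cycles, new_cycles):
    """Fraction of cycles saved relative to the existing technique."""
    return (existing_cycles - new_cycles) / existing_cycles


# Cycle counts taken from the 96x96 and 400x400 rows of Table 1.
smallest = reduction_rate(34955, 32734)
largest = reduction_rate(1658483, 796374)

print(f"96x96:   {smallest:.1%}")
print(f"400x400: {largest:.1%}")
```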

Because graphics applications and scientific simulation applications consist of a large number of matrix operations or vector operations, the shared memory control technique of the present invention can yield a substantial overall performance improvement.

FIG. 2 is a flowchart illustrating a shared memory control method according to an embodiment of the present invention.

A shared memory control method according to an embodiment of the present invention applies to a general-purpose graphics processor 10 that includes a plurality of multiprocessors 110, an L2 cache 120, and a global GPU memory 140, where each of the multiprocessors 110 includes a plurality of core units 111, 112, and 113, a configurable memory 114, and a local memory control unit 118, and the configurable memory 114 operates divided into an L1 cache 115 and a shared memory 116 at a predetermined ratio. The method may begin with step S21, in which, if the kernel application does not use the shared memory, the local memory control unit 118 sets the configurable memory 114 to an L1 cache 115 of the maximum configurable size and a shared memory 116 of the minimum configurable size.

At this time, in step S21, the local memory control unit 118 may configure the shared memory 116 with the same mapping scheme and the same line size as the L1 cache 115.
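Giving the shared memory 116 the same line size and mapping as the L1 cache 115 means that one address decomposition can serve both lookups. The following is a minimal sketch of such a decomposition; the function name `split_address` and the default values of 128-byte lines and 32 sets are assumptions made for illustration and are not stated in the specification.

```python
def split_address(addr, line_size=128, num_sets=32):
    """Split a byte address into (tag, set_index, offset).

    line_size and num_sets are hypothetical values; the point is only
    that the L1 cache and the shared memory would use the same ones.
    """
    offset = addr % line_size        # byte position inside the line
    line_number = addr // line_size  # which line the address falls in
    set_index = line_number % num_sets
    tag = line_number // num_sets
    return tag, set_index, offset


tag, set_index, offset = split_address(0x12345)
print(tag, set_index, offset)
```

Because the fields are identical for both memories, a victim block copied out of the L1 cache can be found in the shared memory under the same tag and set index.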

In step S22, when one of the core units 111, 112, and 113 issues a load/store instruction for data access, the local memory control unit 118 searches the L1 cache 115 in the configurable memory 114.

In step S23, if the search in step S22 results in a cache hit in the L1 cache 115, the method proceeds to step S29, where the local memory control unit 118 accesses the data and terminates the load/store instruction.

If the search in step S22 results in a cache miss in the L1 cache 115, the method proceeds to step S24.

In step S24, if a line corresponding to the set index of the memory address to be accessed was in the L1 cache 115, the local memory control unit 118 proceeds to step S25; if no line corresponding to the set index of the memory address to be accessed was in the L1 cache 115, it proceeds to step S26.

In step S25, the local memory control unit 118 searches the shared memory 116 instead of accessing the L2 cache 120.

Because the local memory control unit 118 stores in the shared memory 116 only victim blocks that were previously held in the L1 cache 115 and then evicted, lines of the relevant set index are likely to still be held in the shared memory 116 even when a cache miss has occurred in the L1 cache 115.

In step S26, the local memory control unit 118 selects a victim block in the L1 cache 115 according to a predetermined cache replacement policy, for example LRU, NRU, or RRIP, copies the selected victim block from the L1 cache 115 to the shared memory 116, and proceeds to step S25 to search the shared memory 116.

More specifically, in step S26, if the shared memory 116 has no empty block into which the victim block can be copied, the local memory control unit 118 may select a secondary victim block in the shared memory 116 according to a predetermined cache replacement policy applied to the shared memory 116, for example LRU, NRU, or RRIP, copy the selected secondary victim block from the shared memory 116 to the L2 cache 120, and then store the victim block evicted from the L1 cache 115 in the space where the secondary victim block was stored.

After the shared memory 116 has been searched in step S25, in step S27, if the search results in a cache hit in the shared memory 116, the local memory control unit 118 proceeds to step S29, accesses the data, and terminates the load/store instruction.

If a cache miss occurs in step S27, the method proceeds to step S28.

In step S28, the local memory control unit 118 may store information related to the unprocessed cache miss (outstanding miss) in the MSHR, access the required data in the lower-level L2 cache 120 or the global GPU memory 140, store the data in the L1 cache 115, and terminate the load/store instruction while updating the unprocessed cache miss information.
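Steps S22 to S29 can be sketched as a small software model. The sketch below is illustrative only and is not the claimed hardware: it assumes fully associative stores with LRU replacement in place of set-associative memories, so the test in step S24 is read here as "the L1 cache has no free line", which is an assumed reading. The names `LRUStore`, `load`, and `aux_enabled`, the dictionary standing in for the L2 cache and the global GPU memory, and the set standing in for the MSHR are all hypothetical.

```python
from collections import OrderedDict


class LRUStore:
    """Fully associative store with LRU replacement (a simplified stand-in)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # line address -> data, oldest first

    def lookup(self, addr):
        """Return the data on a hit (marking it most recent), else None."""
        if addr not in self.lines:
            return None
        self.lines.move_to_end(addr)
        return self.lines[addr]

    def is_full(self):
        return len(self.lines) >= self.capacity

    def evict(self):
        """Remove and return the least recently used (addr, data) pair."""
        return self.lines.popitem(last=False)

    def insert(self, addr, data):
        self.lines[addr] = data
        self.lines.move_to_end(addr)


def load(addr, l1, shared, lower, mshr, aux_enabled=True):
    """Service one load; returns (data, level at which the data was found)."""
    data = l1.lookup(addr)                     # S22: search the L1 cache
    if data is not None:                       # S23 -> S29: L1 hit
        return data, "L1"
    if aux_enabled:
        if l1.is_full():                       # S24 -> S26 (assumed reading)
            victim_addr, victim_data = l1.evict()
            if shared.is_full():               # secondary victim goes down
                old_addr, old_data = shared.evict()
                lower[old_addr] = old_data
            shared.insert(victim_addr, victim_data)
        data = shared.lookup(addr)             # S25: search the shared memory
        if data is not None:                   # S27 -> S29: shared memory hit
            return data, "shared"
    elif l1.is_full():
        l1.evict()                             # kernel owns shared memory: drop victim
    mshr.add(addr)                             # S28: record the unprocessed miss
    data = lower[addr]                         # access L2 / global GPU memory
    l1.insert(addr, data)
    mshr.discard(addr)                         # miss resolved
    return data, "lower"


l1, shared = LRUStore(2), LRUStore(2)
lower = {a: a * 10 for a in range(8)}          # stand-in for L2 / global memory
mshr = set()
trace = [load(a, l1, shared, lower, mshr)[1] for a in (0, 1, 2, 0, 2)]
print(trace)
```

In this run the fourth access misses in the L1 cache but is served from the shared memory, because line 0 was copied there as a victim when line 2 was brought in. Passing `aux_enabled=False` models the case in which the kernel application uses the shared memory itself, so the shared memory search is skipped.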

The present embodiment and the drawings attached to this specification merely illustrate part of the technical idea included in the present invention, and it is evident that all modifications and specific embodiments that a person skilled in the art can readily infer within the scope of the technical idea contained in the specification and drawings of the present invention fall within the scope of rights of the present invention.

In addition, the apparatus according to the present invention can be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of the recording medium include ROM, RAM, optical disks, magnetic tape, floppy disks, hard disks, and non-volatile memory. The computer-readable recording medium may also be distributed over computer systems connected by a network, so that the computer-readable code is stored and executed in a distributed manner.

10 General-purpose graphics processor
110 Multiprocessor
111, 112, 113 Core units
114 Configurable memory
115 L1 cache
116 Shared memory
117 Texture cache
118 Local memory control unit
120 L2 cache
130 Global memory control unit
140 Global GPU memory
150 Thread management unit

Claims

1. A general-purpose graphics processor comprising a plurality of multiprocessors, an L2 cache, and a global GPU memory, wherein each of the multiprocessors comprises:
a plurality of core units;
a configurable memory configured to be divided into an L1 cache and a shared memory at a predetermined capacity ratio; and
a local memory control unit configured to store, in the shared memory, a victim block evicted from the L1 cache while a kernel application running on the general-purpose graphics processor does not use the shared memory.

2. The general-purpose graphics processor of claim 1, wherein the local memory control unit is configured to configure the shared memory with the same mapping scheme and line size as the L1 cache.

3. The general-purpose graphics processor of claim 1, wherein the local memory control unit is configured, when one of the core units issues a load/store instruction for data access, to:
access data in the L1 cache and terminate the load/store instruction if a cache hit occurs in the L1 cache; and
search the shared memory if a cache miss occurs in the L1 cache.

4. The general-purpose graphics processor of claim 3, wherein, when a cache miss occurs in the L1 cache, the local memory control unit is configured to:
search the shared memory if a line corresponding to a set index of the memory address to be accessed is in the L1 cache; and
if no line corresponding to the set index of the memory address to be accessed is in the L1 cache, select a victim block in the L1 cache, copy the selected victim block from the L1 cache to the shared memory, and search the shared memory.

5. The general-purpose graphics processor of claim 4, wherein the local memory control unit is configured, if the shared memory has no empty space into which the victim block can be copied, to select a secondary victim block in the shared memory according to a predetermined cache replacement policy applied to the shared memory, copy the selected secondary victim block from the shared memory to the L2 cache or the global GPU memory, and then store the victim block evicted from the L1 cache in the space where the secondary victim block was stored.

6. The general-purpose graphics processor of claim 4, wherein the local memory control unit is configured to:
access data in the shared memory and terminate the load/store instruction if a cache hit occurs in the shared memory; and
if a cache miss occurs in the shared memory, store information related to the unprocessed cache miss in a Miss Status Holding Register (MSHR), access data in the L2 cache or the global GPU memory, and terminate the load/store instruction while updating the unprocessed cache miss information in the MSHR.

7. A shared memory control method for a general-purpose graphics processor comprising a plurality of multiprocessors, an L2 cache, and a global GPU memory, wherein each of the multiprocessors includes a plurality of core units, a configurable memory, and a local memory control unit, and the configurable memory operates divided into an L1 cache and a shared memory at a predetermined ratio, the method comprising, by the local memory control unit:
(a) searching the L1 cache when one of the core units issues a load/store instruction for data access;
(b) accessing data in the L1 cache and terminating the load/store instruction if a cache hit occurs in the L1 cache in step (a);
(c) searching the shared memory if a cache miss occurs in the L1 cache in step (a) and a line corresponding to a set index of the memory address to be accessed is in the L1 cache;
(d) if a cache miss occurs in the L1 cache in step (a) and no line corresponding to the set index of the memory address to be accessed is in the L1 cache, selecting a victim block in the L1 cache according to a predetermined cache replacement policy, copying the selected victim block from the L1 cache to the shared memory, and searching the shared memory;
(e) accessing data in the shared memory and terminating the load/store instruction if a cache hit occurs in the shared memory in step (c) or step (d); and
(f) if a cache miss occurs in the shared memory in step (c) or step (d), storing information related to the unprocessed cache miss in the MSHR, accessing data in the L2 cache or the global GPU memory, storing the accessed data in the L1 cache, and terminating the load/store instruction while updating the unprocessed cache miss information.

8. The method of claim 7, further comprising, before step (a), setting the configurable memory to an L1 cache of the maximum configurable size and a shared memory of the minimum configurable size if the kernel application does not use the shared memory.

9. The method of claim 8, wherein setting the configurable memory further comprises configuring, by the local memory control unit, the shared memory with the same mapping scheme and line size as the L1 cache.

10. The method of claim 7, wherein step (d) comprises, if the shared memory has no empty block into which the victim block can be copied, selecting, by the local memory control unit, a secondary victim block in the shared memory according to a predetermined cache replacement policy applied to the shared memory, copying the selected secondary victim block from the shared memory to the L2 cache or the global GPU memory, and then storing the victim block evicted from the L1 cache in the space where the secondary victim block was stored.

11. A shared memory control method for a general-purpose graphics processor comprising a plurality of multiprocessors, an L2 cache, and a global GPU memory, wherein each of the multiprocessors includes a plurality of core units, a configurable memory, and a local memory control unit, and the configurable memory operates divided into an L1 cache and a shared memory at a predetermined ratio, the method comprising, by the local memory control unit:
(a') searching the L1 cache when one of the core units issues a load/store instruction for data access;
(b') accessing data in the L1 cache if a cache hit occurs in the L1 cache in step (a');
(c') searching the shared memory if a cache miss occurs in the L1 cache in step (a'); and
(d') accessing data in the L2 cache or the global GPU memory if a cache miss occurs in the shared memory in step (c').