KR101980999B1

KR101980999B1 - Method and apparatus for bypassing thread group level cache

Info

Publication number: KR101980999B1
Application number: KR1020180000317A
Authority: KR
Inventors: 홍성인; 한환수; 김현준
Original assignee: 성균관대학교산학협력단; 재단법인 초고성능 컴퓨팅 연구단
Priority date: 2018-01-02
Filing date: 2018-01-02
Publication date: 2019-05-21

Abstract

The present invention relates to a method and apparatus for thread-group-level cache bypassing. According to one embodiment of the present invention, the method performed by the apparatus comprises the steps of: extracting memory information and cache information for a memory and a cache provided in the apparatus; determining at least one switching point for each thread group that uses the cache, using the extracted memory information and cache information; and bypassing the cache used by the thread group according to the determined switching point.

Description

METHOD AND APPARATUS FOR BYPASSING THREAD GROUP LEVEL CACHE

The present invention relates to cache bypassing technology and, more particularly, to a thread-group-level cache bypassing method and apparatus.

Modern graphics processing units (GPUs) have widely adopted cache hierarchies, as shown in [Table 1] below. The relatively small caches of GPUs can easily become congested by thousands of threads.

Modern GPUs adopt a multi-level memory hierarchy to reduce the long latency of global memory accesses. A GPU has hardware-managed L1/L2 caches. However, the limited size of the cache and memory resources creates a bottleneck, which degrades performance.

FIG. 1 is a diagram for explaining a general cache bypassing method.

As shown in FIG. 1, a cache bypassing design resolves this bottleneck by having specific instructions or some thread groups bypass the cache. Cache bypassing methods include an instruction-level cache bypassing method 110 and a thread-group-level (warp-level) cache bypassing method 120. Here, a thread group may be referred to as a "Warp" or a "Thread Group".

In GPU computing, the thread-group-level cache bypassing method 120 is better than the instruction-level cache bypassing method 110, because the thread-group-level method 120 can make greater use of cache and memory resources than the instruction-level method 110. Accordingly, cache bypassing techniques that allow other data and threads to make greater use of cache and memory resources are being studied. On a GPU, where thousands of threads execute concurrently in parallel, thread-group-level cache bypassing, which can relieve contention for memory resources, shows better performance. In the thread-group-level cache bypassing method, for the entire program, some thread groups, namely thread group 0 121 for Warp 0 and thread group 1 122 for Warp 1, use the cache, and the remaining thread group, thread group 2 123 for Warp 2, bypasses the cache.

FIG. 2 is a diagram for explaining the gap in execution time between thread groups.

As shown in FIG. 2, when caching or bypassing is applied uniformly over the entire program, a gap in execution time arises between thread groups. Even when performing the same task and the same amount of work, a thread group that uses the cache finishes execution sooner, whereas a group that bypasses the cache takes relatively longer to execute. A thread block is the unit that is scheduled onto a streaming multiprocessor (SM).

Korean Registered Patent No. 10-1662363 (registered on September 27, 2016)

In the general thread-group-level cache bypassing method, the fast thread groups 202, 203, and 204 must stall until the slow thread group 201 completes, even though they have already finished executing. The thread block (TB) scheduled to run next waits until the slow thread group 201 completes and is launched only after that. That is, execution of the next thread block (TB) cannot begin before all thread groups of the currently scheduled thread block (TB) have finished executing. For this reason, the general thread-group-level cache bypassing technique needs improvement in order to achieve optimal performance gains.
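The stall described above can be made concrete with a small model. The sketch below assumes hypothetical finish times (arbitrary units) for thread groups 201 to 204; the function name `thread_block_stall` and the numbers are illustrative and do not come from the patent. It encodes only the stated rule that a thread block completes when its slowest thread group completes.

```python
def thread_block_stall(finish_times):
    """Return (block_finish, stalls) for one thread block.

    finish_times: per-thread-group completion times (arbitrary units).
    The block finishes when its slowest thread group finishes, so every
    faster group idles for the difference.
    """
    block_finish = max(finish_times)
    stalls = [block_finish - t for t in finish_times]
    return block_finish, stalls


# Hypothetical times: group 201 bypasses the cache and is slow,
# groups 202-204 use the cache and finish early.
finish = {201: 100, 202: 60, 203: 62, 204: 58}
block_finish, stalls = thread_block_stall(list(finish.values()))
print(block_finish, stalls)
```

In this model the next thread block cannot launch before time 100, and the three fast groups idle for roughly 40 units each, which is the loss the invention targets.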

Embodiments of the present invention aim to provide a thread-group-level cache bypassing method and apparatus that achieve better performance by reducing the execution time gap between thread groups that arises when the general thread-group-level cache bypassing technique is applied to a GPU, thereby resolving the resulting performance degradation.

Embodiments of the present invention aim to provide a thread-group-level cache bypassing method and apparatus that achieve better performance than general thread-group-level cache bypassing. When the general technique is applied, performance is lost because of the execution time gap between thread groups that use the cache and thread groups that do not. The embodiments minimize this loss by switching the groups that use the cache at specific program points, so that all thread groups use the cache fairly and the execution time gap between thread groups is reduced.

According to one embodiment of the present invention, there may be provided a thread-group-level cache bypassing method performed by a cache bypassing apparatus, the method comprising: determining at least one switching point for each thread group that uses a cache; and bypassing the cache used by the thread group according to the determined at least one switching point.

The method may further comprise extracting memory access information generated when a provided memory is accessed and cache bypassing information generated when the cache is bypassed, and in the determining of the switching point, at least one switching point may be determined for each thread group using the extracted memory access information and the extracted cache bypassing information.

The method may further comprise searching for at least one switchable point for each thread group.

In the searching for the switchable point, at least one point at which data cached by the thread group is not reused may be found and set as the at least one switchable point.

In the searching for the switchable point, at least one switchable point may be searched for according to the number obtained by dividing the total number of thread groups in a thread block by the number of thread groups that use the cache simultaneously.

In the determining of the switching point, at least one switching point may be determined for each thread group from among the at least one switchable point found.

In the determining of the switching point, an execution time gap between the thread groups that use the cache and the thread groups that do not use the cache may be calculated for the at least one switchable point found, and at least one switchable point at which the calculated execution time gap is minimized may be determined as the at least one switching point.

In the determining of the switching point, at least one switchable point at which the difference in distance between switching points is minimized, from among the at least one switchable point found, may be determined as the at least one switching point.

The method may further comprise: modifying program code based on the determined switching point; and compiling the modified program code to generate a cache bypassing program, and in the bypassing of the cache, the cache used by the thread group may be bypassed according to the generated cache bypassing program.

In the modifying of the program code, the program code may be modified, based on the determined switching point, using a first cache operator for bypassing by the thread group and a second cache operator for caching by the thread group.

Meanwhile, according to another embodiment of the present invention, there may be provided a thread-group-level cache bypassing apparatus comprising: a switching point determination unit that determines at least one switching point for each thread group that uses a cache; and a cache bypassing unit that bypasses the cache used by the thread group according to the determined at least one switching point.

The apparatus may further comprise an information extraction unit that extracts memory access information generated when the memory is accessed and cache bypassing information generated when the cache is bypassed, and the switching point determination unit may determine at least one switching point for each thread group using the extracted memory access information and the extracted cache bypassing information.

The apparatus may further comprise a switching point search unit that searches for at least one switchable point for each thread group.

The switching point search unit may find at least one point at which data cached by the thread group is not reused and set it as the at least one switchable point.

The switching point search unit may search for at least one switchable point according to the number obtained by dividing the total number of thread groups in a thread block by the number of thread groups that use the cache simultaneously.

The switching point determination unit may determine at least one switching point for each thread group from among the at least one switchable point found.

The switching point determination unit may calculate, for the at least one switchable point found, an execution time gap between the thread groups that use the cache and the thread groups that do not use the cache, and may determine at least one switchable point at which the calculated execution time gap is minimized as the at least one switching point.

The switching point determination unit may determine, as the at least one switching point, at least one switchable point at which the difference in distance between switching points is minimized from among the at least one switchable point found.

The apparatus may further comprise: a code modification unit that modifies program code based on the determined switching point; and a program compiler that compiles the modified program code to generate a cache bypassing program, and the cache bypassing unit may bypass the cache used by the thread group according to the generated cache bypassing program.

The code modification unit may modify the program code, based on the determined switching point, using a first cache operator for bypassing by the thread group and a second cache operator for caching by the thread group.

Embodiments of the present invention can achieve better performance by reducing the execution time gap between thread groups that arises when the general thread-group-level cache bypassing technique is applied to a GPU, thereby resolving the resulting performance degradation.

Embodiments of the present invention can achieve better performance than general thread-group-level cache bypassing. The performance loss caused by the execution time gap between thread groups that use the cache and thread groups that do not is minimized by switching the groups that use the cache at specific program points, so that all thread groups use the cache fairly and the execution time gap between thread groups is reduced.

FIG. 1 is a diagram for explaining a general cache bypassing method.
FIG. 2 is a diagram for explaining the gap in execution time between thread groups.
FIG. 3 is a block diagram for explaining the configuration of a thread-group-level cache bypassing apparatus according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a general thread-group-level cache bypassing method.
FIG. 5 is a diagram for explaining a switching point in a thread-group-level cache bypassing method according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a plurality of switching points in a thread-group-level cache bypassing method according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining the search for switchable points in a thread-group-level cache bypassing method according to an embodiment of the present invention.
FIG. 8 is a diagram for explaining switching point determination in a thread-group-level cache bypassing method according to an embodiment of the present invention.
FIGS. 9 to 12 are diagrams for explaining the code modification process in a thread-group-level cache bypassing method according to an embodiment of the present invention.
FIG. 13 is a flowchart for explaining the flow of a thread-group-level cache bypassing method according to an embodiment of the present invention.
FIG. 14 is a diagram showing execution times with no cache, with the cache, and with the general thread-group-level cache bypassing method.
FIG. 15 is a diagram showing execution times for the general cache bypassing method and for the thread-group-level cache bypassing method according to an embodiment of the present invention.
FIG. 16 is a diagram showing execution times for the general, unbalanced cache bypassing method and for the balanced cache bypassing method according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail.

It should be understood, however, that this is not intended to limit the present invention to the specific embodiments, and that the invention includes all modifications, equivalents, and alternatives falling within its spirit and technical scope.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

When a component is referred to as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may be present. In contrast, when a component is referred to as being "directly connected" or "directly coupled" to another component, it should be understood that no intervening components are present.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention will be described in more detail with reference to the accompanying drawings. To facilitate overall understanding, the same reference numerals are used for the same components in the drawings, and redundant descriptions of the same components are omitted.

FIG. 3 is a block diagram for explaining the configuration of a thread-group-level cache bypassing apparatus according to an embodiment of the present invention.

As shown in FIG. 3, the thread-group-level cache bypassing apparatus 300 according to an embodiment of the present invention includes an information extraction unit 310, a switching point determination unit 330, and a cache bypassing unit 360. The cache bypassing apparatus 300 according to another embodiment of the present invention may further include a switching point search unit 320. The cache bypassing apparatus 300 according to another embodiment of the present invention may also further include a code modification unit 340 and a program compiler 350.

However, not all of the illustrated components are essential. The thread-group-level cache bypassing apparatus 300 may be implemented with more components than those illustrated, or with fewer.

Hereinafter, the specific configuration and operation of each component of the thread-group-level cache bypassing apparatus 300 of FIG. 3 will be described.

The information extraction unit 310 extracts memory information and cache information for the memory and the cache provided in the GPU. The information extraction unit 310 may extract, as the memory information and the cache information, memory access information generated when the provided memory is accessed and cache bypassing information generated when the provided cache is bypassed.

The switching point determination unit 330 determines at least one switching point for each thread group that uses the cache, using the extracted memory information and cache information.

The cache bypassing unit 360 bypasses the cache used by the thread group according to the at least one switching point determined by the switching point determination unit 330.

Meanwhile, the switching point search unit 320 searches for at least one switchable point for each thread group. The switching point search unit 320 may find at least one point at which data cached by the thread group is not reused and set it as the at least one switchable point. The switching point search unit 320 may search for at least one switchable point according to the number obtained by dividing the total number of thread groups in a thread block by the number of thread groups that use the cache simultaneously.
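The search for switchable points can be sketched as follows. This is a minimal illustration, not the patent's procedure: it assumes the program is modeled as a list of operations, each touching a set of cache lines, and treats a position as switchable when no line touched before it is touched again at or after it. The names `find_switchable_points` and `phases_needed` are hypothetical, and reading the quotient of the two thread group counts as a number of cache-ownership phases is likewise an assumption.

```python
def find_switchable_points(accesses):
    """Return positions p (1 <= p < n) with no data reuse across p.

    accesses[i] is the set of cache lines touched by operation i.
    Position p is switchable if every line touched by ops [0, p)
    is never touched by ops [p, n).
    """
    # Index of the last operation that touches each line.
    last_use = {}
    for i, lines in enumerate(accesses):
        for line in lines:
            last_use[line] = i

    points = []
    furthest = -1  # latest last-use among all lines seen so far
    for i in range(len(accesses) - 1):
        for line in accesses[i]:
            furthest = max(furthest, last_use[line])
        if furthest <= i:  # nothing seen so far is reused later
            points.append(i + 1)
    return points


def phases_needed(total_groups, concurrent_cached_groups):
    """Total thread groups divided by groups caching at the same time."""
    return total_groups // concurrent_cached_groups


# Lines A and B are live across ops 0-2; C and D across ops 3-5.
accesses = [{"A"}, {"A", "B"}, {"B"}, {"C"}, {"C", "D"}, {"D"}]
print(find_switchable_points(accesses), phases_needed(4, 2))
```

With four thread groups of which two cache at a time, the quotient is 2; under the assumption above, this corresponds to two phases separated by one switching point.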

Once switchable points have been found, the switching point determination unit 330 may determine at least one switching point for each thread group from among the at least one switchable point found. The switching point determination unit 330 calculates, for the at least one switchable point found by the switching point search unit 320, the execution time gap between the thread groups that use the cache and the thread groups that do not use the cache. The switching point determination unit 330 may then determine at least one switchable point at which the calculated execution time gap is minimized as the at least one switching point.
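The gap-minimizing selection can be illustrated with a simple cost model. The sketch assumes each operation has a known cost when cached and when bypassing, and that one set of thread groups caches before the switching point while the other caches after it; the function names and cost values are hypothetical and are not taken from the patent.

```python
def execution_gap(cached_cost, bypass_cost, p):
    """Execution time gap if the cache owner switches before op p."""
    # Groups that cache first, then bypass from op p onward.
    cache_first = sum(cached_cost[:p]) + sum(bypass_cost[p:])
    # Groups that bypass first, then cache from op p onward.
    bypass_first = sum(bypass_cost[:p]) + sum(cached_cost[p:])
    return abs(cache_first - bypass_first)


def pick_switching_point(cached_cost, bypass_cost, candidates):
    """Choose the switchable point with the smallest execution gap."""
    return min(candidates,
               key=lambda p: execution_gap(cached_cost, bypass_cost, p))


# Hypothetical per-operation costs (arbitrary units).
cached_cost = [1, 1, 1, 1, 1, 1]
bypass_cost = [4, 4, 2, 2, 2, 2]
candidates = [2, 3, 4]  # switchable points found earlier

best = pick_switching_point(cached_cost, bypass_cost, candidates)
print(best, execution_gap(cached_cost, bypass_cost, best))
```

Because the early operations are more expensive when bypassed, the best point in this example lies before the middle of the program rather than exactly at it.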

The switching point determination unit 330 may determine, as the at least one switching point, at least one switchable point at which the difference in distance between switching points is minimized from among the at least one switchable point found by the switching point search unit 320.
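One possible reading of this criterion is that the chosen switching points should divide the program into segments of nearly equal length. The sketch below implements that reading under stated assumptions: positions are operation indices, the program start and end count as segment boundaries, and the spread is measured as the longest segment minus the shortest. The function names are hypothetical.

```python
from itertools import combinations


def segment_lengths(points, program_length):
    """Lengths of the segments delimited by start, points, and end."""
    bounds = [0, *points, program_length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def pick_evenly_spaced(candidates, k, program_length):
    """Choose k switchable points whose segments are most uniform."""
    def spread(points):
        lengths = segment_lengths(points, program_length)
        return max(lengths) - min(lengths)

    # Exhaustive search; fine for the handful of candidates in a kernel.
    return min(combinations(sorted(candidates), k), key=spread)


chosen = pick_evenly_spaced([2, 4, 8, 11], k=2, program_length=12)
print(chosen, segment_lengths(chosen, 12))
```

Equal segments give each set of thread groups a similar share of the program with the cache, which is in line with the fairness goal stated for the invention.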

Meanwhile, the code modification unit 340 modifies the program code of the GPU based on the switching point determined by the switching point determination unit 330. The code modification unit 340 may modify the program code of the GPU, based on the switching point determined by the switching point determination unit 330, using a first cache operator for bypassing by the thread group and a second cache operator for caching by the thread group.
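A rough sketch of such a rewrite is shown below. It assumes, purely for illustration, that the two cache operators correspond to the PTX load qualifiers `.cg` (cache at L2 only, skipping L1) and `.ca` (cache at all levels); this passage does not name the operators, so that mapping is an assumption, as is the helper name `rewrite_loads`.

```python
BYPASS_OP = "ld.global.cg"  # assumed stand-in for the first cache operator
CACHE_OP = "ld.global.ca"   # assumed stand-in for the second cache operator


def rewrite_loads(num_loads, switching_point, cached_first):
    """Pick a cache operator for each load of one thread group.

    cached_first=True: the group caches before the switching point
    and bypasses after it. cached_first=False: the reverse.
    """
    ops = []
    for i in range(num_loads):
        before_switch = i < switching_point
        use_cache = before_switch == cached_first
        ops.append(CACHE_OP if use_cache else BYPASS_OP)
    return ops


# Four loads with a switching point before load 2.
print(rewrite_loads(4, 2, cached_first=True))
print(rewrite_loads(4, 2, cached_first=False))
```

In a real toolchain the choice would be emitted into the kernel source or intermediate code before compilation; the lists here only show which operator each load would receive.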

The program compiler 350 compiles the GPU program code modified by the code modification unit 340 to generate a cache bypassing program.

The cache bypassing unit 360 may then perform thread-group-level cache bypassing, in which the cache used by the thread group is bypassed according to the cache bypassing program generated by the program compiler 350.

FIG. 4 is a diagram for explaining a general thread-group-level cache bypassing method.

The general thread-group-level cache bypassing method divides all thread groups into thread groups that use the cache and thread groups that do not, and then executes the thread groups. A thread group may include at least one thread and is not limited to a particular thread group. For example, the at least one thread may be expressed as op0; op1; op2; op3;.

As shown in FIG. 4, thread group 0 410, denoted Warp 0, and thread group 1 420, denoted Warp 1, are cached thread groups that use the cache. In contrast, thread group 2 430, denoted Warp 2, and thread group 3 440, denoted Warp 3, are cache-bypassed thread groups that do not use the cache.

Here, thread group 0 410 and thread group 1 420, which use the cache, finish executing earlier than thread group 2 430 and thread group 3 440. Therefore, thread group 0 410 and thread group 1 420 must wait until thread group 2 430 and thread group 3 440 finish executing.

도 5는 본 발명의 일 실시 예에 따른 쓰레드 그룹 레벨의 캐시 바이패싱 방법에서 스위칭 지점을 설명하기 위한 도면이다.5 is a diagram for explaining a switching point in a thread bypassing method of a thread group level according to an embodiment of the present invention.

본 발명의 일 실시 예에 따른 쓰레드 그룹 레벨의 캐시 바이패싱 방법은 도 4와 유사하게 우선 전체 쓰레드 그룹을 캐시를 사용하는 쓰레드 그룹과 캐시를 사용하지 않는 쓰레드 그룹으로 나누고, 쓰레드 그룹을 실행시킨다.The method of bypassing a thread group level according to an embodiment of the present invention divides the entire thread group into a thread group that uses a cache and a thread group that does not use a cache, and executes a thread group, similar to FIG.

그러나 스위칭 지점 이후, 캐시를 사용하는 캐싱된 쓰레드 그룹은 바이패스되고, 캐시를 사용하지 않는 바이패싱 쓰레드 그룹은 캐시를 사용하게 된다.However, after the switching point, the cached thread group using the cache is bypassed, and the bypassing thread group that does not use the cache uses the cache.

As shown in FIG. 5, thread group 0 (410), denoted Warp 0, and thread group 1 (420), denoted Warp 1, use the cache at first and, after the switching point, become cache-bypassed thread groups that do not use the cache.

In contrast, thread group 2 (430), denoted Warp 2, and thread group 3 (440), denoted Warp 3, are bypassed at first and do not use the cache and, after the switching point, become cached thread groups that use the cache.

Here, thread group 0 (410) and thread group 1 (420) finish executing at almost the same time as thread group 2 (430) and thread group 3 (440). This is because thread groups 0 (410), 1 (420), 2 (430), and 3 (440) each use the cache once, in equal measure. Consequently, neither the thread groups that use the cache first nor those that use it later need to wait for other thread groups to finish executing. In this way, the thread-group-level cache bypassing method according to an embodiment of the present invention can reduce the difference in execution time between thread groups.
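The balancing argument of FIG. 4 and FIG. 5 can be illustrated with a toy cost model. The sketch below is only an illustration under stated assumptions: the function name `finish_times` and the per-segment costs (1.0 for a cached segment, 2.0 for a bypassed one) are hypothetical, and the model ignores cache contention and scheduling effects that a real GPU would exhibit.

```python
def finish_times(schedule, cached_cost=1.0, bypass_cost=2.0):
    """Return a toy finish time for each warp.

    schedule[w][s] is True when warp w uses the cache in program segment s.
    The two cost values are hypothetical and only encode "cached is faster".
    """
    return [
        sum(cached_cost if uses_cache else bypass_cost for uses_cache in row)
        for row in schedule
    ]


# FIG. 4 style: warps 0-1 always cached, warps 2-3 always bypassed.
unbalanced = [[True, True], [True, True], [False, False], [False, False]]

# FIG. 5 style: the roles are swapped at a single switching point.
balanced = [[True, False], [True, False], [False, True], [False, True]]

print(finish_times(unbalanced))
print(finish_times(balanced))
```

In this toy model the unbalanced schedule leaves two warps waiting for the other two, while the balanced schedule gives every warp the same total cost.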

FIG. 6 is a diagram for explaining a plurality of switching points in a thread-group-level cache bypassing method according to an embodiment of the present invention.

The thread-group-level cache bypassing method according to an embodiment of the present invention switches each thread group individually with reference to Switching Point #1 and Switching Point #2. As one example, the method cache-bypasses a thread group that uses the cache so that it no longer uses the cache after switching point #1. As another example, the method caches a thread group that does not use the cache so that it uses the cache after switching point #1.

The individual cache bypassing of each thread group is described below with reference to FIG. 6. FIG. 6 shows a case in which a thread block consists of six thread groups and the appropriate number of thread groups using the cache at the same time is two.

The first set, thread group 0 (410), denoted Warp 0, and thread group 1 (420), denoted Warp 1, uses the cache before switching point #1. Thread group 0 (410) and thread group 1 (420) are switched at switching point #1 by the cache bypassing apparatus (300). After switching point #1, thread group 0 (410) and thread group 1 (420) are bypassed so that they do not use the cache.

The second set, thread group 2 (430), denoted Warp 2, and thread group 3 (440), denoted Warp 3, does not use the cache before switching point #1. Thread group 2 (430) and thread group 3 (440) are switched at switching point #1 by the cache bypassing apparatus (300). After switching point #1, thread group 2 (430) and thread group 3 (440) are cached so that they use the cache. Thread group 2 (430) and thread group 3 (440) are then switched again at switching point #2 by the cache bypassing apparatus (300). After switching point #2, thread group 2 (430) and thread group 3 (440) are bypassed again so that they do not use the cache.

The third set, thread group 4 (450), denoted Warp 4, and thread group 5 (460), denoted Warp 5, does not use the cache before switching point #2. Thread group 4 (450) and thread group 5 (460) are switched at switching point #2 by the cache bypassing apparatus (300). After switching point #2, thread group 4 (450) and thread group 5 (460) are cached so that they use the cache.
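The rotation described for FIG. 6 can be written down as a small table of which warps hold the cache in each phase, where phase 0 precedes switching point #1, phase 1 lies between switching points #1 and #2, and so on. This is a minimal sketch; the function name `cached_warps_per_phase` is hypothetical, and assigning warps to phases in ascending ID order is an assumption taken from the figure description.

```python
def cached_warps_per_phase(num_warps, cached_at_once):
    """List, for each phase, the warp IDs that use the cache in that phase.

    Warps are assigned to phases in ascending ID order (an assumption based
    on the FIG. 6 description); all other warps bypass the cache.
    """
    if num_warps <= 0 or cached_at_once <= 0:
        raise ValueError("both arguments must be positive")
    return [
        list(range(start, min(start + cached_at_once, num_warps)))
        for start in range(0, num_warps, cached_at_once)
    ]


# Six warps, two cached at a time, as in the FIG. 6 description.
for phase, warps in enumerate(cached_warps_per_phase(6, 2)):
    print(f"phase {phase}: cached warps {warps}")
```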

FIG. 7 is a diagram for explaining the search for switchable points in a thread-group-level cache bypassing method according to an embodiment of the present invention.

As shown in FIG. 7, the thread block (700) may consist of a plurality of thread groups. The thread block (700) includes load A through load H, and each of the thread groups may include several loads.

Here, the cache bypassing apparatus (300) searches the thread block (700) for at least one switchable point. A switchable point is a point at which cached data is no longer reused. That is, the cache bypassing apparatus (300) may search for at least one point at which the data cached by a thread group is not reused, and set each point found as a switchable point.

In the example shown in FIG. 7, the cache bypassing apparatus (300) may find three switchable points for the thread block (700): Switchable point #1, Switchable point #2, and Switchable point #3.
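One way to read the definition of a switchable point is as a position in the load sequence across which no cached line is reused. The sketch below illustrates that reading only; the patent does not specify the search algorithm, so the function name `find_switchable_points`, the representation of the trace as a list of cache-line tags, and the sample trace are all assumptions.

```python
def find_switchable_points(load_trace):
    """Return boundaries p such that nothing loaded before p is loaded at or after p.

    load_trace is a list of cache-line tags in program order (a hypothetical
    stand-in for the extracted memory access information). A returned value p
    means "between load p-1 and load p".
    """
    # Record the last position at which each line is touched.
    last_use = {line: i for i, line in enumerate(load_trace)}

    points = []
    furthest = -1  # furthest last-use among all lines seen so far
    for i, line in enumerate(load_trace[:-1]):
        furthest = max(furthest, last_use[line])
        if furthest == i:
            # Every line seen so far is dead after position i.
            points.append(i + 1)
    return points


trace = ["A", "B", "A", "C", "D", "C", "E"]
print(find_switchable_points(trace))
```

The single pass keeps the cost linear in the trace length, which matters if the trace comes from a simulator run.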

FIG. 8 is a diagram for explaining switching point determination in a thread-group-level cache bypassing method according to an embodiment of the present invention.

As shown in FIG. 8, a thread block may consist of eight thread groups (warps). Assuming that the appropriate number of thread groups that can use the cache at the same time is two, the eight thread groups form four sets in total. Three switching points are then required.

Accordingly, the cache bypassing apparatus (300) according to an embodiment of the present invention determines three switching points from among the switchable points. In doing so, the cache bypassing apparatus (300) may determine the switching points so that the distances between them are as similar as possible.

In the example shown in FIG. 8, the cache bypassing apparatus (300) may determine switchable point #2, switchable point #4, and switchable point #6, from among switchable points #1 to #6, as the switching points. That is, the cache bypassing apparatus (300) may switch the thread groups at the determined switching points #2, #4, and #6.
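The selection rule, choosing a fixed number of switching points whose spacing is as even as possible, can be sketched as an exhaustive search over the candidates. This is an illustration under assumptions rather than the patented procedure: the function name `choose_switching_points`, the instruction positions of the candidates, the kernel length, and the choice to measure evenness as the difference between the longest and shortest segment (including the segments before the first and after the last switching point) are all hypothetical.

```python
from itertools import combinations


def choose_switching_points(candidates, num_switches, kernel_length):
    """Pick num_switches candidate positions that split the kernel most evenly.

    Evenness is measured as (longest segment - shortest segment), where the
    segments are delimited by 0, the chosen points, and kernel_length.
    This metric is an assumption; the text only asks for similar distances.
    """
    if num_switches > len(candidates):
        raise ValueError("not enough switchable points")

    best, best_spread = None, None
    for combo in combinations(sorted(candidates), num_switches):
        bounds = (0, *combo, kernel_length)
        segments = [b - a for a, b in zip(bounds, bounds[1:])]
        spread = max(segments) - min(segments)
        if best_spread is None or spread < best_spread:
            best, best_spread = combo, spread
    return list(best)


# Six hypothetical switchable points (#1 to #6) in a kernel of 80 instructions.
switchable = [10, 20, 30, 40, 50, 60]
print(choose_switching_points(switchable, 3, 80))
```

With these hypothetical positions the search selects the second, fourth, and sixth candidates, the same pattern as the FIG. 8 example. Exhaustive search is adequate for a handful of candidates; a larger candidate set would call for a dynamic-programming formulation.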

FIGS. 9 to 12 are diagrams for explaining a code modification process in a thread-group-level cache bypassing method according to an embodiment of the present invention.

The code modification process in the thread-group-level cache bypassing method is described below with reference to FIGS. 9 to 12.

As shown in FIG. 9, the cache bypassing apparatus (300) splits dedicated basic blocks (BB) in the thread-group-level cache bypassing method. Here, a dedicated basic block may be at the thread group level. As one example, the cache bypassing apparatus (300) splits dedicated basic block BB0 (900) at the switching point into dedicated basic block BB0_0 (910) and dedicated basic block BB0_1 (920).

Here, if there are N switching points, the cache bypassing apparatus (300) may split dedicated basic block BB# into dedicated basic blocks BB0#_0, BB0#_1, …, BB0#_N.

As shown in FIG. 10, the cache bypassing apparatus (300) may duplicate the split dedicated basic blocks and rename the duplicates. As one example, the cache bypassing apparatus (300) may duplicate the split dedicated basic block BB0_0 (910) and rename the copies as dedicated basic blocks BB0_0_0 (911) and BB0_0_1 (912). Likewise, the cache bypassing apparatus (300) may duplicate the split dedicated basic block BB0_1 (920) and rename the copies as dedicated basic blocks BB0_1_0 (921) and BB0_1_1 (922).

The cache bypassing apparatus (300) may then add cache operators to the dedicated basic block and to the duplicated dedicated basic block. For example, the cache bypassing apparatus (300) may add the bypassing operator "cg" to the instructions in dedicated basic block BB0_0_0 (911), modifying them to ld.global.cg #1, ld.global.cg #2, and ld.global.cg #3. Conversely, the cache bypassing apparatus (300) may add the caching operator "ca" to the instructions in the duplicated dedicated basic block BB0_0_1 (912), modifying them to ld.global.ca #1, ld.global.ca #2, and ld.global.ca #3. The same applies to dedicated basic block BB0_1_0 (921) and the duplicated dedicated basic block BB0_1_1 (922).
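The duplicate-and-tag step can be sketched as a text rewrite over the PTX lines of one split block. The sketch relies on the PTX cache operators `.ca` and `.cg` for `ld` instructions named in the description; everything else, including the function names `tag_loads` and `specialize_block`, the regular expression, and the sample instructions, is a hypothetical illustration rather than the tool described here.

```python
import re

# Matches a global load, with or without an existing PTX cache operator.
_LD_GLOBAL = re.compile(r"\bld\.global(?:\.(?:ca|cg|cs|lu|cv))?\b")


def tag_loads(ptx_lines, cache_op):
    """Return a copy of ptx_lines with every ld.global using cache_op."""
    if cache_op not in ("ca", "cg"):
        raise ValueError("cache_op must be 'ca' or 'cg'")
    return [_LD_GLOBAL.sub(f"ld.global.{cache_op}", line) for line in ptx_lines]


def specialize_block(label, ptx_lines):
    """Duplicate one split block into a bypassing copy and a caching copy.

    Follows the naming in the description: suffix _0 carries the "cg"
    (bypassing) operator and suffix _1 carries the "ca" (caching) operator.
    """
    return {
        f"{label}_0": tag_loads(ptx_lines, "cg"),
        f"{label}_1": tag_loads(ptx_lines, "ca"),
    }


# Hypothetical body of split block BB0_0.
body = ["ld.global.f32 %f1, [%rd1];", "add.f32 %f2, %f1, %f1;"]
for name, lines in specialize_block("BB0_0", body).items():
    print(name, lines)
```

Only load instructions are rewritten; arithmetic and other instructions pass through unchanged, so the two copies differ only in their caching behavior.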

As shown in FIG. 11, the cache bypassing apparatus (300) generates a control flow for each thread group. As one example, the cache bypassing apparatus (300) may generate a control flow such as {BB#_0_1}, {BB#_1_0}, …, {BB#_N_0} for thread group 0.

For example, the cache bypassing apparatus (300) may generate a control flow that executes dedicated basic block BB0_0_1 (912), to which the caching operator has been added, and then executes dedicated basic block BB0_1_0 (921), to which the bypassing operator has been added. The cache bypassing apparatus (300) may also generate a control flow that executes dedicated basic block BB0_0_0 (911), to which the bypassing operator has been added, and then executes dedicated basic block BB0_1_1 (922), to which the caching operator has been added. That is, the cache bypassing apparatus (300) may generate control flows such that, at each switching point, the thread groups take turns using and not using the cache. In this way, the cache can be used fairly by every thread group.
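The per-thread-group control flow can be expressed as a list of block labels, one per program segment, in which exactly one label carries the caching suffix. This is a minimal sketch that assumes warps take their cached turn in ascending ID order; the function name `control_flow` and its parameters are hypothetical, while the `_0` (bypass) and `_1` (cache) suffixes follow the labels used in the description.

```python
def control_flow(warp_id, cached_at_once, num_segments, block="BB0"):
    """Return the sequence of block labels that one warp executes.

    The warp runs the caching copy (suffix _1) in the single segment where
    it holds the cache and the bypassing copy (suffix _0) everywhere else.
    """
    cached_segment = warp_id // cached_at_once
    return [
        f"{block}_{seg}_{1 if seg == cached_segment else 0}"
        for seg in range(num_segments)
    ]


# Four warps, two cached at a time, one switching point (two segments).
for warp in range(4):
    print(warp, control_flow(warp, cached_at_once=2, num_segments=2))
```

For warps 0 and 1 this yields BB0_0_1 followed by BB0_1_0, and for warps 2 and 3 it yields BB0_0_0 followed by BB0_1_1, matching the two flows described above.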

As shown in FIG. 12, the cache bypassing apparatus (300) gathers the dedicated basic blocks. The cache bypassing apparatus (300) then inserts control flow instructions.

In this way, the cache bypassing apparatus (300) may split the dedicated basic blocks at the switching points and modify the Parallel Thread Execution (PTX) code using the cache operators.

FIG. 13 is a flowchart for explaining the flow of a thread-group-level cache bypassing method according to an embodiment of the present invention.

The thread-group-level cache bypassing method according to an embodiment of the present invention consists of the following steps, so that the cache is used equally by each thread group in thread-group-level cache bypassing.

In step S101, the thread-group-level cache bypassing apparatus (300) extracts memory access information and cache bypassing information. The cache bypassing apparatus (300) extracts the memory access information and the cache bypassing information through an information extraction unit (310) built on a GPU simulator. Here, the cache bypassing information may include the IDs of the thread groups that bypass the cache.

In step S102, the thread-group-level cache bypassing apparatus (300) searches for at least one switchable point for each thread group. Through the switching point search unit (320), the cache bypassing apparatus (300) searches, based on the extracted memory access information, for all switching points that do not lower the cache hit rate. By having all thread groups take turns using the cache fairly, the cache bypassing apparatus (300) can reduce the execution time gap between thread groups that results from the difference in memory access latency in general thread-group-level cache bypassing.

To this end, the cache bypassing apparatus (300) modifies the program so that the thread groups using the cache are switched at specific points in the program. Because the cache hit rate may fall depending on where the thread groups using the cache are switched, the cache bypassing apparatus (300) searches for switchable points at which the cache hit rate does not fall.

In step S103, the thread-group-level cache bypassing apparatus (300) determines at least one switching point for each thread group. The cache bypassing apparatus (300) determines a switching point to be a point at which the data cached by the thread groups that were previously using the cache is no longer reused. Because the data cached up to that switching point will not be reused afterwards, the cache hit rate does not fall even when the cache bypassing apparatus (300) switches the thread groups that use the cache.

Through the switching point determination unit (330), the cache bypassing apparatus (300) determines, from among all switchable points that do not lower the cache hit rate, the switching points at which switching will actually take place. During a switch, no thread group uses the cache for the interval between the times at which the different thread groups reach the switching point, so switching should be kept to a minimum. The cache bypassing apparatus (300) may therefore determine as many switching points as the number of switches from among the switchable points. That is, the cache bypassing apparatus (300) may determine the minimum number of switching points.

The number of switches may be determined by the general thread-group-level cache bypassing. The cache bypassing apparatus (300) may calculate it as the ceiling of the total number of thread groups divided by the number of thread groups that will use the cache. The cache bypassing apparatus (300) may determine as many switching points as the number of switches from among the switchable points. In doing so, the cache bypassing apparatus (300) may determine the switching points so that the distances between them are as similar as possible, or so that the differences between those distances are minimized.
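The ceiling calculation can be checked against the two worked examples in the description. In FIG. 6, six thread groups with two cached at a time form three sets and use two switching points; in FIG. 8, eight thread groups with two cached at a time form four sets and use three switching points. The sketch below therefore assumes that the ceiling gives the number of sets that take a turn with the cache and that the number of switching points is one fewer; that reading, and both function names, are assumptions made for illustration.

```python
import math


def num_cache_turns(total_groups, cached_at_once):
    """Ceiling of total thread groups over thread groups cached at once."""
    return math.ceil(total_groups / cached_at_once)


def num_switching_points(total_groups, cached_at_once):
    """Switching points needed so that every set gets one turn (assumed reading)."""
    return num_cache_turns(total_groups, cached_at_once) - 1


print(num_cache_turns(6, 2), num_switching_points(6, 2))  # FIG. 6 example
print(num_cache_turns(8, 2), num_switching_points(8, 2))  # FIG. 8 example
```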

In step S104, the thread-group-level cache bypassing apparatus (300) modifies the program code of the graphics processing unit. As one example, the thread-group-level cache bypassing apparatus (300) may modify Parallel Thread Execution (PTX) code. Once the switching points have been determined in step S103, the thread-group-level cache bypassing apparatus (300) may modify the PTX code based on the switching point information in order to apply it to the actual program. PTX code is a representative example of code written in an intermediate language for GPU computing programs.

The cache bypassing apparatus (300) uses a first cache operator and a second cache operator to specify which thread groups use the cache in a particular program region. Here, the first cache operator is the cg bypassing operator, used for bypassing by a thread group. The second cache operator is the ca caching operator, used for caching by a thread group.

For example, if program region 0 is denoted BB0, the cache bypassing apparatus (300) uses BB0 to attach the first cache operator (the cg bypassing operator) to the load instructions in BB0_0, which bypasses the cache. In BB0_1, which uses the cache, the cache bypassing apparatus (300) attaches the second cache operator (the ca caching operator) to the load instructions. The thread-group-level cache bypassing apparatus (300) may then write the control flow so that the thread groups that will use the cache execute BB0_1 and the thread groups that will bypass the cache execute BB0_0.

In step S105, the thread-group-level cache bypassing apparatus (300) compiles the modified program code to generate a cache bypassing program. The thread-group-level cache bypassing apparatus (300) performs a compilation step. Once the PTX code has been generated, a GPU compiler (typically NVCC, NVIDIA's CUDA Compiler) compiles the PTX code to produce an executable GPU program.

In step S106, the thread-group-level cache bypassing apparatus (300) bypasses the cache used by the thread groups in accordance with the cache bypassing program.

Next, the experiments comparing the general thread-group-level cache bypassing method with the thread-group-level cache bypassing method according to an embodiment of the present invention, and their results, are described.

First, an experiment is described that was carried out to determine whether, and to what extent, the general thread-group-level cache bypassing method causes an execution time gap between the group that uses the cache and the group that does not.

FIG. 14 is a diagram showing execution times with the cache unused, with the cache used, and with the general thread-group-level cache bypassing method.

In the experiment, the normalized execution time of each thread group was measured on a cycle-accurate GPU simulator (GPGPUSim) for the first kernel of the ATAX application from a benchmark suite (Polybench/gpu) that performs matrix and linear algebra operations, and the results are plotted in FIG. 14.

This kernel is a workload in which all thread groups perform the same work, so when the L1 cache is not used, L1 Bypass (1401), there is almost no execution time gap between thread groups. When the L1 cache is used, L1 Cache (1402), performance actually degrades because requests concentrate on the L1 cache and memory resources. When thread-group-level cache bypassing is applied in order to use the L1 cache efficiently, WLB (1403), performance improves by the amount shown in the rightmost region of FIG. 14. However, WLB (1403) shows an execution gap of about 26.53% between the group that uses the cache and the group that does not.

FIG. 15 is a diagram showing execution times for general cache bypassing methods and for the thread-group-level cache bypassing method according to an embodiment of the present invention.

Applying the thread-group-level cache bypassing method according to an embodiment of the present invention, which can reduce the execution time gap between thread groups, gives the results shown in FIG. 15. The bar graph of FIG. 15 shows the IPC (instructions per cycle) measured for a specific kernel of each application in four cases: (1) L1 Cache Disable (1501), in which the L1 cache is not used; (2) L1 Cache Enable (1502), in which the L1 cache is used; (3) Warp-level Cache Bypassing (1503), in which thread-group-level cache bypassing is applied; and (4) Balanced Cache Bypassing (1504), in which the method according to an embodiment of the present invention is applied. Balanced Cache Bypassing (1504) shows a large performance improvement over L1 Cache Disable (1501) and L1 Cache Enable (1502). Balanced Cache Bypassing (1504) also shows an average performance improvement of about 6% over Warp-level Cache Bypassing (1503).

FIG. 16 is a diagram showing execution times for the general, unbalanced cache bypassing method and for the balanced cache bypassing method according to an embodiment of the present invention.

As explained with reference to FIG. 14, Unbalanced Cache Bypassing (1601), the general cache bypassing method without balancing, shows an execution gap of about 26.53% between the group that uses the cache and the group that does not.

As shown in FIG. 16, Balanced Cache Bypassing (1602), the balanced cache bypassing method according to an embodiment of the present invention, shows no execution gap between the group that uses the cache and the group that does not. In addition, because each thread group uses the cache fairly, the computation time of Balanced Cache Bypassing (1602) decreases, and its performance improves by 14.27% compared with Unbalanced Cache Bypassing (1601), the general cache bypassing method.

The thread-group-level cache bypassing method described above can be implemented as computer-readable code on a computer-readable recording medium.

The thread-group-level cache bypassing method may be embodied in a computer-readable storage medium containing instructions executable by a processor, the instructions being configured to cause the processor to perform the steps of: extracting memory information and cache information for the memory and cache provided; determining at least one switching point for each thread group that uses the cache, using the extracted memory information and cache information; and bypassing the cache used by the thread group according to the determined at least one switching point.

Computer-readable recording media include all kinds of recording media that store data readable by a computer system. Examples include ROM (Read Only Memory), RAM (Random Access Memory), magnetic tape, magnetic disks, flash memory, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected by a computer network, so that the code is stored and executed in a distributed manner.

Although the invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments. Those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

Specifically, the described features may be implemented in digital electronic circuitry, or in computer hardware, firmware, or combinations thereof. The features may be implemented in a computer program product embodied in a machine-readable storage device, for example for execution by a programmable processor. The features may be performed by a programmable processor executing a program of instructions that performs the functions of the described embodiments by operating on input data and generating output. The described features may be implemented in one or more computer programs executable on a programmable system that includes at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. A computer program comprises a set of instructions that can be used, directly or indirectly, in a computer to perform a particular operation or bring about a particular result. A computer program may be written in any form of programming language, including compiled or interpreted languages, and may be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

Suitable processors for executing a program of instructions include, for example, both general-purpose and special-purpose microprocessors, and the sole processor or one of multiple processors of any kind of computer. Storage devices suitable for embodying computer program instructions and data that implement the described features include all forms of non-volatile memory, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic devices such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory may be integrated within, or supplemented by, ASICs (application-specific integrated circuits).

Although the present invention has been described above on the basis of a series of functional blocks, it is not limited by the foregoing embodiments and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the present invention.

Combinations of the foregoing embodiments are not limited to the embodiments described above. Depending on the implementation and/or need, various forms of combinations may be provided in addition to the foregoing embodiments.

In the foregoing embodiments, the methods are described on the basis of flowcharts as a series of steps or blocks, but the present invention is not limited to the order of the steps, and some steps may occur in a different order from, or simultaneously with, other steps described above. In addition, those of ordinary skill in the art will understand that the steps shown in a flowchart are not exclusive, that other steps may be included, and that one or more steps of the flowchart may be deleted without affecting the scope of the present invention.

The foregoing embodiments include examples of various aspects. Although not every possible combination for representing the various aspects can be described, those of ordinary skill in the art will recognize that other combinations are possible. Accordingly, the present invention is intended to include all other replacements, modifications, and changes that fall within the scope of the following claims.

Although the invention has been described above with reference to the drawings and embodiments, this does not mean that the scope of protection of the present invention is limited by those drawings or embodiments. Those skilled in the art will understand that the present invention may be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the following claims.

300: Cache bypassing apparatus
310: Information extracting unit
320: Switching point searching unit
330: Switching point determining unit
340: Code modifying unit
350: Program compiler
360: Cache bypassing unit

Claims

1. A thread-group-level cache bypassing method performed by a cache bypassing apparatus, the method comprising:
determining at least one switching point for each thread group that uses a cache; and
bypassing the cache used by the thread group according to the determined at least one switching point.

2. The method of claim 1,
further comprising extracting memory access information generated when a memory provided in the apparatus is accessed and cache bypassing information generated when the cache is bypassed,
wherein the determining of the switching point determines the at least one switching point for each thread group using the extracted memory access information and the extracted cache bypassing information.

3. The method of claim 1,
further comprising searching for at least one switchable point for each thread group.

4. The method of claim 3,
wherein the searching for the switchable point comprises
searching for, as the at least one switchable point, at least one point at which data cached by a thread is not reused.
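
As an illustration of the search described in claim 4, the following Python sketch scans a memory access trace and reports every position after which none of the previously touched cache lines is touched again. The trace format (a list of cache-line IDs in access order), the function name `find_switchable_points`, and the exclusion of the final access are assumptions made for this example; the claim itself does not prescribe them.

```python
from typing import List, Sequence


def find_switchable_points(trace: Sequence[int]) -> List[int]:
    """Return indices i such that no line in trace[:i+1] reappears in trace[i+1:].

    `trace` is a hypothetical per-thread-group list of cache-line IDs in
    access order (an assumption for illustration, not the patent's format).
    """
    # Record the last position at which each cache line is accessed.
    last_use = {}
    for pos, line in enumerate(trace):
        last_use[line] = pos

    points = []
    furthest_reuse = -1  # latest last-use position of any line seen so far
    for pos, line in enumerate(trace):
        furthest_reuse = max(furthest_reuse, last_use[line])
        # Once no line seen so far is used later, the cached data is dead.
        # The final access is skipped because nothing follows it.
        if furthest_reuse == pos and pos < len(trace) - 1:
            points.append(pos)
    return points


if __name__ == "__main__":
    # Lines 0 and 1 are last used at indices 2 and 3; lines 2 and 3 follow.
    print(find_switchable_points([0, 1, 0, 1, 2, 3, 2, 3]))
```

A single pass over the last-use table keeps the search linear in the trace length, which matters if the trace comes from profiling a kernel with many accesses.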

5. The method of claim 3,
wherein the searching for the switchable point comprises
searching for the at least one switchable point according to the number of threads in the thread block divided by the total number of thread groups divided by the number of thread groups simultaneously using the cache.

6. The method of claim 3,
wherein the determining of the switching point comprises
determining the at least one switching point for each thread group from among the at least one searched switchable point.

7. The method of claim 3,
wherein the determining of the switching point comprises
calculating, for the at least one searched switchable point, an execution time difference between a thread group that uses the cache and a thread group that does not use the cache, and determining, as the at least one switching point, the at least one switchable point at which the calculated execution time difference is minimized.
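
The selection rule in claim 7 can be sketched as a simple minimization. In the Python example below, the `timings` dictionary, its keys and values, and the tie-breaking rule (prefer the smaller candidate) are hypothetical; the claim specifies only that the candidate with the smallest execution time difference is chosen.

```python
from typing import Dict, Tuple


def pick_switching_point(timings: Dict[int, Tuple[float, float]]) -> int:
    """Pick the candidate whose caching and bypassing groups finish closest together.

    `timings` maps each candidate switchable point to a pair
    (execution time of the caching group, execution time of the bypassing group).
    Both the mapping and its units are assumptions for illustration.
    """
    if not timings:
        raise ValueError("no switchable points to choose from")
    # Minimize the absolute time difference; break ties by the smaller point.
    return min(timings, key=lambda p: (abs(timings[p][0] - timings[p][1]), p))


if __name__ == "__main__":
    measured = {2: (10.0, 4.0), 4: (7.0, 6.5), 6: (5.0, 9.0)}
    print(pick_switching_point(measured))
```

Balancing the two execution times is one way to read the claim's intent: neither the caching groups nor the bypassing groups are left waiting on the other.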

8. The method of claim 3,
wherein the determining of the switching point comprises
determining, as the at least one switching point, at least one switchable point at which the difference in distance between switching points is minimized among the at least one searched switchable point.

9. The method of claim 1, further comprising:
modifying program code based on the determined switching point; and
generating a cache bypassing program by compiling the modified program code,
wherein the bypassing of the cache bypasses the cache used by the thread group according to the generated cache bypassing program.

10. The method of claim 9,
wherein the modifying of the program code comprises
modifying the program code, based on the determined switching point, using a first cache operator that makes the thread group bypass the cache and a second cache operator that makes the thread group use the cache.
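
The code modification in claim 10 can be pictured as a text rewrite of load instructions. The claim does not name the two operators. In NVIDIA PTX, the `.ca` qualifier caches a load at all levels and `.cg` caches it only at the global level (L2), skipping L1, so this Python sketch assumes `.cg` as the first (bypassing) operator and `.ca` as the second (caching) operator. The function names and the rule that thread groups below the switching point keep using the cache are also assumptions for illustration.

```python
import re

# Matches a global load, with or without an existing cache operator.
_GLOBAL_LOAD = re.compile(r"\bld\.global\.(?:(?:ca|cg|cs|lu|cv)\.)?")


def operator_for_group(group_id: int, switching_point: int) -> str:
    """Assumed policy: groups below the switching point cache, the rest bypass."""
    return "ca" if group_id < switching_point else "cg"


def apply_cache_operator(ptx: str, operator: str) -> str:
    """Rewrite every `ld.global` in a PTX snippet to use the given cache operator."""
    if operator not in ("ca", "cg"):
        raise ValueError("operator must be 'ca' (cache) or 'cg' (bypass L1)")
    return _GLOBAL_LOAD.sub(f"ld.global.{operator}.", ptx)


if __name__ == "__main__":
    snippet = "ld.global.f32 %f1, [%rd4];\nst.global.f32 [%rd5], %f1;"
    for group in (0, 5):
        op = operator_for_group(group, switching_point=4)
        print(f"group {group}:")
        print(apply_cache_operator(snippet, op))
```

Stores are left untouched here because the sketch targets only the load path; a real code modifying unit would also need to emit a per-thread-group branch so that both variants coexist in one kernel.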

11. A thread-group-level cache bypassing apparatus comprising:
a switching point determining unit that determines at least one switching point for each thread group that uses a cache; and
a cache bypassing unit that bypasses the cache used by the thread group according to the determined at least one switching point.

12. The apparatus of claim 11,
further comprising an information extracting unit that extracts memory access information generated when a memory provided in the apparatus is accessed and cache bypassing information generated when the cache is bypassed,
wherein the switching point determining unit determines the at least one switching point for each thread group using the extracted memory access information and the extracted cache bypassing information.

13. The apparatus of claim 11,
further comprising a switching point searching unit that searches for at least one switchable point for each thread group.

14. The apparatus of claim 13,
wherein the switching point searching unit
searches for, as the at least one switchable point, at least one point at which data cached by a thread is not reused.

15. The apparatus of claim 13,
wherein the switching point searching unit
searches for the at least one switchable point according to the number of threads in the thread block divided by the total number of thread groups divided by the number of thread groups simultaneously using the cache.

16. The apparatus of claim 13,
wherein the switching point determining unit
determines the at least one switching point for each thread group from among the at least one searched switchable point.

17. The apparatus of claim 13,
wherein the switching point determining unit
calculates, for the at least one searched switchable point, an execution time difference between a thread group that uses the cache and a thread group that does not use the cache, and determines, as the at least one switching point, the at least one switchable point at which the calculated execution time difference is minimized.

18. The apparatus of claim 13,
wherein the switching point determining unit
determines, as the at least one switching point, at least one switchable point at which the difference in distance between switching points is minimized among the at least one searched switchable point.

19. The apparatus of claim 11, further comprising:
a code modifying unit that modifies program code based on the determined switching point; and
a program compiler that compiles the modified program code to generate a cache bypassing program,
wherein the cache bypassing unit bypasses the cache used by the thread group according to the generated cache bypassing program.

20. The apparatus of claim 19,
wherein the code modifying unit
modifies the program code, based on the determined switching point, using a first cache operator that makes the thread group bypass the cache and a second cache operator that makes the thread group use the cache.