KR102142498B1 - GPU memory controller for GPU prefetching through static analysis and method of control - Google Patents

GPU memory controller for GPU prefetching through static analysis and method of control

Publication number
KR102142498B1
Authority
KR
South Korea
Prior art keywords
thread
threadidx
gpu
threads
access pattern
Prior art date
Application number
KR1020180118835A
Other languages
Korean (ko)
Other versions
KR20200039202A (en)
Inventor
한환수
김현준
홍성인
Original Assignee
성균관대학교산학협력단
재단법인 초고성능 컴퓨팅 연구단
Priority date
Filing date
Publication date
Application filed by 성균관대학교산학협력단, 재단법인 초고성능 컴퓨팅 연구단
Priority to KR1020180118835A
Publication of KR20200039202A
Application granted
Publication of KR102142498B1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 - Arrangements for executing specific machine instructions
    • G06F9/3004 - Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047 - Prefetch instructions; cache control instructions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 - General purpose image data processing
    • G06T1/60 - Memory management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/45 - Caching of specific data in cache memory
    • G06F2212/455 - Image or video data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60 - Details of cache memory
    • G06F2212/6022 - Using a prefetch buffer or dedicated prefetch cache

Abstract

The present invention relates to a GPU memory control apparatus and control method that perform GPU prefetching through static analysis of the GPU kernel.
A GPU memory control apparatus according to an embodiment of the present invention comprises:
an index definition module that collects the threads existing in the GPU kernel and defines the index of a global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination module that calculates the distance between threads from the coefficient of the thread ID for each data dimension and determines, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination module that determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

Description

GPU memory controller for GPU prefetching through static analysis and method of control

The present invention relates to a GPU memory control apparatus and control method, and more particularly, to a GPU memory control apparatus and control method that perform GPU prefetching through static analysis of the GPU kernel.

A graphics processing unit (GPU) has a multi-level cache memory structure like a central processing unit (CPU).

FIG. 1 is a block diagram showing a general GPU architecture. Referring to FIG. 1, the GPU includes n SIMD units. SIMD (Single Instruction Multiple Data) is a type of parallel processor that computes several values simultaneously with one instruction. Each GPU manufacturer uses its own name for this unit: it is called a Streaming Multiprocessor by NVIDIA® and a Compute Unit by AMD®. NVIDIA®'s GTX980®, for example, has a total of 16 SIMD units, and inside each SIMD unit there are 4 warp schedulers, each with 32 CUDA cores, 8 load/store units, and special function units. However, since these details are not directly related to the content described in this specification, a more detailed description is omitted.

The SIMD unit may include registers, a scratchpad memory (shared memory), an L1 cache, and a read-only memory (ROM). The n SIMD units may share an L2 cache and a global memory (DRAM). For example, the L1 cache may be 48 KB per SIMD unit, and the L2 cache may be 2 MB. The main memory, shown outside the dotted line on the right, is the memory attached to the CPU and is connected to the SIMD units.

Depending on where they reside, these memories are classified as on-chip memory or off-chip memory. The registers, L1 cache, L2 cache, read-only memory (ROM), and scratchpad memory are on-chip memory; the global memory is off-chip memory.

Meanwhile, GPUs are capable of large-scale parallel processing, but their computational performance is not fully utilized because of slow memory read/write performance. As described above, GPU memory is divided into off-chip memory and on-chip memory, and since the access latencies of the two differ by a factor of 10 to 100 or more, an inefficient memory access pattern has a large impact on performance.

Accordingly, optimization studies have sought to minimize accesses to off-chip memory by exploiting on-chip memory. One such line of work maximizes the bandwidth utilization between the two memories by prefetching data from off-chip memory.

In general, research has prefetched specific memory access patterns into the GPU's L1 cache, but the GPU L1 cache is not large enough to be suitable as a prefetching space.

GPU scratchpad memory is one of the on-chip memories; it has read/write performance similar to the L1 cache, and the user can allocate its space explicitly. Because data prefetched by the user is guaranteed to remain in this space, prefetching techniques that use it are being studied.
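
For orientation only, the following is a minimal CUDA sketch of this idea (the kernel, array names, and tile size are assumptions for illustration, not the mechanism claimed below): each thread block stages a tile of a global array into shared (scratchpad) memory before computing on it.

    #include <cstdio>

    #define TILE 256

    // One block of TILE threads walks the array tile by tile, prefetching each
    // tile from global (off-chip) memory into shared (on-chip) memory first.
    __global__ void scaled_copy(const float *in, float *out, int n, float s) {
        __shared__ float tile[TILE];                  // user-allocated scratchpad space
        for (int base = 0; base < n; base += TILE) {
            int idx = base + threadIdx.x;
            if (idx < n)
                tile[threadIdx.x] = in[idx];          // prefetch into shared memory
            __syncthreads();                          // tile is now resident on-chip
            if (idx < n)
                out[idx] = s * tile[threadIdx.x];     // compute from on-chip memory
            __syncthreads();                          // buffer is reused next tile
        }
    }

    int main() {
        const int n = 1024;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; ++i) in[i] = 1.0f;
        scaled_copy<<<1, TILE>>>(in, out, n, 2.0f);
        cudaDeviceSynchronize();
        printf("%f\n", out[0]);                       // expect 2.0
        cudaFree(in);
        cudaFree(out);
        return 0;
    }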

Korean Patent Publication No. 10-2015-0092440

An object of the present invention is to provide a GPU memory control device and a control method for performing GPU prefetch through static analysis of a GPU kernel.

A GPU memory control apparatus according to an embodiment of the present invention comprises:

an index definition module that collects the threads existing in the GPU kernel and defines the index of a global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination module that calculates the distance between threads from the coefficient of the thread ID for each data dimension and determines, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination module that determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

In the GPU memory control apparatus according to an embodiment of the present invention, the index of the global memory variable may be defined by Equation (1) below.

Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]

(Here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant.)

In the GPU memory control apparatus according to an embodiment of the present invention, the thread-dense memory access pattern determination module may determine a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.

In the GPU memory control apparatus according to an embodiment of the present invention, the prefetch target determination module may determine a thread having the thread-dense memory access pattern as a thread for which prefetch target data exists when the coefficient of the loop variable is not 0.

A GPU memory control method according to an embodiment of the present invention comprises:

an index definition step of collecting the threads existing in the GPU kernel and defining the index of a global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination step of calculating the distance between threads from the coefficient of the thread ID for each data dimension and determining, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination step of determining, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

In the GPU memory control method according to an embodiment of the present invention, the index of the global memory variable may be defined by Equation (1) below.

Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]

(Here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant.)

In the GPU memory control method according to an embodiment of the present invention, the thread-dense memory access pattern determination step may determine a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.

In the GPU memory control method according to an embodiment of the present invention, the prefetch target determination step may determine a thread having the thread-dense memory access pattern as a thread for which prefetch target data exists when the coefficient of the loop variable is not 0.

A computer program stored in a recording medium according to an embodiment of the present invention causes a computing means to execute:

an index definition step of collecting the threads existing in the GPU kernel and defining the index of a memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable; a thread-dense memory access pattern determination step of calculating the distance between threads from the coefficient of the thread ID for each data dimension and determining, based on that distance, whether the access is a thread-dense memory access pattern; and a prefetch target determination step of determining, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

Other specific details of implementations according to various aspects of the present invention are included in the detailed description below.

Conventional studies have relied on dynamic analysis (e.g., on-line profiling). Dynamic analysis permits accurate program analysis even for irregular patterns, but it has the drawback that a new analysis must be performed, by executing the program, for every set of program inputs. In contrast, the GPU memory control apparatus and control method of the present invention use a static analysis method that works at the source code level; a program can be analyzed without being executed, which saves analysis time, analysis cost, and analysis manpower.

FIG. 1 is a block diagram showing a general GPU architecture.
FIG. 2 is a block diagram illustrating a GPU memory control apparatus according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a GPU memory control method according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the GPU memory control method of FIG. 3 in more detail.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings so that those skilled in the art to which the present invention pertains can easily practice them. As will be readily understood by those of ordinary skill in the art, the embodiments described below may be modified in various forms without departing from the concept and scope of the present invention. Where possible, the same or similar parts are indicated by the same reference numerals in the drawings.

The terminology used herein refers only to specific embodiments and is not intended to limit the invention. Singular forms used herein also include plural forms unless the phrase clearly indicates otherwise.

As used herein, the meaning of "comprising" specifies particular properties, regions, integers, steps, operations, elements and/or components, and does not exclude the presence or addition of other properties, regions, integers, steps, operations, elements, components and/or groups thereof.

All terms used in this specification, including technical and scientific terms, have the same meanings as those generally understood by those skilled in the art to which the present invention pertains. Terms defined in dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless expressly so defined. Hereinafter, a GPU memory control apparatus and control method according to an embodiment of the present invention will be described with reference to the drawings.

FIG. 2 is a block diagram illustrating a GPU memory control apparatus according to an embodiment of the present invention. The static analysis method of the present invention works at the source code level and determines prefetch targets without executing the program. The GPU memory control apparatus that performs it includes an index definition module 100, a thread-dense memory access pattern determination module 200, and a prefetch target determination module 300, as shown in FIG. 2.

The index definition module 100 collects the threads existing inside the GPU kernel. The index definition module 100 may collect threads using ANTLR (ANother Tool for Language Recognition), which can parse a GPU CUDA program and generate an abstract syntax tree (AST). ANTLR is a parser generator that uses LL(*) parsing for computer-based language recognition.

For static analysis, the index definition module 100 defines the index of the global memory variable using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable.

Data dimension refers to how the threads are arrayed. For example, when 100 threads are arrayed in one dimension, they may be arrayed as T1, T2, ... T100. Likewise, when 100 threads are arrayed in two dimensions, they may be arrayed as T(1,1), T(1,2), ... T(10,10).
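
For illustration (a hypothetical CUDA launch configuration, not taken from the patent text), the same 100 threads can be laid out in one dimension or in two dimensions through the block configuration:

    #include <cstdio>

    __global__ void show_1d() {                  // 100 threads arrayed in one dimension
        printf("T%d\n", threadIdx.x + 1);        // prints T1 ... T100
    }

    __global__ void show_2d() {                  // 100 threads arrayed in two dimensions
        printf("T(%d,%d)\n", threadIdx.y + 1, threadIdx.x + 1);  // prints T(1,1) ... T(10,10)
    }

    int main() {
        show_1d<<<1, dim3(100)>>>();
        cudaDeviceSynchronize();
        show_2d<<<1, dim3(10, 10)>>>();
        cudaDeviceSynchronize();
        return 0;
    }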

The index of the memory variable defined in the index definition module 100 is as shown in Equation (1) below.

Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]

Here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant.

CUDA, a programming language that enables programmer-written code to run on NVIDIA GPUs, gives each thread a unique thread ID (threadIdx). The index definition module 100 uses the thread ID, by which each thread requests data at an independent memory address. The thread ID (threadIdx) is expressed as threadIdx.x, threadIdx.y, and threadIdx.z according to the data dimension.
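
As a concrete example (a hypothetical one-dimensional kernel, assumed for illustration and not taken from the patent), the global memory index below has exactly the shape of Equation (1), with Tx = 1, S = N, and C = 0:

    // Each thread sums one column of a row-major N x N matrix. The index
    // a[1*threadIdx.x + N*i + 0] matches Equation (1): Tx = 1, S = N, C = 0.
    __global__ void column_sum(const float *a, float *out, int N) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += a[1 * threadIdx.x + N * i + 0];
        out[threadIdx.x] = acc;
    }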

The thread-dense memory access pattern determination module 200 calculates the distance between threads from the coefficient of the thread ID for each data dimension and determines, based on that distance, whether the access is a thread-dense memory access pattern.

Specifically, the thread-dense memory access pattern determination module 200 makes the determination using Equation (2) below; the coefficient of the thread ID for each data dimension is the distance between the addresses accessed by neighboring threads in that dimension.

Equation (2): (Tx = 0) ∨ (Ty = 0) ∨ (Tz = 0)

Here, ∨ denotes the logical operator "or". If at least one of the thread ID coefficients (Tx, Ty, Tz) for each data dimension is 0, the result of Equation (2) is "True".

For example, when the data dimension is one-dimensional, the basic index of a global memory instruction has the form [Tx × threadIdx.x + S × i + C]. When the coefficient Tx of threadIdx.x is 0, that is, when Equation (2) is true, the memory access pattern becomes [S × i + C], and all threads access the same address S × i + C in the same loop iteration. This is a thread-dense pattern in which all threads within the same dimension request data at the same address.

Likewise, when the data dimension is two-dimensional and the coefficient of threadIdx.x or threadIdx.y is 0, or when the data dimension is three-dimensional and the coefficient of at least one of threadIdx.x, threadIdx.y, and threadIdx.z is 0, the corresponding data has a thread-dense memory access pattern.

That is, the thread-dense memory access pattern determination module 200 determines the corresponding data as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.
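
To make the one-dimensional case concrete (hypothetical kernels, assumed for illustration), compare an index whose Tx is 0 with one whose Tx is 1:

    // Tx = 0: the index [0*threadIdx.x + 1*i + 0] does not depend on the thread
    // ID, so in loop iteration i every thread reads the same address b[i].
    // By Equation (2) this is a thread-dense memory access pattern.
    __global__ void dense_reads(const float *b, float *out, int N) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += b[i];
        out[threadIdx.x] = acc;
    }

    // Tx = 1: the index [1*threadIdx.x + 1*i + 0] differs per thread, so threads
    // read different addresses and the pattern is not thread-dense.
    __global__ void non_dense_reads(const float *b, float *out, int N) {
        float acc = 0.0f;
        for (int i = 0; i < N; ++i)
            acc += b[threadIdx.x + i];
        out[threadIdx.x] = acc;
    }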

The prefetch target determination module 300 determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern.

The prefetch target determination module 300 determines whether prefetch target data exists using Equation (3) below.

Equation (3): S ≠ 0

Here, S is the coefficient of the loop variable (i) described in Equation (1), and it is also the distance between the addresses a thread accesses in successive loop iterations.

According to Equation (3), if the coefficient S of the loop variable is not 0, that is, if the distance between addresses is 1 or more, a thread accesses a different address in each loop iteration, so it can be judged that data to prefetch exists.

In summary, for static analysis of the index of a global memory instruction in CUDA code, the index of the global memory variable is defined as in Equation (1); the distance between threads is then examined using Equation (2) to determine whether the access is a thread-dense memory access pattern; and finally, for the thread-dense memory access pattern, Equation (3) determines whether data to prefetch exists.
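
The following host-side sketch condenses this decision procedure (the struct and function names are illustrative assumptions, not the patent's implementation):

    // Index of Equation (1): Tx*threadIdx.x + Ty*threadIdx.y + Tz*threadIdx.z + S*i + C
    struct IndexExpr {
        int Tx, Ty, Tz;   // thread ID coefficients per data dimension
        int S;            // loop variable coefficient
        int C;            // constant term
        int dims;         // data dimensions actually used: 1, 2, or 3
    };

    // Equation (2): thread-dense if at least one used thread ID coefficient is 0.
    bool isThreadDense(const IndexExpr &e) {
        if (e.dims >= 1 && e.Tx == 0) return true;
        if (e.dims >= 2 && e.Ty == 0) return true;
        if (e.dims >= 3 && e.Tz == 0) return true;
        return false;
    }

    // Equation (3): among thread-dense accesses, prefetch if S != 0, i.e. the
    // thread touches a different address on every loop iteration.
    bool isPrefetchTarget(const IndexExpr &e) {
        return isThreadDense(e) && e.S != 0;
    }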

Next, a GPU memory control method according to an embodiment of the present invention will be described with reference to FIGS. 3 and 4. FIG. 3 is a flowchart illustrating the GPU memory control method according to an embodiment of the present invention, and FIG. 4 is a flowchart illustrating the same method in more detail.

Referring to FIGS. 3 and 4, the GPU memory control method according to an embodiment of the present invention includes an index definition step (S100), a thread-dense memory access pattern determination step (S200), and a prefetch target determination step (S300).

In the index definition step (S100, S100'), the threads existing inside the GPU kernel are collected, and the index of the global memory variable is defined using the coefficient of the thread ID for each data dimension of each collected thread's unique thread ID and the coefficient of the loop variable. The index of the memory variable defined in the index definition step (S100) is as in Equation (1) described above.

In the thread-dense memory access pattern determination step (S200, S200'), the distance between threads is calculated from the coefficient of the thread ID for each data dimension, and whether the access is a thread-dense memory access pattern is determined from that distance. In the thread-dense memory access pattern determination step (S200), the distance between threads is calculated using Equation (2) described above; if at least one of the thread ID coefficients for each data dimension is 0, the data is determined as having a thread-dense memory access pattern.

In the prefetch target determination step (S300, S300'), whether prefetch target data exists among the data having a thread-dense memory access pattern is determined from the coefficient of the loop variable, using Equation (3) described above. In the prefetch target determination step (S300), if the coefficient S of the loop variable is not 0, that is, if the distance between addresses is 1 or more, a thread accesses a different address in each loop iteration, so it can be determined that data for prefetching exists.

When the data to be prefetched is determined through the above steps, it is added to the prefetch list (S400, S400').

Conventional studies that minimize accesses to off-chip memory by using on-chip memory rely on dynamic analysis (e.g., on-line profiling). Dynamic analysis permits accurate program analysis of irregular patterns, but has the disadvantage that a new analysis must be performed, by executing the program, for every set of program inputs. In contrast, since the static analysis method proposed in the present invention works at the source code level, a program can be analyzed without being executed. Because GPU programs generally exhibit regular memory access patterns, static analysis alone provides sufficiently accurate program analysis.

Although one embodiment of the present invention has been described above, those skilled in the art may variously modify and change the present invention by adding, changing, or deleting elements within the scope of the present invention as set forth in the claims, and such modifications are also included within the scope of the present invention.

100: index definition module
200: thread-dense memory access pattern determination module
300: Prefetch target determination module

Claims (9)

  1. A GPU memory control apparatus comprising: an index definition module that collects threads existing in the GPU kernel and defines an index of a global memory variable using the coefficient of the thread ID for each data dimension of the unique thread ID of the collected threads and the coefficient of the loop variable;
    a thread-dense memory access pattern determination module that calculates a distance between threads from the coefficient of the thread ID for each data dimension and determines, based on the distance between the threads, whether the access is a thread-dense memory access pattern; and
    a prefetch target determination module that determines, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern,
    wherein the index of the global memory variable defined by the index definition module satisfies Equation (1) below:
    Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]
    (here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant), and
    wherein the thread-dense memory access pattern determination module determines a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.
  2. delete
  3. delete
  4. The GPU memory control apparatus according to claim 1, wherein the prefetch target determination module determines a thread having the thread-dense memory access pattern as a thread for which prefetch target data exists when the coefficient of the loop variable is not 0.
  5. delete
  6. delete
  7. delete
  8. delete
  9. A computer program stored in a recording medium for causing a computing means to execute:
    an index definition step of collecting threads existing in the GPU kernel and defining an index of a global memory variable using the coefficient of the thread ID for each data dimension of the unique thread ID of the collected threads and the coefficient of the loop variable;
    a thread-dense memory access pattern determination step of calculating a distance between threads from the coefficient of the thread ID for each data dimension and determining, based on the distance between the threads, whether the access is a thread-dense memory access pattern; and
    a prefetch target determination step of determining, using the coefficient of the loop variable, whether prefetch target data exists among the data having a thread-dense memory access pattern,
    wherein the index of the global memory variable defined in the index definition step satisfies Equation (1) below:
    Equation (1): [Tx × threadIdx.x + Ty × threadIdx.y + Tz × threadIdx.z + S × i + C]
    (here, threadIdx.x, threadIdx.y, and threadIdx.z are the thread IDs for each data dimension, Tx, Ty, and Tz are the coefficients of the thread IDs for each data dimension, i is the loop variable, S is the coefficient of the loop variable, and C is a constant), and
    wherein the thread-dense memory access pattern determination step determines a thread as having a thread-dense memory access pattern when at least one of the thread ID coefficients for each data dimension is 0.
KR1020180118835A 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control KR102142498B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020180118835A KR102142498B1 (en) 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020180118835A KR102142498B1 (en) 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control

Publications (2)

Publication Number Publication Date
KR20200039202A KR20200039202A (en) 2020-04-16
KR102142498B1 2020-08-10

Family

ID=70454886

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020180118835A KR102142498B1 (en) 2018-10-05 2018-10-05 GPU memory controller for GPU prefetching through static analysis and method of control

Country Status (1)

Country Link
KR (1) KR102142498B1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050022196A1 (en) 2000-04-04 2005-01-27 International Business Machines Corporation Controller for multiple instruction thread processors
WO2006038991A2 (en) 2004-08-17 2006-04-13 Nvidia Corporation System, apparatus and method for managing predictions of various access types to a memory associated with cache
WO2014178683A1 (en) * 2013-05-03 2014-11-06 삼성전자 주식회사 Cache control device for prefetching and prefetching method using cache control device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102100161B1 (en) 2014-02-04 2020-04-14 삼성전자주식회사 Method for caching GPU data and data processing system therefore

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050022196A1 (en) 2000-04-04 2005-01-27 International Business Machines Corporation Controller for multiple instruction thread processors
WO2006038991A2 (en) 2004-08-17 2006-04-13 Nvidia Corporation System, apparatus and method for managing predictions of various access types to a memory associated with cache
WO2014178683A1 (en) * 2013-05-03 2014-11-06 삼성전자 주식회사 Cache control device for prefetching and prefetching method using cache control device

Also Published As

Publication number Publication date
KR20200039202A (en) 2020-04-16


Legal Events

Date Code Title Description
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant