KR101442643B1 - The Cooperation System and the Method between CPU and GPU - Google Patents


Info

Publication number
KR101442643B1
Authority
KR
South Korea
Prior art keywords
gpu
cpu
data
task
cache
Prior art date
Application number
KR1020130048061A
Other languages
Korean (ko)
Inventor
황태호
김동순
Original Assignee
전자부품연구원
Priority date
Filing date
Publication date
Application filed by 전자부품연구원 filed Critical 전자부품연구원
Priority to KR1020130048061A
Application granted
Publication of KR101442643B1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815 Cache consistency protocols
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10 Address translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/167 Interprocessor communication using a common memory, e.g. mailbox
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; Formation of operand address; Addressing modes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present invention relates to an efficient cooperation structure between a CPU and a GPU. It provides a cooperation system, and a method thereof, that improve cooperation efficiency by offloading GPU control to a separate unit, reducing the CPU load, and by passing the GPU the addresses of the data needed for a task instead of copying the data when the task is delegated. To resolve cache inconsistency between the CPU and the GPU, cache coherency is maintained by extending a protocol conventionally used to keep caches coherent among multiple CPUs.

Description

[0001] The present invention relates to a cooperative system between a CPU and a GPU.

The present invention relates to a collaboration system between a CPU and a graphics processor (GPU) and a method thereof, and more particularly, to a memory structure and management method for efficient collaboration between a CPU and a GPU.

Recently, application processors (APs) such as the Samsung Exynos, NVIDIA Tegra, and Texas Instruments OMAP have integrated multi-core ARM Cortex CPUs together with multi-core GPUs, such as NVIDIA designs or the Imagination SGX, on a single chip.

Traditionally, in the case of multiple CPUs, the primary or secondary caches are shared in order to improve system performance. In addition, a protocol such as MESI (Modified, Exclusive, Shared, Invalid) is adopted for coherency between the caches belonging to each CPU, and a Snoop Control Unit (SCU) is installed for this purpose. To minimize access to external memory, write-back, write-once, and write-allocate policies are applied.
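
For illustration, a minimal C++ sketch of the MESI bookkeeping this paragraph refers to; real SCUs implement this in hardware, and all names here are hypothetical rather than taken from the patent:

```cpp
#include <cstdint>

// Software model of MESI line states for one cache line.
enum class MesiState { Modified, Exclusive, Shared, Invalid };

struct CacheLine {
    uint64_t  tag   = 0;
    MesiState state = MesiState::Invalid;
};

// Local read miss: snooping decides between Exclusive (no other copy)
// and Shared (a copy exists in another cache).
MesiState onLocalRead(CacheLine& line, bool otherCacheHasCopy) {
    if (line.state == MesiState::Invalid)
        line.state = otherCacheHasCopy ? MesiState::Shared
                                       : MesiState::Exclusive;
    return line.state;
}

// Local write: remote copies are invalidated first, then the line is
// Modified (dirty, written back to memory lazily, i.e. write-back).
MesiState onLocalWrite(CacheLine& line) {
    line.state = MesiState::Modified;
    return line.state;
}

// Remote write observed via snooping: the local copy becomes stale.
void onRemoteWrite(CacheLine& line) {
    line.state = MesiState::Invalid;
}
```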

The GP-GPU, first introduced by Intel and AMD, has since been brought into APs and integrated into a single chip as mentioned above. Commonly, the CPU and GPU share a lower-level cache. However, memory management differs greatly between mobile APs and PCs.

For example, in the AMD Fusion APU, the CPU and GPU each have their own page table. The ARM Mali T604, on the other hand, manages memory with a shared page table, as the Cortex-A15 does. Which approach is better has not yet been established.

Currently, in a CPU/GPU integrated system, the CPU controls the GPU through a bridge (PC) or a bus (AP). Generally, the CPU delegates the code and data of the tasks to be processed to the GPU through the memory interface, copying them into GPU local memory; the GPU processes the data and copies the result back to the CPU's main memory. To this end, the operating system's software driver in the CPU-GPU integrated system controls the GPU through the CPU's bridge or bus interface, and memory sharing and the cache controller operate independently of this control structure.

However, this degrades system performance. Direct inter-processor communication between the CPU and GPU is therefore required, and a separate control unit needs to be added for it. It also remains to be verified whether the CPU and GPU should keep separate page tables or share a common page table when sharing caches.

SUMMARY OF THE INVENTION It is an object of the present invention to provide a system and method for cooperation between a CPU and a GPU that can reduce the load on the CPU by controlling the GPU through a separate control module.

It is another object of the present invention to provide a cache coherence control module that is effective in maintaining cache coherence between a CPU and a GPU by extending a conventional protocol for solving a cache coherency problem between multiprocessors.

The present invention provides a collaborative system between a CPU and a GPU, comprising: a task manager for receiving a task requested by the CPU, requesting it of the GPU, and sending the task result processed by the GPU to the CPU; an address mapping unit for mapping the address space of the GPU to the address space of the main memory; a prefetcher that fetches, from the main memory into the cache memory, the data to be processed next while the GPU is processing the current data; and a cache coherency controller for keeping the data stored in the cache memory of the CPU consistent with the data stored in the cache memory of the GPU.

According to one aspect of the present invention, there is provided a collaboration system between the CPU and the GPU in which the task management unit receives, from the CPU, code information corresponding to the task requested by the CPU and address information of the data necessary for performing the task.

According to another aspect of the present invention, there is provided a collaboration system between the CPU and the GPU in which the task management unit loads into the address mapping unit a table mapping the address space of the GPU to the address information of the data required for the task.

According to another aspect of the present invention, there is provided a collaboration system between the CPU and the GPU in which the task management unit distributes the task requested by the CPU across the cores of the GPU and monitors the operation status of each GPU core.

According to another aspect of the present invention, there is provided a collaboration system between the CPU and the GPU in which, when an operation signal is received from the task management unit, the prefetcher fetches the data required by the GPU from the main memory into the cache memory and removes already-processed data from the cache memory.

According to another aspect of the present invention, there is provided a collaboration system between the CPU and the GPU in which the task management unit checks whether the data stored in the cache memory of the CPU and the data stored in the cache memory of the GPU need to be kept consistent and, if so, operates the cache coherency controller.

The present invention also provides a method of collaboration between a CPU and a GPU, the method comprising: receiving a task requested by the CPU and requesting it of the GPU; mapping the address space of the GPU to the address space of a main memory; transferring the result processed by the GPU to the CPU; identifying the data to be processed after the data currently being processed by the GPU; fetching the identified data from the main memory into a cache memory; and, when the data of the GPU and the data of the CPU need to be kept consistent, activating a cache coherency control module to reconcile them.

According to an aspect of the present invention, receiving the task requested by the CPU and requesting it of the GPU includes: receiving, from the CPU, code information corresponding to the task and address information of the data necessary for the task; and distributing the received task across the cores of the GPU while monitoring the work status of each GPU core.

According to another aspect of the present invention, mapping the address space of the GPU to the address space of the main memory includes: generating a table mapping the address space of the GPU to the address information of the data required for the task; and translating the addresses of the GPU by referring to the table.

The present invention provides a collaborative system between a CPU and a GPU in which only the data region the CPU delegates to the GPU is shared, synchronized with a control module that manages GPU operations. The virtual address space used by the CPU can thus be accessed directly from the cache without copying between memories, which greatly improves performance.

In addition, prefetching from the main memory into the cache can be controlled effectively by synchronizing it with the operation of the task management module in the cache-level sharing structure, thereby minimizing the GPU's direct accesses to main memory.

Furthermore, since coherency control between the CPU and GPU caches can be enabled or disabled by the CPU through the task management module on a per-task basis, the structure mitigates the performance degradation caused by snooping.

BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a diagram showing the structure of a conventional collaboration system between a CPU and a GPU.
FIG. 2 is a diagram illustrating the structure of a collaboration system between a CPU and a GPU according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating the structure of a task manager in a collaboration system between a CPU and a GPU according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating the structure of an address mapping unit in a collaboration system between a CPU and a GPU according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating the structure of a prefetcher in a collaboration system between a CPU and a GPU according to an embodiment of the present invention.
FIGS. 6 to 10 are views for explaining the structure of a cache coherency controller in a collaboration system between a CPU and a GPU according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating the structure of an extended collaboration system between a CPU and a GPU according to an embodiment of the present invention.

DETAILED DESCRIPTION The advantages and features of the present invention, and the manner of achieving them, will become apparent from the embodiments described in detail below with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art; the invention is defined by the scope of the claims.

It is to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. In this specification, the singular includes the plural unless specifically stated otherwise. As used herein, the terms "comprises" and/or "comprising" do not preclude the presence or addition of one or more other components, steps, or operations. Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 2 is a diagram illustrating the structure of a collaboration system between a CPU and a GPU according to an embodiment of the present invention.

The collaboration system between the CPU and the GPU according to an embodiment of the present invention includes a task manager 200, an address mapping unit 210, a prefetcher 220, and a cache coherency controller 230. The prefetcher 220 and the cache coherency controller 230 are connected to each other.

The task manager (CPU/GPU Inter-Processor Communication Controller, 200) is designated to communicate with both processors, so that the CPU does not have to drive the GPU directly through a bus or a bridge.

The task management unit 200 is closely coupled to the CPU through the CPU's co-processor interface, distributes the requests generated by the CPU across the GPU cores, and reports the processing results back to the CPU. The task management unit 200 therefore includes an interface for exchanging the necessary information with the CPU.
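
The patent specifies what crosses this interface (GPU-compiled code, data addresses, per-core offsets, parameters) but not a concrete layout. As a minimal sketch, with every field name an assumption:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical descriptor for a task the CPU delegates through the
// task management unit 200. The layout is illustrative only; the
// patent describes the exchanged information, not a structure.
struct GpuTask {
    uint64_t codeAddr = 0;              // address of GPU-compiled kernel code
    uint64_t dataAddr = 0;              // base address of the task's data
    uint64_t dataSize = 0;              // total bytes to be processed
    std::vector<uint64_t> coreOffsets;  // data offset per GPU core
    std::vector<uint32_t> params;       // kernel parameters
};
```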

The re-mapper (memory management unit for the GPU, 210) assists in mapping the address space of the GPU to the address space of the main memory used by the CPU.

Existing GPUs do not use a virtual memory address space but access physical addresses directly. Even when a GPU uses virtual addresses through a separate MMU, its address space differs from the one the CPU uses, so a function is needed to map the address space the GPU sees onto the address space described by the page table of the main memory used by the CPU. This function is handled by the address mapping unit 210, and the GPU side accesses the unified shared memory through it.

The prefetcher 220 identifies data-block access patterns between the main memory and the L2 cache, takes them as a reference pattern, and prefetches the data that will be needed.

The cache coherency controller 230 enables the CPU and the GPU to share caches. It is designed as an extension of the existing Snoop Control Unit (SCU), so that coherency is maintained with the GPU as well as among the CPUs.

The collaboration process by the collaboration system between the CPU and the GPU according to an embodiment of the present invention proceeds as follows.

The CPU transfers, to the designated interface of the task management unit 200, the code and data compiled for the GPU cores together with the address and offset information of the data as partitioned per GPU core. The task management unit 200 remaps the given main-memory data address information into the GPU address space and loads it into the address mapping unit 210.

Based on the given address information, the task management unit 200 operates the prefetcher 220 to fetch data from the main memory into the L2 cache in advance, and operates the cache coherency controller 230 when the CPU requires cache coherency control.

The task manager 200 allocates tasks to each core of the GPU; while the tasks are being processed on the GPU, the data to be processed next is fetched into the L2 cache via the prefetcher 220, and already-processed cache data is flushed to main memory.

Upon completion of the delegated task, the GPU sends a completion signal to the task management unit 200, and the task management unit 200 notifies the CPU that the task is completed.
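
Collecting the steps above into one place, a sketch of the delegation sequence, reusing the hypothetical GpuTask above; TaskManager and its methods are stand-ins for the hardware units, not an API defined by the patent:

```cpp
// Sketch of the collaboration flow; the empty bodies stand in for
// hardware behavior of units 200, 210, 220, and 230.
struct TaskManager {
    void loadRemapTable(const GpuTask&)  {}  // program unit 210
    void startPrefetch(const GpuTask&)   {}  // kick unit 220
    void enableCoherency()               {}  // arm unit 230
    void dispatchToCores(const GpuTask&) {}  // partition across GPU cores
    void waitForCompletion()             {}  // GPU signals completion
    void notifyCpu()                     {}  // CPU informed, no data copy
};

void delegateTask(TaskManager& tm, const GpuTask& task, bool needsCoherency) {
    tm.loadRemapTable(task);   // remap main-memory addresses into GPU space
    tm.startPrefetch(task);    // first data window pulled into L2
    if (needsCoherency)
        tm.enableCoherency();  // coherency control only when required
    tm.dispatchToCores(task);  // allocate work to each GPU core
    tm.waitForCompletion();    // processed data flushed, next window fetched
    tm.notifyCpu();
}
```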

FIG. 3 is a diagram illustrating the structure of a task management unit in a collaboration system between a CPU and a GPU according to an embodiment of the present invention.

In the existing approach, the CPU delegates tasks to the GPU by directly managing the GPU's host request queue through the system bus. The GPU device driver software on the CPU must therefore continuously manage the GPU's operation through the system bus and its interrupt interface.

The present invention improves on this by delegating the management of GPU-executed tasks to the task management unit, a separate hardware device. Through the task manager, the CPU can significantly reduce the administrative load associated with the GPU.

The task manager is attached to the CPU through the same interface as co-processor instructions, and provides registers through which GPU execution can be configured: the memory address, the per-core offset, and the parameters. It also provides the ability to monitor the status and behavior of each core's work on the GPU.
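
For illustration, such a register block might be modeled as below; every offset, width, and bit assignment is invented for the sketch, since the patent describes the registers only functionally:

```cpp
#include <cstdint>

// Hypothetical memory-mapped register block for the task manager.
struct TaskManagerRegs {
    volatile uint32_t control;       // bit 0: start, bit 1: coherency on
    volatile uint32_t status;        // per-core busy/done bits
    volatile uint64_t codeAddr;      // address of GPU kernel code
    volatile uint64_t dataAddr;      // base address of task data
    volatile uint64_t coreOffset[4]; // per-core data offsets
    volatile uint32_t param[8];      // kernel parameters
};

// Polling a core's completion bit models the monitoring capability.
inline bool coreDone(const TaskManagerRegs& r, unsigned core) {
    return (r.status >> core) & 1u;
}
```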

The task manager is designed to support not only a single host-CPU interface but also additional interfaces (up to four), so that it can manage operations with heterogeneous processors such as multi-core processors and coordinate collaboration with other GPU hardware.

FIG. 4 is a diagram illustrating the structure of an address mapping unit in a collaboration system between a CPU and a GPU according to an embodiment of the present invention.

The OpenCL and OpenGL models were designed on the assumption that a CPU-GPU system operates with non-unified memory. In other words, because the memories were physically separate, the virtual memory address space used by the CPU and the memory address space used by the GPU evolved independently. Recently, however, CPU-GPU structures have developed into shared-memory-based structures on a SoC, so the CPU and GPU require addressing and translation over the unified shared memory. A common solution is to have the GPU use the same virtual memory address space by referring, through its own TLB, to the same page table in main memory that the CPU uses.

Generally, a GPU is entrusted by the CPU with processing a large amount of data, dividing it sequentially for parallel processing and returning the result. Considering this, sharing a common address mapping table through TLBs for access to the unified shared memory is problematic: the GPU receives a large range of data, and each core that makes up the GPU translates its corresponding portion through the TLB.

However, given the limited TLB size and the low reuse rate of translation entries caused by the GPU's partitioned, sequential processing pattern, frequent TLB misses are unavoidable when the data to be processed by the GPU is large. Moreover, when many GPU cores access the memory bus each with its own TLB, more traffic is generated and the implementation complexity also increases.
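
For a sense of scale (illustrative numbers, not taken from the patent): a 64-entry TLB with 4 KiB pages covers only 256 KiB of address space at a time, while a delegated workload streaming 64 MiB touches 16,384 distinct pages; if each page is read once, sequentially, and never revisited, essentially every new page costs a TLB miss and a page-table walk.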

To solve this problem, the present invention takes the following approach. Since the range and location of the necessary data are determined before the CPU delegates work to the GPU, the driver behind the OpenCL/OpenGL API on the CPU allocates the memory to be passed to the GPU in contiguous pages as far as possible, and loads a table mapping it to the GPU's virtual addresses into the address mapping unit. If the data is fragmented at page granularity rather than lying in consecutive pages, the page information is remapped into a contiguous virtual address space for the GPU and reflected in the address mapping table.

The address mapping table contains the page address information of all the data to be passed to the GPU. The GPU performs address translation by referring to the mapping table loaded in the address mapping unit, without any further memory access for translation.

Address translation in the address mapping unit is performed by translator devices, implemented one per GPU core, that consult the mapping table; accesses to the shared memory then go through the cache controller using the translated address.
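
A minimal sketch of the table-based translation this paragraph describes, assuming 4 KiB pages and a dense table indexed by GPU virtual page number (both assumptions):

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kPageShift = 12;                 // assume 4 KiB pages
constexpr uint64_t kPageMask  = (1ull << kPageShift) - 1;

// The remap table holds the physical page address of every page handed
// to the GPU, indexed by contiguous GPU virtual page number, so a
// lookup needs no further memory access for translation.
struct RemapTable {
    uint64_t gpuBase = 0;            // start of the GPU's virtual range
    std::vector<uint64_t> physPage;  // page-aligned physical addresses

    uint64_t translate(uint64_t gpuVaddr) const {
        uint64_t page = (gpuVaddr - gpuBase) >> kPageShift;
        return physPage.at(page) | (gpuVaddr & kPageMask);
    }
};
```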

FIG. 5 is a diagram illustrating the structure of a prefetcher in a collaboration system between a CPU and a GPU according to an embodiment of the present invention. The GPU divides delegated work into parallel and sequential parts; to manage such tasks more efficiently, the present invention designs the prefetcher with the structure shown in FIG. 5.

When the GPU starts work through the task manager, the prefetcher reserves L2 cache space amounting to twice what a single task requires for a GPU core and divides it into two windows. The first window holds the data needed for the current GPU operation, while the second window is reserved for loading the data to be processed next.

Within the reserved window area, the L2 cache controller does not apply the usual eviction rule; the two windows are dedicated to hiding the GPU's memory latency.
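
The two-window scheme is classic double buffering. A sketch of the bookkeeping, with the cache hardware abstracted away and all names assumed:

```cpp
#include <cstdint>
#include <utility>

// Two L2 regions of equal size, reserved at task start: the GPU reads
// from 'active' while the prefetcher fills 'next'; on completion of a
// chunk the roles swap, hiding memory latency behind computation.
struct PrefetchWindows {
    uint64_t base[2];     // L2 addresses of the two reserved windows
    uint64_t windowSize;  // space required by a single task chunk
    int active = 0;       // window the GPU is currently consuming
    int next   = 1;       // window being filled for the next chunk

    void onChunkComplete() {
        // Data in 'active' is processed and may be flushed to memory;
        // the freshly filled window becomes the one the GPU reads.
        std::swap(active, next);
    }
};
```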

FIG. 6 is a diagram illustrating the structure of a cache coherency controller in a collaboration system between a CPU and a GPU according to an embodiment of the present invention.

The cache coherency controller is responsible for coherency between the L1 caches of the multicore CPU and the GPU, for the memory-to-cache and cache-to-cache data transfers between cores required by the protocol, and for the L2 cache region used for prefetching, as shown in FIG. 6.

The cache coherency control unit is designed first as a structure for a single-core CPU and then as an extension of it. The coherency model for sharing over unified memory between a single-core CPU and the GPU is shown in FIG. 7.

The protocol for the state transitions of FIG. 7 is shown in FIG. 8. The protocol of FIG. 8 is basically based on data transfer between L1 caches. Since the CPU that delegates a task to the GPU is unlikely to access the data again while the GPU is working on it, coherency with the GPU is invalidation-based so that snooping is minimized. That is, not only the ownership of the data but also the cached data itself is copied, so only one copy of data shared with the GPU exists in the L1 caches.

However, the architecture for multicore CPUs and GPUs is more complicated because it must work with the coherency protocol between the CPUs. To this end, we extend the Dragon protocol based on MOESI.

FIG. 9 shows the definitions of the states required for the extended protocol. An RD state is added, along with an INV_REQ invalidation request. The RD state marks the case where the GPU has loaded data into its cache and is about to write to it. In addition, a condition is added to distinguish sharing among CPUs from sharing with the GPU; it is supplied by the address mapping unit described above, which sets the condition r to true for data accessed through its table. The coherency protocol designed with the states defined in FIG. 9 is shown in FIG. 10.
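
The full transition tables live in FIGS. 9 and 10, which are not reproduced here. As a sketch, the extended state set and the r condition might be encoded as follows, showing only the GPU-write transition the text describes (all names beyond RD, INV_REQ, and r are assumptions):

```cpp
// Extended MOESI state set sketched from the description: RD marks a
// line the GPU has loaded and intends to write; 'r' flags data the
// address mapping unit recognizes as shared with the GPU.
enum class LineState { Modified, Owned, Exclusive, Shared, Invalid, RD };

struct CoherentLine {
    LineState state = LineState::Invalid;
    bool r = false;  // set by the address mapping unit for GPU data
};

// When the GPU writes a line it shares with the CPU, an INV_REQ
// invalidates the CPU's copy instead of updating it, minimizing
// snoop/update traffic (invalidation-based sharing with the GPU).
void onGpuWrite(CoherentLine& gpuLine, CoherentLine& cpuLine) {
    if (gpuLine.r) {
        cpuLine.state = LineState::Invalid;   // INV_REQ to CPU caches
        gpuLine.state = LineState::Modified;  // sole up-to-date copy
    }
}
```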

In FIG. 10, the protocol is invalidation-based for data shared with the GPU, as in the single-core CPU case described above. That is, the GPU invalidates the CPU's shared cache lines, rather than updating them, for data belonging to the task delegated to the GPU, minimizing update traffic when the CPU shares that data.

The schematic structure of the cache coherency control unit embodying this protocol is shown in FIG. 6; it is roughly divided into three parts.

The first is a comparator that arbitrates the state changes of the protocol described above. The comparator receives addresses and line states from the L1 cache controllers of the GPU and the CPU and manages their states.

The second is the cache-to-cache data transfer unit, which transfers data between the L1 caches under the direction of the comparator when necessary.

The third is the L2 cache controller. It manages the L2 cache with the normal eviction rule and, upon request from the prefetcher described above, partitions the L2 into regions of the required size and performs the memory transfers needed to prefetch for the GPU.

FIG. 11 illustrates an expanded collaboration system between a CPU and a GPU according to an embodiment of the present invention. In the collaboration system shown in FIG. 11, two CPUs and a GPU share memory.

The collaboration-system structure described above can be extended from L2 sharing to sharing through an L3 cache, and from a single CPU to a collaboration structure between multiple CPUs and a GPU.

Each of the multiple CPUs and the GPU has its own L2 cache, and the L3 cache is shared. The task manager operates through its interface with the CPU as in the structure described above; however, the cache coherency controller must always be active so that memory can be shared among the CPUs.

The foregoing description is merely illustrative of the technical idea of the present invention, and various changes and modifications may be made without departing from its essential characteristics. The embodiments described herein are therefore intended to be illustrative rather than limiting, and the scope of the present invention is not limited by them. The present invention is intended to cover all modifications and variations that come within the scope of the appended claims and their equivalents.

Claims (15)

In a collaborative system between a CPU and a GPU,
A task manager for receiving a task requested by the CPU, requesting it of the GPU, and sending the task result processed by the GPU to the CPU; and
An address mapping unit for supporting mapping between the address space of the GPU and the address space of the main memory,
The task management unit
Receiving code information corresponding to a task requested by the CPU and address information of data necessary for performing the task from the CPU
Collaboration system between CPU and GPU.
delete
The apparatus of claim 1, wherein the task management unit
Loading a table mapping the address space of the GPU and address information of data required for the task into the address mapping unit
Collaboration system between CPU and GPU.
The apparatus of claim 1, wherein the task management unit
Connected to the CPU with the same interface as the coprocessor interface
Collaboration system between CPU and GPU.
The apparatus of claim 1, wherein the task management unit
Distributing a task requested by the CPU to each core of the GPU, and monitoring the operation status of each core of the GPU
Collaboration system between CPU and GPU.
The system of claim 1, further comprising
A prefetcher that fetches, from the main memory into the cache memory, the data to be processed next after the data being processed by the GPU
Collaboration system between CPU and GPU.
The apparatus of claim 6, wherein the prefetcher
When an operation signal is received from the task management unit, fetches the data necessary for the GPU from the main memory into the cache memory and removes already-processed data from the cache memory
Collaboration system between CPU and GPU.
The system of claim 1, further comprising
A cache coherency controller for keeping the data stored in the cache memory of the CPU consistent with the data stored in the cache memory of the GPU
Collaboration system between CPU and GPU.
The apparatus of claim 8, wherein the task management unit
Checks whether the data stored in the cache memory of the CPU and the data stored in the cache memory of the GPU need to be kept consistent, and operates the cache coherency controller when consistency is required
Collaboration system between CPU and GPU.
A method of collaborating between a CPU and a GPU, comprising:
Receiving a task requested by the CPU and requesting it of the GPU;
Mapping the address space of the GPU to the address space of a main memory; and
Transferring the result processed by the GPU to the CPU,
Wherein receiving the task requested by the CPU and requesting it of the GPU includes
Receiving, from the CPU, code information corresponding to the task and address information of the data necessary for the task.
delete
The method of claim 10, wherein receiving the task requested by the CPU and requesting it of the GPU further includes
Distributing the received task across the cores of the GPU and monitoring the operation status of each GPU core
A method of collaborating between a CPU and a GPU.
The method of claim 10, wherein mapping the address space of the GPU to the address space of the main memory includes:
Generating a table mapping the address space of the GPU to the address information of the data necessary for the task; and
Translating the addresses of the GPU by referring to the table
A method of collaborating between a CPU and a GPU.
The method of claim 10, further comprising:
Identifying the data to be processed after the data being processed by the GPU; and
Fetching the identified data from the main memory into the cache memory
A method of collaborating between a CPU and a GPU.
The method of claim 10, further comprising
Operating the cache coherency control module to reconcile the data of the CPU and the data of the GPU when they need to be kept consistent
A method of collaborating between a CPU and a GPU.
KR1020130048061A 2013-04-30 2013-04-30 The Cooperation System and the Method between CPU and GPU KR101442643B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020130048061A KR101442643B1 (en) 2013-04-30 2013-04-30 The Cooperation System and the Method between CPU and GPU


Publications (1)

Publication Number Publication Date
KR101442643B1 true KR101442643B1 (en) 2014-09-19

Family

ID=51760683

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020130048061A KR101442643B1 (en) 2013-04-30 2013-04-30 The Cooperation System and the Method between CPU and GPU

Country Status (1)

Country Link
KR (1) KR101442643B1 (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20040106472A (en) * 2002-05-08 2004-12-17 인텔 코오퍼레이션 Method and system for optimally sharing memory between a host processor and graphic processor
JP2011175624A (en) * 2009-12-31 2011-09-08 Intel Corp Sharing resources between cpu and gpu

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018009267A1 (en) * 2016-07-06 2018-01-11 Intel Corporation Method and apparatus for shared virtual memory to manage data coherency in a heterogeneous processing system
US11921635B2 (en) 2016-07-06 2024-03-05 Intel Corporation Method and apparatus for shared virtual memory to manage data coherency in a heterogeneous processing system
CN108459912A (en) * 2018-04-10 2018-08-28 郑州云海信息技术有限公司 A kind of last level cache management method and relevant apparatus
CN108459912B (en) * 2018-04-10 2021-09-17 郑州云海信息技术有限公司 Last-level cache management method and related device
CN108959165A (en) * 2018-06-28 2018-12-07 郑州云海信息技术有限公司 A kind of management system of GPU whole machine cabinet cluster


Legal Events

E701: Decision to grant or registration of patent right
GRNT: Written decision to grant
FPAY: Annual fee payment (payment date: 20170629; year of fee payment: 4)
FPAY: Annual fee payment (payment date: 20180627; year of fee payment: 5)
FPAY: Annual fee payment (payment date: 20190806; year of fee payment: 6)