US20170235671A1 - Computing device, data transfer method between coprocessor and non-volatile memory, and computer-readable recording medium - Google Patents
- Publication number
- US20170235671A1
- Authority
- US
- United States
- Prior art keywords
- memory
- coprocessor
- data
- cpu
- gpu
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/06—Addressing a physical block of locations, e.g. base addressing, module addressing, memory dedication
- G06F12/0615—Address space extension
- G06F12/063—Address space extension for I/O modules, e.g. memory mapped I/O
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/40—Bus structure
- G06F13/4063—Device-to-bus coupling
- G06F13/4068—Electrical coupling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
- G06F2212/1024—Latency reduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/20—Employing a main memory using a specific memory technology
- G06F2212/205—Hybrid memory, e.g. using both volatile and non-volatile memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/20—Employing a main memory using a specific memory technology
- G06F2212/206—Memory mapped I/O
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/21—Employing a record carrier using a specific recording technology
- G06F2212/214—Solid state disk
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/25—Using a specific main memory architecture
- G06F2212/251—Local memory within processor subsystem
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/72—Details relating to flash memory management
- G06F2212/7201—Logical to physical mapping or translation of blocks or pages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/72—Details relating to flash memory management
- G06F2212/7202—Allocation control and policies
Definitions
- the described technology relates to a computing device, a data transfer method between a coprocessor and a non-volatile memory, and a computer-readable recording medium.
- Data processing coprocessors with high computation parallelism and comparatively low power consumption are becoming increasingly popular.
- One example of the coprocessor is a graphic processing unit (GPU).
- many processing cores share execution control and can perform identical operations on numerous pieces of data via thread-level parallelism and data-level parallelism.
- a system using the coprocessor together with a central processing unit (CPU) can exhibit significant speedups compared to a CPU-only system.
- the coprocessors can process more data than ever before, and the volume of such data is expected to keep increasing.
- the coprocessors employ on-board memory whose size is relatively small compared to a host memory.
- the coprocessors therefore use a non-volatile memory connected to a host machine to process large sets of data.
- the coprocessor and the non-volatile memory are completely disconnected from each other and are managed by different software stacks. Consequently, many redundant memory allocations/releases and data copies exist between a user-space and a kernel-space in order to read data from the non-volatile memory or write data to the non-volatile memory. Further, since a kernel module cannot directly access the user-space memory, memory management and data copy overheads between the kernel-space and the user-space are unavoidable. Furthermore, kernel-mode and user-mode switching overheads along with the data copies also contribute to long latency of data movements. These overheads cause the speedup to be insignificant compared to the coprocessor performance.
- An embodiment of the present invention provides a computing device, a data transfer method between a coprocessor and a non-volatile memory, and a computer-readable recording medium for reducing overheads due to a data movement between a coprocessor and a non-volatile memory.
- a computing device including a CPU, a CPU memory for the CPU, a non-volatile memory, a coprocessor using the non-volatile memory, a coprocessor memory, and a recording medium.
- the coprocessor memory stores data to be processed by the coprocessor or data processed by the coprocessor.
- the recording medium includes a controller driver for the non-volatile memory and a library that are executed by the CPU.
- the controller driver maps the coprocessor memory to a system memory block of the CPU memory.
- the library moves data between the coprocessor and the non-volatile memory via the system memory block mapped to the coprocessor memory.
- the system memory block may include a memory-mapped register and a pinned memory space mapped to the coprocessor memory.
- the memory-mapped register may be managed for the non-volatile memory by the controller driver and may include a plurality of entries for pointing addresses of the pinned memory space.
- a start offset of the system memory block may be indicated by a base address register of an interface connecting the non-volatile memory with the CPU.
- Each entry may point to a logical block address of a space with a predetermined size in the pinned memory space, and the logical block address may be mapped to a physical block address of a space with a predetermined size in the coprocessor memory.
- the controller driver may transfer the data from the non-volatile memory to the space of the physical block address that is mapped to the logical block address pointed to by a corresponding entry.
- the non-volatile memory may be connected to the CPU through a non-volatile memory express (NVMe) protocol, and each entry may be a physical region page (PRP) entry.
- the non-volatile memory may be connected to the CPU through an advanced host controller interface (AHCI) protocol, and each entry may be a physical region descriptor table (PRDT) entry.
- the library may reside above an application and a native file system in a software stack.
- a method of transferring data between a coprocessor and a non-volatile memory in a computing device includes mapping a coprocessor memory for the coprocessor to a system memory block of a CPU memory for a CPU, and moving data between the coprocessor and the non-volatile memory via the system memory block mapped to the coprocessor memory.
- the system memory block may include a memory-mapped register and a pinned memory space mapped to the coprocessor memory.
- the memory-mapped register may be managed by a controller driver for the non-volatile memory and may include a plurality of entries for pointing addresses of the pinned memory space.
- a start offset of the system memory block may be indicated by a base address register of an interface connecting the non-volatile memory with the CPU.
- Each entry may point to a logical block address of a space with a predetermined size in the pinned memory space, and the logical block address may be mapped to a physical block address of a space with a predetermined size in the coprocessor memory.
- moving the data may include transferring the data from the non-volatile memory to the space of the physical block address that is mapped to the logical block address pointed to by a corresponding entry.
- the non-volatile memory may be connected to the CPU through a non-volatile memory express (NVMe) protocol, and each entry may be a physical region page (PRP) entry.
- the non-volatile memory may be connected to the CPU through an advanced host controller interface (AHCI) protocol, and each entry may be a physical region descriptor table (PRDT) entry.
- a computer-readable recording medium stores a program to be executed by a computing device including a CPU, a CPU memory for the CPU, a non-volatile memory, a coprocessor using the non-volatile memory, and a coprocessor memory configured to store data to be processed by the coprocessor or data processed by the coprocessor.
- the program includes a controller driver for the non-volatile memory configured to map the coprocessor memory to a system memory block of the CPU memory, and a library configured to move data between the coprocessor and the non-volatile memory via the system memory block mapped to the coprocessor memory.
- FIG. 1 schematically shows a computing device using a coprocessor and a non-volatile memory according to an embodiment of the present invention.
- FIG. 2 schematically shows a software stack for a GPU and an SSD in a conventional computing device.
- FIG. 3 schematically shows a GPU programming model on a software stack in a conventional computing device.
- FIG. 4 schematically shows a data movement between a GPU and an SSD in a conventional computing device.
- FIG. 5 shows performance degradation in a conventional computing device.
- FIG. 6 schematically shows a software stack for a GPU and an SSD in a computing device according to an embodiment of the present invention.
- FIG. 7 schematically shows a data movement between an SSD and a GPU through an NVMe protocol in a computing device according to an embodiment of the present invention.
- FIG. 8 schematically shows a data movement between an SSD and a GPU through an AHCI protocol in a computing device according to an embodiment of the present invention.
- FIG. 9 schematically shows a GPU programming model on a software stack of a computing device according to an embodiment of the present invention.
- FIG. 10 schematically shows a data movement between a GPU and an SSD in a computing device according to an embodiment of the present invention.
- FIG. 11 shows latency values in transferring file data for a GPU application.
- FIG. 12 shows execution times of a GPU application.
- The paper "NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures," in the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT 2015), is herein incorporated by reference.
- FIG. 1 schematically shows a computing device using a coprocessor and a non-volatile memory according to an embodiment of the present invention.
- FIG. 1 shows one example of the computing device, and the computing device according to an embodiment of the present invention may be implemented by use of various structures.
- a computing device includes a non-volatile memory 110 , a coprocessor 120 , and a CPU 130 .
- the coprocessor 120 may be a computer processor used to supplement functions of a primary processor such as a CPU.
- the non-volatile memory 110 may be, as a file input/output-based non-volatile memory, a computer memory that can retrieve stored information even after having been power cycled (turned off and back on).
- the GPU 120 and the SSD 110 are connected to the CPU 130 via chipsets of a mainboard.
- the computing device may further include a northbridge 140 and a southbridge 150 to connect the GPU 120 and the SSD 110 with the CPU 130 .
- the GPU 120 may be connected to the northbridge 140, which is located at the CPU side, and may access a GPU-side memory (hereinafter referred to as a "GPU memory") 121 via a high-performance PCIe (peripheral component interconnect express) link.
- the SSD 110 may be connected to the southbridge 150, which is located at the PCI slot side of the mainboard, via a PCIe link or a thin storage interface such as serial AT attachment (SATA).
- the northbridge 140 is also called a memory controller hub (MCH), and the southbridge 150 is also called an input/output controller hub (ICH).
- the computing device further includes a CPU-side memory (hereinafter referred to as a "CPU memory") 131 corresponding to a system memory for the CPU 130 .
- the CPU memory 131 may be a random access memory (RAM), particularly a dynamic RAM (DRAM).
- a system including the CPU 130 , the CPU memory 131 , the northbridge 140 , and the southbridge 150 may be called a host machine.
- FIG. 2 schematically shows a software stack for a GPU and an SSD in a conventional computing device.
- the software stack for the GPU 120 and the SSD 110 in the conventional computing device may be divided into a user space 210 and a kernel space 220 .
- the user space 210 operates on a user-level CPU and may be a virtual memory area on which an operating system (OS) executes an application (for example, a GPU application) 200 .
- the kernel space 220 operates on a kernel-level CPU and may be a virtual memory area for running an OS kernel and a device driver.
- an input/output (I/O) runtime library 211 and a GPU runtime library 221 coexist on the same user space 210 and are both utilized by the GPU application 200 .
- the software stack may be divided into a storage software stack for the SSD 110 and a GPU software stack for the GPU 120 .
- SSD accesses and file services are managed by modules on the storage software stack and GPU-related activities including memory allocations and data transfers are managed by modules on the GPU software stack.
- the I/O runtime library 211 stores user-level contexts and jumps to a virtual file system (VFS) 212 .
- the virtual file system 212 is a kernel module in charge of managing standard file system calls.
- the virtual file system 212 selects an appropriate native file system 213 and initiates a file I/O request.
- the native file system 213 checks an actual physical location associated with the file I/O request, and composes a block level I/O service transaction by calling another function pointer that can be retrieved from a block-device-operation data structure.
- a disk driver 214 issues the I/O request to the SSD 110 .
- the disk driver 214 may issue the I/O request to the SSD 110 through a PCIe or AHCI (advanced host controller interface) controller.
- target data are returned to the GPU application 200 via the aforementioned modules 211 , 212 , 213 , and 214 , but in reverse order.
- a GPU runtime library 221 is mainly responsible for executing a GPU-kernel and copying data between the CPU memory 131 and the GPU memory 121 .
- the GPU runtime library 221 creates a GPU command at the user level and directly submits the GPU command with the target data to a kernel-side GPU driver 222 .
- the GPU driver 222 maps a kernel memory space, i.e., the CPU memory 131 to the GPU memory 121 or translates an address to a physical address of the GPU memory 121 .
- the GPU 120 facilitates a data movement between the CPU memory 131 and the GPU memory 121 .
- FIG. 3 schematically shows a GPU programming model on a software stack in a conventional computing device.
- the GPU application 200 first opens a file descriptor for read/write through an open( ) function.
- the GPU application 200 then allocates a virtual user memory to the CPU memory 131 through a malloc( ) function in order to read data from the SSD 110 or write data to the SSD 110 .
- the GPU application 200 allocates the GPU memory 121 for data transfers between the GPU 120 and the CPU 130 through a cudaMalloc( ) function.
- the GPU application 200 calls an I/O runtime library API by specifying the file descriptor and the address of the GPU memory 121 as prepared in the previous steps through a read( ) function.
- the GPU application 200 initiates the data transfer from the CPU memory 131 to the GPU memory 121 through a cudaMemcpy( ) function, and executes the GPU kernel through a kernel( ) function by calling the GPU runtime library with a specific number of threads and memory address pointers.
- the GPU application 200 may copy the result data to the virtual user memory of the CPU memory 131 from the GPU memory 121 through a cudaMemcpy( ) function, and sequentially write the data to the SSD 110 through a write( ) function. These processes may be repeated multiple times (loop). After all the processes are completed, the GPU application 200 cleans up the CPU memory and GPU memory allocations [cudaFree( )] and the file descriptor [close( )].
- FIG. 4 schematically shows a data movement between a GPU and an SSD in a conventional computing device.
- the GPU application 200 creates a file descriptor on the kernel for a read and/or a write (S 410 ).
- the GPU application 200 then allocates a virtual user memory to the CPU memory 131 for reading data from the SSD 110 or writing data to the SSD 110 (S 415 ).
- the GPU application 200 allocates GPU memory 121 for writing data to the GPU 120 or reading data from the GPU 120 (S 420 ).
- the GPU application 200 then requests a file read from the SSD 110 (S 425 ).
- the kernel space 220 allocates a physical memory to the CPU memory 131 and copies data for the file read from the virtual user memory to the physical memory (S 430 ), and requests the file data from the SSD 110 (S 435 ).
- the file data are transferred from the SSD 110 to the CPU memory 131 , i.e., the physical memory of the CPU memory 131 , and the file data are copied from the physical memory of the CPU memory 131 to the virtual user memory (S 440 ).
- the GPU application 200 then transfers the file data from the CPU memory 131 to the GPU memory 121 (S 445 ). Consequently, the GPU 120 processes the file data.
- In a case where the GPU application 200 needs to store a result that the GPU 120 has generated after processing the file data, the GPU application 200 transfers the result data from the GPU memory 121 to the virtual user memory of the CPU memory 131 (S 450 ). The GPU application 200 then requests a file write for the SSD 110 (S 455 ). The kernel space 220 allocates a physical memory to the CPU memory 131 and copies the result data from the virtual user memory to the physical memory (S 460 ), and transfers the result data from the physical memory of the CPU memory 131 to the SSD 110 (S 465 ).
- the GPU application 200 releases the virtual user memory of the CPU memory 131 allocated for the read and/or write (S 470 ), and releases the GPU memory 121 allocated for the write and/or read (S 475 ). Further, the GPU application 200 deletes the file descriptor created for the read and/or write in the kernel (S 480 ).
- the steps S 410 , S 415 , S 425 , S 430 , S 435 , S 455 , S 460 , and S 465 may be processes associated with the I/O runtime library, and the steps S 420 and S 445 may be processes associated with the GPU runtime library.
- the steps S 440 , S 470 , and S 480 may be responses of devices for the storage software stack, i.e., the SSD 110 and the CPU memory 131 , and the steps S 450 and S 475 may be responses of the GPU 120 .
- the application working on the user-level CPU needs to request the I/O or memory operations from the underlying kernel-level modules.
- a disk driver exchanges the file data between the SSD 110 and the GPU 120 , using the CPU memory 131 as an intermediate storage.
- the numerous hops introduce overheads in the data movement among the GPU 120 , the CPU 130 , and the SSD 110 , and further cause unnecessary activities, for example communication overheads, redundant data copies, and CPU intervention overheads. These may take as much as 4.21 times and 1.68 times, respectively, of the execution time taken by the GPU 120 and the SSD 110 . Accordingly, the processing speed of the GPU 120 , which can offer high bandwidth through parallelism, may be slowed down.
- Several protocols have been proposed to reduce such overheads; GPUDirect™ is one of them.
- GPUDirect™ supports a direct path for communication between the GPU and a peer high performance device using a standard PCIe interface. GPUDirect™ is typically used to handle peer-to-peer data transfers between multiple GPU devices. Further, GPUDirect™ offers non-uniform memory access (NUMA) and remote direct memory access (RDMA), which can be used for accelerating data communication with other devices such as a network device and a storage device.
- Although GPUDirect™ can be used for managing the GPU memory in transferring a large data set between the GPU and the SSD, it has shortcomings: i) all the SSD and GPU devices should use PCIe and should exist under the same root complex, ii) GPUDirect™ is incompatible with the aforementioned data transfer protocol in the conventional computing device, and iii) file data accesses should still pass through all the components in the storage software stack.
- the NVMe is a scalable and high performance interface for a non-volatile memory (NVM) system and offers an optimized register interface, command, and feature sets.
- the NVMe can accommodate standard-sized PCIe-based SSDs and SATA express (SATAe) SSDs connected to either the northbridge or the southbridge.
- unlike GPUDirect™, the NVMe does not require the SSD and the GPU to exist under the same root complex.
- an embodiment of the present invention may allow a system memory block of the NVMe, referred to as a physical region page (PRP), to be shared by the SSD 110 and the GPU 120 .
- the AHCI is an advanced storage interface that employs both SATA and PCIe links in the southbridge.
- the AHCI defines a system memory structure which allows the OS to move data from the CPU memory to the SSD without significant CPU intervention.
- the AHCI can expose high bandwidth of the underlying SSD to the northbridge controller through direct media interface (DMI) that shares many characteristics with PCIe.
- a system memory block of the AHCI is pointed to by a physical region descriptor (PRD) whose capabilities are similar to those of the PRP. Accordingly, an embodiment of the present invention may allow the system memory block of the AHCI to be shared by the SSD 110 and the GPU 120 .
- FIG. 6 schematically shows a software stack for a GPU and an SSD in a computing device according to an embodiment of the present invention.
- a software stack for a GPU 120 and an SSD 110 may be divided into a user space 610 and a kernel space 620 .
- the user space 610 operates on a user-level CPU and may be a virtual area on which an OS executes an application (for example, a GPU application) 600 .
- the kernel space 620 operates on a kernel-level CPU and may be a virtual memory area for running an OS kernel and a device driver.
- a GPU software stack and an SSD software stack are unified via kernel components in the kernel space.
- the kernel components include a library 621 and a controller driver 622 .
- the library 621 and controller driver 622 may be collectively referred to as a non-volatile memory management unit (NVMMU).
- the NVMMU may be a program to be executed by the CPU 130 , which may be stored in a computer-readable recording medium.
- the computer-readable recording medium may be a non-transitory recording medium.
- the library 621 may be referred to as a unified interface library (UIL) because it is an interface library for unifying the SSD software stack and the GPU software stack.
- the controller driver 622 may be referred to as a non-volatile direct memory access (NDMA) because it makes a coprocessor directly access a non-volatile memory.
- the library 621 and the controller driver 622 are referred to as the UIL and the NDMA, respectively, for convenience.
- the UIL 621 may be a virtual file system driver for directly transferring data between the SSD 110 and the GPU 120 .
- the UIL 621 directly transfers target data from the SSD 110 to a GPU memory 121 or from the GPU memory 121 to the SSD 110 via a system memory block (kernel buffer) mapped to the GPU memory 121 .
- the UIL 621 may reside on top of a native file system and may read/write target file contents from the native file system via the system memory block. That is, the UIL 621 may handle a file access and a memory buffer that the NDMA 622 provides by overriding a conventional virtual file system switch.
- the UIL 621 can remove the unnecessary user mode and kernel mode switching overheads between the user space and the kernel space. Further, the UIL 621 may not use a user-level memory and may not copy the data between the user space and the kernel space during the data movement between the GPU 120 and the CPU 130 .
- the NDMA 622 may be a control driver which modifies a disk controller driver that manages a file read/write of the SSD 110 .
- the NDMA 622 manages a physical memory mapping which is shared by the SSD 110 and the GPU 120 for the data movement between the SSD 110 and the GPU 120 . That is, the NDMA 622 manages a memory mapping between the GPU memory 121 and the system memory block.
- the mapped system memory block may be exposed to the UIL 621 .
- the UIL 621 may recompose user data of an I/O request using the system memory block if the I/O request is related to a data transfer between the GPU 120 and the SSD 110 . Otherwise, the UIL 621 may bypass the I/O request to the underlying kernel module (i.e., the native file system).
- a mapping method in the NDMA 622 may be reconfigured based on an interface or controller employed (for example, NVMe or AHCI).
- the mapping method in the NDMA 622 is described using various interfaces or controllers.
- FIG. 7 schematically shows a data movement between an SSD and a GPU through an NVMe protocol in a computing device according to an embodiment of the present invention.
- an NDMA 622 uses a system memory block 700 mapped to a GPU memory 121 .
- the system memory block 700 is a kernel buffer allocated to a CPU memory 131 , and includes a memory-mapped register 710 and a GPU pinned memory space 720 .
- the memory-mapped register 710 is a register which a disk driver controller (for example, an NVMe controller) for an SSD 110 manages, and the GPU pinned memory space 720 is a space mapped to the GPU memory 121 .
- the memory-mapped register 710 includes I/O submission queues (an I/O submission region) 711 of the NVMe SSD 110 , and a start offset of the memory-mapped register 710 may be indicated by a base address register (BAR) of the PCIe.
- a submission command 711 a may be input to the I/O submission queue 711 , and the submission command 711 a may have various items. Each item may have two physical region page (PRP) entries, a PRP1 entry and a PRP2 entry.
- Each of the PRP1 entry and the PRP2 entry points to a physical page of the GPU memory 121 for the data movement between the SSD 110 and the GPU 120 .
- the NDMA 622 may map block addresses of the GPU pinned memory 720 to block addresses of the GPU memory 121 in the system memory block 700 .
- each of the PRP1 entry and the PRP2 entry may point to a logical block address (LBA) mapped to a space (i.e., a memory block) with a predetermined size in the GPU pinned memory 720 .
- the logical block address is a device-visible virtual address and indicates a predetermined space in the system memory block 700 .
- an address, i.e., a physical block address (PBA) of the space with the predetermined size in the GPU memory 121 that is mapped to the logical block address, can be pointed to automatically.
- the PRP1 entry may directly point to a memory block of the system memory block 700 , and the PRP2 entry may point to a PRP list.
- the PRP list may include one or more PRP entries, each pointing to a memory block.
- each PRP entry may point to a memory block with a predetermined size, for example 4 KB.
- In a case where the amount of data to be transferred between the SSD 110 and the GPU 120 is greater than 4 KB, the remaining memory blocks may be referenced by the pointers on the PRP list, which is indicated by the PRP2 entry.
- When data are transferred from the GPU 120 to the SSD 110 , the NDMA 622 generates the PRP1 entry for pointing to the logical block address of the system memory block 700 , which is mapped to the GPU memory 121 including the data to be transferred to the SSD 110 .
- In a case where the amount of data to be transferred to the SSD 110 is greater than 4 KB, the NDMA 622 generates the PRP entries for pointing to the logical block addresses of the system memory block 700 , which are mapped to the GPU memory 121 including the remaining data, and generates the PRP2 entry for pointing to the PRP list including these PRP entries. Since the NDMA 622 exports such allocated memory spaces to the UIL, it can directly move the data from the GPU memory 121 to the SSD 110 .
- When data are transferred from the SSD 110 to the GPU 120 , the NDMA 622 generates the PRP1 entry for pointing to the logical block address of the system memory block 700 , which is mapped to the GPU memory 121 for writing the data to be transferred to the GPU 120 . In a case where the amount of data to be transferred to the GPU 120 is greater than 4 KB, the NDMA 622 generates the PRP entries for pointing to the logical block addresses of the system memory block 700 , which are mapped to the GPU memory 121 for writing the remaining data, and generates the PRP2 entry for pointing to the PRP list including these PRP entries. Since the NDMA 622 exports such allocated memory spaces to the UIL, it can directly move the data from the SSD 110 to the GPU memory 121 .
- the memory-mapped register 710 may further include a control register set above the I/O submission region 711 .
- the control register set may start from the BAR.
- the control register set may be used for managing NVMe tasks such as doorbell register updates and interrupt management.
- the memory-mapped register 710 may further include I/O completion queues (an I/O completion region) below the I/O submission region 711 and a data region below the I/O completion region.
- the GPU application 600 notifies the disk drive controller of the submission command by using the doorbell register of the control register set, and the disk drive controller fetches the submission command from the I/O submission queue and processes it.
- the submission command including the PRP entries may be transferred to the disk drive controller and used for reads/writes of the SSD 110 .
- the disk drive controller can transfer the data of the SSD 110 to the GPU memory 121 pointed to by the PRP entries in the submission command, or transfer the data of the GPU memory 121 pointed to by the PRP entries to the SSD 110 .
- the NDMA 622 can directly upload or download the GPU data while letting the other kernel components serve file-related work such as LBA translation in an appropriate manner. Since the kernel buffers of the NDMA 622 are managed as a pre-allocated memory pool, they may not be released until all data movement activities involving the file data are over. To implement this, an interrupt service routine (ISR) registered at the driver's NVMe initialization time may be modified.
- the AHCI has a different data management structure but employs a similar strategy for the data transfer between the GPU and the SSD.
- FIG. 8 schematically shows a data movement between an SSD and a GPU through an AHCI protocol in a computing device according to an embodiment of the present invention.
- an NDMA 622 uses a system memory block 800 mapped to a GPU memory 121 .
- the system memory block 800 is a kernel buffer allocated to a CPU memory 131 , and includes a memory-mapped register 810 and a GPU pinned memory space 820 .
- the memory-mapped register 810 is a register which a disk driver controller (for example, an AHCI controller) for an SSD 110 manages, and the GPU pinned memory space 820 is a space mapped to the GPU memory 121 .
- the memory-mapped register 810 includes a generic host control 811 and multiple port registers 812 , and a start offset of the memory-mapped register 810 may be indicated by an AHCI base address register (ABAR).
- the multiple port registers 812 indicate a plurality of ports, and each port may represent an individual SSD in an SSD array.
- the multiple port registers 812 include two meta-data structures 812 a and 812 b for each port.
- the two meta-data structures 812 a and 812 b include a command list 812 a and a received FIS (frame information structure) structure 812 b .
- the command list 812 a includes a plurality of command headers, for example 32 command headers.
- the received FIS 812 b is used for handshaking control such as a device-to-host (D2H) acknowledge FIS, and each command header refers to a physical region descriptor table (PRDT).
- each PRDT entry points to a system memory block managed by the NDMA 622 .
- Each PRDT entry may point to a logical block address corresponding to addresses of the GPU pinned memory 820 .
- a maximum buffer size of each PRDT entry may be 4 MB.
- the buffer may be split into multiple physical pages with a predetermined size (for example, multiple 4 KB physical pages) to make them compatible with a PRP management policy employed by the GPU 120 .
- interrupts delivered by the FIS are converted to a PCIe interrupt packet which allows the NDMA 622 to manage an interrupt service routine (ISR) in a similar fashion to what is done in the NVMe.
- FIG. 9 schematically shows a GPU programming model on a software stack of a computing device according to an embodiment of the present invention.
- a GPU application 200 creates a file descriptor for initializing a UIL 621 and an NDMA 622 .
- the GPU application 200 may use, for example, an nvmmuBegin( ) function to create the file descriptor and initialize the UIL 621 and the NDMA 622 .
- a thread ID (tid) of a requester and a file name (w_filename) to be moved may be, as parameters, input to the nvmmuBegin( ) function like nvmmuBegin(tid, w_filename).
- the nvmmuBegin( ) function may keep the thread id (tid) of the requester for internal resource management, and may send piggyback information about parity block pipelining before starting the movement of the file data.
- the GPU application 200 allocates a GPU memory 121 for read/write of data.
- the GPU application 200 may use, for example, a cudaMalloc( ) function.
- an address (&pGPUInP2P) of the GPU memory for writing the data and an amount (nImageDataSize) of the data to be written may be, as parameters, input to the cudaMalloc( ) function like cudaMalloc(&pGPUInP2P, nImageDataSize).
- an address (&pGPUOutP2P) of the GPU memory for reading the data and an amount (nImageDataSize) of the data to be read may be, as parameters, input to the cudaMalloc( ) function like cudaMalloc(&pGPUOutP2P, nImageDataSize).
- the GPU application 200 moves data by specifying a file name, an offset, and a number of bytes (length) of the data to be transferred from the SSD 110 to the GPU 120 .
- the GPU application 200 may call, for example, a nvmmuMove( ) function for the data movement.
- the nvmmuMove( ) function may create a data path between the SSD 110 and the GPU 120 based on the allocated addresses of the GPU memory 121 and the PRP entries pointing to the addresses of the GPU memory 121 , and may move the data taking into account the file name, the offset, and the amount of data.
- the file name (r_filename) of the data, the address (pGPUInP2P) of the GPU memory 121 for writing the data, offset 0, the amount of data (nImageDataSize), and a data movement direction (D2H) may be, as parameters, input to the nvmmuMove( ) function like nvmmuMove(r_filename, pGPUInP2P, 0, nImageDataSize, D2H).
- the D2H parameter indicates a device-to-host direction, i.e., the data movement from the SSD 110 to the GPU 120 .
- the GPU application 200 executes a GPU kernel .
- the GPU application 200 may call, for example, a kernel( ) function.
- the GPU application 200 moves the result data by specifying a file name, an offset, and the number of bytes (length) of the data to be transferred from the GPU 120 to the SSD 110 .
- the GPU application 200 may call, for example, a nvmmuMove( ) function for the data movement.
- the file name (r_filename) of the data, the address (pGPUOutP2P) of the GPU memory 121 for reading the data, offset 0, the amount of data (nImageDataSize), and a data movement direction (H2D) may be, as parameters, input to the nvmmuMove( ) function like nvmmuMove(r_filename, pGPUOutP2P, 0, nImageDataSize, H2D).
- the H2D parameter indicates a host-to-device direction, i.e., the data movement from the GPU 120 to the SSD 110 .
- the GPU application 200 cleans up resources which the UIL 621 and the NDMA 622 use for the thread.
- the GPU application 200 may clean up the resources through, for example, an nvmmuEnd( ) function.
- the thread ID (tid) may be, as a parameter, input to the nvmmuEnd( ) function like nvmmuEnd(tid).
- FIG. 10 schematically shows a data movement between a GPU and an SSD in a computing device according to an embodiment of the present invention.
- a GPU application 200 creates on a kernel a file descriptor for a read and/or a write (S 1010 ).
- the GPU application 200 then allocates a GPU memory 121 for writing data to the GPU 120 or reading data from the GPU 120 (S 1020 ). Accordingly, physical block addresses of the allocated GPU memory 121 are mapped to logical block addresses of a system memory block associated with addresses of the SSD 110 .
- the GPU application 200 requests a file read from the SSD 110 (S 1030 ). Then, the file data are transferred from the SSD 110 to the GPU memory 121 through mappings of the system memory block (S 1040 ). Consequently, the GPU 120 processes the file data.
- the GPU application 200 requests a file write for the GPU 120 (S 1050 ). Then, the file data are transferred from the GPU memory 121 to the SSD 110 through mappings of the system memory block (S 1060 ).
- the steps S 1010 , S 1020 , S 1030 , and S 1050 may be processes associated with the NVMMU.
- the step S 1040 may be a response of the SSD 110
- the step S 1060 may be a response of the GPU 120 .
- a data transfer method described above may be applied to a redundant array of independent disks (RAID)-based SSD array.
- In the RAID-based SSD array, a software-based array controller driver may be modified to abstract multiple SSDs as a single virtual storage device. Since a GPU has neither an OS nor resource management capabilities, a host-side GPU application may in practice have all of the information regarding file data movement, such as a target data size, a file location, and timing for data download, prior to beginning GPU-kernel execution.
- the nvmmuBegin( ) function may pass a file name to be downloaded from the SSD 110 to the UIL 621 , and the UIL 621 may feed this information to an array controller driver, i.e., the NDMA 622 .
- the array controller driver may read an old version of the target file data and the corresponding parity blocks at an early stage of GPU body code-segments using the information. Consequently, the array controller driver may load the old data and prepare new parity blocks while the GPU 120 and the CPU 130 prepare for a data movement and execution of the GPU kernel.
- This parity block pipelining strategy can enable all parity block preparations to be done in parallel with performing a data movement between the GPU 120 and the CPU 130 and/or executing GPU-kernel. Accordingly, performance degradation exhibited by conventional RAID systems can be eliminated.
- a UIL-assisted GPU application can be compiled just like a normal GPU program, so no compiler modification is required. Further, the computing device can still use all functionality of the I/O runtime and GPU runtime libraries, which means that the NVMMU is fully compatible with all existing GPU applications.
- FIG. 11 shows latency values in transferring file data for a GPU application.
- FIG. 12 shows execution times of a GPU application.
- an NVMMU using an NVMe protocol reduces latency values of data movement, compared to an NVMe-IOMMU, by 202%, 70%, 112% and 108%, for PolyBench, Mars, Rodinia and Parboil benchmarks, respectively.
- the NVMe-IOMMU means a memory management unit that uses an NVMe protocol in a conventional computing device as described with reference to FIG. 2 to FIG. 5 . As shown in FIG. 12 ,
- the NVMe-NVMMU reduces the application execution times, compared to the NVMe-IOMMU, by 192%, 14%, 69% and 37%, for PolyBench, Mars, Rodinia and Parboil benchmarks, respectively.
- the NVMMU can reduce the redundant memory copies and the user-mode and kernel-mode switching overheads as described above.
Description
- This application claims priority to and the benefit of Korean Patent Application No. 10-2016-0017233 filed in the Korean Intellectual Property Office on Feb. 15, 2016, the entire contents of which are incorporated herein by reference.
- (a) Field of the Invention
- The described technology relates to a computing device, a data transfer method between a coprocessor and a non-volatile memory, and a computer-readable recording medium.
- (b) Description of the Related Art
- Data processing coprocessors with high computation parallelism and comparatively low power consumption are becoming increasingly popular. One example of the coprocessor is a graphic processing unit (GPU). In such a coprocessor, many processing cores share execution control and can perform identical operations on numerous pieces of data via thread-level parallelism and data-level parallelism. A system using the coprocessor together with a central processing unit (CPU) can exhibit significant speedups compared to a CPU-only system.
- The coprocessors can process more data than they ever have before, and the volume of such data is expected to keep growing. However, the coprocessors employ on-board memory whose size is relatively small compared to a host memory. The coprocessors therefore use a non-volatile memory connected to a host machine to process large sets of data.
- However, the coprocessor and the non-volatile memory are completely disconnected from each other and are managed by different software stacks. Consequently, many redundant memory allocations/releases and data copies exist between a user-space and a kernel-space in order to read data from the non-volatile memory or write data to the non-volatile memory. Further, since a kernel module cannot directly access the user-space memory, memory management and data copy overheads between the kernel-space and the user-space are unavoidable. Furthermore, kernel-mode and user-mode switching overheads along with the data copies also contribute to long latency of data movements. These overheads cause the speedup improvement to be insignificant compared to the coprocessor performance.
- An embodiment of the present invention provides a computing device, a data transfer method between a coprocessor and a non-volatile memory, and a computer-readable recording medium for reducing overheads due to a data movement between a coprocessor and a non-volatile memory.
- According to an embodiment of the present invention, a computing device including a CPU, a CPU memory for the CPU, a non-volatile memory, a coprocessor using the non-volatile memory, a coprocessor memory, and a recording medium is provided. The coprocessor memory stores data to be processed by the coprocessor or data processed by the coprocessor. The recording medium includes a controller driver for the non-volatile memory and a library that are executed by the CPU. The controller driver maps the coprocessor memory to a system memory block of the CPU memory. The library moves data between the coprocessor and the non-volatile memory via the system memory block mapped to the coprocessor memory.
- The system memory block may include a memory-mapped register and a pinned memory space mapped to the coprocessor memory. The memory-mapped register may be managed for the non-volatile memory by the controller driver and may include a plurality of entries for pointing to addresses of the pinned memory space.
- A start offset of the system memory block may be indicated by a base address register of an interface connecting the non-volatile memory with the CPU.
- Each entry may point to a logical block address of a space with a predetermined size in the pinned memory space, and the logical block address may be mapped to a physical block address of a space with a predetermined size in the coprocessor memory.
- When the coprocessor reads data from the non-volatile memory, the controller driver may transfer the data from the non-volatile memory to the space of the physical block address that is mapped to the logical block address pointed to by a corresponding entry.
- The non-volatile memory may be connected to the CPU through a non-volatile memory express (NVMe) protocol, and each entry may be a physical region page (PRP) entry.
- The non-volatile memory may be connected to the CPU through an advanced host controller interface (AHCI) protocol, and each entry may be a physical region descriptor table (PRDT) entry.
- The library may reside above an application and a native file system in a software stack.
- According to another embodiment of the present invention, a method of transferring data between a coprocessor and a non-volatile memory in a computing device is provided. The method includes mapping a coprocessor memory for the coprocessor to a system memory block of a CPU memory for a CPU, and moving data between the coprocessor and the non-volatile memory via the system memory block mapped to the coprocessor memory.
- The system memory block may include a memory-mapped register and a pinned memory space mapped to the coprocessor memory. The memory-mapped register may be managed by a controller driver for the non-volatile memory and may include a plurality of entries for pointing to addresses of the pinned memory space.
- A start offset of the system memory block may be indicated by a base address register of an interface connecting the non-volatile memory with the CPU.
- Each entry may point to a logical block address of a space with a predetermined size in the pinned memory space, and the logical block address may be mapped to a physical block address of a space with a predetermined size in the coprocessor memory.
- When the coprocessor reads data from the non-volatile memory, moving the data may include transferring the data from the non-volatile memory to the space of the physical block address that is mapped to the logical block address pointed to by a corresponding entry.
- The non-volatile memory may be connected to the CPU through a non-volatile memory express (NVMe) protocol, and each entry may be a physical region page (PRP) entry.
- The non-volatile memory may be connected to the CPU through an advanced host controller interface (AHCI) protocol, and each entry may be a physical region descriptor table (PRDT) entry.
- According to yet another embodiment of the present invention, a computer-readable recording medium is provided. The computer-readable recording medium stores a program to be executed by a computing device including a CPU, a CPU memory for the CPU, a non-volatile memory, a coprocessor using the non-volatile memory, and a coprocessor memory configured to store data to be processed by the coprocessor or data processed by the coprocessor. The program includes a controller driver for the non-volatile memory configured to map the coprocessor memory to a system memory block of the CPU memory, and a library configured to move data between the coprocessor and the non-volatile memory via the system memory block mapped to the coprocessor memory.
- FIG. 1 schematically shows a computing device using a coprocessor and a non-volatile memory according to an embodiment of the present invention.
- FIG. 2 schematically shows a software stack for a GPU and an SSD in a conventional computing device.
- FIG. 3 schematically shows a GPU programming model on a software stack in a conventional computing device.
- FIG. 4 schematically shows a data movement between a GPU and an SSD in a conventional computing device.
- FIG. 5 shows performance degradation in a conventional computing device.
- FIG. 6 schematically shows a software stack for a GPU and an SSD in a computing device according to an embodiment of the present invention.
- FIG. 7 schematically shows a data movement between an SSD and a GPU through an NVMe protocol in a computing device according to an embodiment of the present invention.
- FIG. 8 schematically shows a data movement between an SSD and a GPU through an AHCI protocol in a computing device according to an embodiment of the present invention.
- FIG. 9 schematically shows a GPU programming model on a software stack of a computing device according to an embodiment of the present invention.
- FIG. 10 schematically shows a data movement between a GPU and an SSD in a computing device according to an embodiment of the present invention.
- FIG. 11 shows latency values in transferring file data for a GPU application.
- FIG. 12 shows execution times of a GPU application.
- In the following detailed description, only certain embodiments of the present invention have been shown and described, simply by way of illustration. As those skilled in the art would realize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature and not restrictive. Like reference numerals designate like elements throughout the specification.
- The disclosure of the inventor's treatise, "NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures," in the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015, is herein incorporated by reference.
- FIG. 1 schematically shows a computing device using a coprocessor and a non-volatile memory according to an embodiment of the present invention. FIG. 1 shows one example of the computing device, and the computing device according to an embodiment of the present invention may be implemented by use of various structures.
- Referring to FIG. 1 , a computing device according to an embodiment of the present invention includes a non-volatile memory 110 , a coprocessor 120 , and a CPU 130 .
- While it is described in an embodiment of the present invention that a graphic processing unit (GPU) and a solid state disk (SSD) are examples of the coprocessor 120 and the non-volatile memory 110 , the present invention is not limited thereto. The coprocessor 120 may be a computer processor used to supplement functions of a primary processor such as a CPU. The non-volatile memory 110 may be, as a file input/output-based non-volatile memory, a computer memory that can retrieve stored information even after having been power cycled (turned off and back on).
- The GPU 120 and the SSD 110 are connected to the CPU 130 via chipsets of a mainboard. The computing device may further include a northbridge 140 and a southbridge 150 to connect the GPU 120 and the SSD 110 with the CPU 130 .
- The GPU 120 may be connected to the northbridge 140 , which is located at the CPU side, and may access a GPU-side memory (hereinafter referred to as a "GPU memory") 121 via a high-performance PCIe (peripheral component interconnect express) link. The SSD 110 may be connected to the southbridge 150 , which is located at the PCI slot side on the mainboard, via a PCIe link or a thin storage interface such as serial AT attachment (SATA). The northbridge 140 is also called a memory controller hub (MCH), and the southbridge 150 is also called an input/output controller hub (ICH).
- Even though the GPU 120 and the SSD 110 can offer extremely high bandwidth compared with other external devices, they are treated like conventional peripheral devices from a CPU viewpoint. Therefore, conventional computing devices use data transfer protocols between the peripheral devices to transfer data between the GPU 120 and the SSD 110 . That is, conventional computing devices can transfer the data between the CPU 130 and the GPU 120 and/or between the CPU 130 and the SSD 110 through a memory copy technique, but cannot directly forward the data between the GPU 120 and the SSD 110 . The computing device further includes a CPU-side memory (hereinafter referred to as a "CPU memory") 131 corresponding to a system memory for the copy on the CPU 130 . For example, the CPU memory 131 may be a random access memory (RAM), particularly a dynamic RAM (DRAM).
- In some embodiments, a system including the CPU 130 , the CPU memory 131 , the northbridge 140 , and the southbridge 150 may be called a host machine.
- First, a data movement between a GPU and an SSD 110 in a conventional computing device is described with reference to FIG. 2 to FIG. 5 .
- FIG. 2 schematically shows a software stack for a GPU and an SSD in a conventional computing device.
- Referring to FIG. 2 , the software stack for the GPU 120 and the SSD 110 in the conventional computing device may be divided into a user space 210 and a kernel space 220 . The user space 210 operates on a user-level CPU and may be a virtual memory area on which an operating system (OS) executes an application (for example, a GPU application 200 ). The kernel space 220 operates on a kernel-level CPU and may be a virtual memory area for running an OS kernel and a device driver.
- Because of the different functionalities and purposes of the GPU 120 and the SSD 110 , there are two discrete libraries, i.e., an input/output (I/O) runtime library 211 and a GPU runtime library 221 , which coexist on the same user space 210 and are both utilized in the GPU application 200 .
- The software stack may be divided into a storage software stack for the SSD 110 and a GPU software stack for the GPU 120 . SSD accesses and file services are managed by modules on the storage software stack, and GPU-related activities including memory allocations and data transfers are managed by modules on the GPU software stack.
- In the storage software stack, when the GPU application 200 calls the I/O runtime library 211 through an interface, for example a POSIX (portable operating system interface), the I/O runtime library 211 stores user-level contexts and jumps to a virtual file system (VFS) 212 . The virtual file system 212 is a kernel module in charge of managing standard file system calls. The virtual file system 212 selects an appropriate native file system 213 and initiates a file I/O request. Next, the native file system 213 checks an actual physical location associated with the file I/O request, and composes a block-level I/O service transaction by calling another function pointer that can be retrieved from a block-device-operation data structure. Finally, a disk driver 214 issues the I/O request to the SSD 110 . For example, the disk driver 214 may issue the I/O request to the SSD 110 through a PCIe or AHCI (advanced host controller interface) controller. When the I/O service is completed, target data are returned to the GPU application 200 via the aforementioned modules.
- In the GPU software stack, a GPU runtime library 221 is mainly responsible for executing a GPU kernel and copying data between the CPU memory 131 and the GPU memory 121 . Differently from the storage software stack, the GPU runtime library 221 creates a GPU command at the user level and directly submits the GPU command with the target data to a kernel-side GPU driver 222 . Depending on the GPU command, the GPU driver 222 maps a kernel memory space, i.e., the CPU memory 131 , to the GPU memory 121 or translates an address to a physical address of the GPU memory 121 . When the address translation or mapping is completed, the GPU 120 facilitates a data movement between the CPU memory 131 and the GPU memory 121 .
- Next, a GPU programming model on the software stack is described with reference to FIG. 3 .
- FIG. 3 schematically shows a GPU programming model on a software stack in a conventional computing device.
- Referring to FIG. 3 , the GPU application 200 first opens a file descriptor for read/write through an open( ) function. The GPU application 200 then allocates a virtual user memory to the CPU memory 131 through a malloc( ) function in order to read data from the SSD 110 or write data to the SSD 110 . Further, the GPU application 200 allocates the GPU memory 121 for data transfers between the GPU 120 and the CPU 130 through a cudaMalloc( ) function. Next, the GPU application 200 calls an I/O runtime library API by specifying the file descriptor and the address of the GPU memory 121 as prepared in the previous steps through a read( ) function. Once the target data are brought into the CPU memory 131 from the SSD 110 , the GPU application 200 initiates the data transfer from the CPU memory 131 to the GPU memory 121 through a cudaMemcpy( ) function, and executes the GPU kernel through a kernel( ) function by calling the GPU runtime library with a specific number of threads and memory address pointers. In a case where the GPU application 200 needs to store a result generated by the GPU 120 , the GPU application 200 may copy the result data to the virtual user memory of the CPU memory 131 from the GPU memory 121 through a cudaMemcpy( ) function, and sequentially write the data to the SSD 110 through a write( ) function. These processes may be repeated multiple times (loop). After all the processes are completed, the GPU application 200 cleans up the CPU memory and GPU memory allocations [cudaFree( )] and the file descriptor [close( )].
- Next, a procedure in which the GPU application 200 transfers data between the GPU 120 and the SSD 110 is described with reference to FIG. 4 .
FIG. 4 schematically shows a data movement between a GPU and an SSD in a conventional computing device. - Referring to
FIG. 4 , theGPU application 200 creates on a kernel a file descriptor for a read and/or a write (S410). TheGPU application 200 then allocates a virtual user memory to theCPU memory 131 for reading data from theSSD 110 or writing data to the SSD 110 (S415). TheGPU application 200 allocatesGPU memory 121 for writing data to theGPU 120 or reading data from the GPU 120 (S420). - The
GPU application 200 then requests a file read to for the SSD 110 (S425). Thekernel space 220 allocates a physical memory to theCPU memory 131 and copies data for the file read from the virtual user memory to the physical memory (S430), and request file data for the SSD 110 (S435). Then, the file data are transferred from theSSD 110 to theCPU memory 131, i.e., the physical memory of theCPU memory 131, and the file data are copied from the physical memory of theCPU memory 131 to the virtual user memory (S440). TheGPU application 200 then transfers the file data from theCPU memory 131 to the GPU memory 121 (S445). Consequently, theGPU 120 processes the file data. - In a case where the
GPU application 200 needs to store a result that theGPU 120 has generated after processing the file data, theGPU application 200 transfers the result data from theGPU memory 121 to the virtual user memory of the CPU memory 131 (S450). TheGPU application 200 then requests a file write for the SSD 110 (S455). Thekernel space 220 allocates a physical memory to theCPU memory 131 and copies the result data from the virtual user memory to the physical memory (S460), and transfers the result data from the physical memory of theCPU memory 131 to the SSD 110 (S465). - After completing all the processes, the
GPU application 200 releases the virtual user memory of theCPU memory 131 allocated for the read and/or write (S470), and releases theGPU memory 121 allocated for the write and/or read (S475). Further, theGPU application 200 deletes the file descriptor created for the read and/or write in the kernel (S480). - In
FIG. 4 , the steps S410, S415, S425, S430, S435, S455, S460, and S465 may be processes associated with the I/O runtime library, and the steps S420 and S445 may be processes associated with the GPU runtime library. The steps S440, S470, and S480 may be responses of devices for the storage software stack, i.e., theSSD 110 andCPU memory 131, and the step S450 and S475 may be responses of theGPU 120. - As such, the application working on the user-level CPU needs to request the I/O or memory operations from the underlying kernel-level modules. Once the modules are done with the file-related operations, a disk driver exchanges the file data between the
SSD 110 and theGPU 120, using theCPU memory 131 as an intermediate storage. In this case, as shown inFIG. 5 , the numerous hops can make overheads according to a data movement among theGPU 120, theCPU 130, and theSSD 110, and further make unnecessary activities, for example communication overheads, redundant data copies, and CPU intervention overheads. These may take as much as 4.21 times and 1.68 times, respectively, of CPU execution time taken by theGPU 120 and theSSD 130. Accordingly, the processing speed of theGPU 120 that can offer high bandwidth through the parallelism may be slowed down. - Data transfer protocols for reducing the data movement overheads between the
GPU 120 and the SSD 110 that can occur in the conventional computing device are being developed. GPUDirect™ is one of these protocols. - GPUDirect™ supports a direct path for communication between the GPU and a peer high-performance device using a standard PCIe interface. GPUDirect™ is typically used to handle peer-to-peer data transfers between multiple GPU devices. Further, GPUDirect™ offers non-uniform memory access (NUMA) and remote direct memory access (RDMA), which can be used for accelerating data communication with other devices such as a network device and a storage device. While GPUDirect™ can be used for managing the GPU memory in transferring a large data set between the GPU and the SSD, it has shortcomings: i) all the SSD and GPU devices should use PCIe and should exist under the same root complex, ii) GPUDirect™ is incompatible with the aforementioned data transfer protocol in the conventional computing device, and iii) file data accesses should still pass through all the components in the storage software stack. - Further, there are interface protocols such as non-volatile memory express (NVMe) and the advanced host controller interface (AHCI). - The NVMe is a scalable and high-performance interface for a non-volatile memory (NVM) system and offers an optimized register interface, command set, and feature set. The NVMe can accommodate standard-sized PCIe-based SSDs and SATA express (SATAe) SSDs connected to either the northbridge or the southbridge. As a consequence, the NVMe does not require the SSD and the GPU to exist under the same root complex, unlike GPUDirect™. While the NVMe is originally oriented towards managing data transfers between the CPU and the SSD, an embodiment of the present invention may allow a system memory block of the NVMe, referred to as a physical region page (PRP), to be shared by the SSD 110 and the GPU 120. - The AHCI is an advanced storage interface that employs both SATA and PCIe links in the southbridge. The AHCI defines a system memory structure which allows the OS to move data from the CPU memory to the SSD without significant CPU intervention. Unlike traditional host controller interfaces, the AHCI can expose the high bandwidth of the underlying SSD to the northbridge controller through the direct media interface (DMI), which shares many characteristics with PCIe. Further, a system memory block of the AHCI is pointed to by a physical region descriptor (PRD) whose capabilities are similar to those of the PRP. Accordingly, an embodiment of the present invention may allow the system memory block of the AHCI to be shared by the SSD 110 and the GPU 120. - Hereinafter, a data transfer method according to an embodiment of the present invention is described with reference to FIG. 6 to FIG. 11. - In the above-described conventional computing device, there is a problem in that the SSD and the GPU are completely disconnected from each other and are managed by different software stacks. Accordingly, many redundant memory allocations/releases and data copies exist between the user space and the kernel space on the SSD and GPU system stacks. Further, since the kernel module cannot directly access the user space, the memory management and data copy overheads between the kernel space and the user space are unavoidable. Furthermore, the kernel mode and user mode switching overheads along with the data copies contribute to the long latency of the data movements.
-
FIG. 6 schematically shows a software stack for a GPU and an SSD in a computing device according to an embodiment of the present invention. - Referring to
FIG. 6, in a computing device according to an embodiment of the present invention, a software stack for a GPU 120 and an SSD 110 may be divided into a user space 610 and a kernel space 620. The user space 610 operates on a user-level CPU and may be a virtual area on which an OS executes an application (for example, a GPU application) 600. The kernel space 620 operates on a kernel-level CPU and may be a virtual memory area for running an OS kernel and a device driver. - A GPU software stack and an SSD software stack are unified via kernel components in the kernel space. The kernel components include a library 621 and a controller driver 622. In some embodiments, the library 621 and the controller driver 622 may be collectively referred to as a non-volatile memory management unit (NVMMU). In some embodiments, the NVMMU may be a program to be executed by the CPU 130, which may be stored in a computer-readable recording medium. In some embodiments, the computer-readable recording medium may be a non-transitory recording medium. - In some embodiments, the library 621 may be referred to as a unified interface library (UIL) because it is an interface library for unifying the SSD software stack and the GPU software stack. In some embodiments, the controller driver 622 may be referred to as a non-volatile direct memory access (NDMA) because it makes a coprocessor directly access a non-volatile memory. Hereinafter, the library 621 and the controller driver 622 are referred to as the UIL and the NDMA, respectively, for convenience. - The UIL 621 may be a virtual file system driver for directly transferring data between the SSD 110 and the GPU 120. The UIL 621 directly transfers target data from the SSD 110 to a GPU memory 121, or from the GPU memory 121 to the SSD 110, via a system memory block (kernel buffer) mapped to the GPU memory 121. In some embodiments, the UIL 621 may reside on top of a native file system and may read/write target file contents from the native file system via the system memory block. That is, the UIL 621 may handle a file access and a memory buffer that the NDMA 622 provides by overriding a conventional virtual file system switch. - As a consequence, the UIL 621 can remove the unnecessary user mode and kernel mode switching overheads between the user space and the kernel space. Further, the UIL 621 may not use a user-level memory and may not copy the data between the user space and the kernel space during the data movement between the GPU 120 and the CPU 130. - The NDMA 622 may be a controller driver which modifies a disk controller driver that manages a file read/write of the SSD 110. The NDMA 622 manages a physical memory mapping which is shared by the SSD 110 and the GPU 120 for the data movement between the SSD 110 and the GPU 120. That is, the NDMA 622 manages a memory mapping between the GPU memory 121 and the system memory block. The mapped system memory block may be exposed to the UIL 621. The UIL 621 may recompose user data of an I/O request using the system memory block if the I/O request is related to a data transfer between the GPU 120 and the SSD 110. Otherwise, the UIL 621 may bypass the I/O request to the underlying kernel module (i.e., the native file system). - A mapping method in the NDMA 622 may be reconfigured based on the interface or controller employed (for example, NVMe or AHCI). The mapping method in the NDMA 622 is described below for various interfaces or controllers. - First, an example of an NVMe SSD is described with reference to
FIG. 7. -
FIG. 7 schematically shows a data movement between an SSD and a GPU through an NVMe protocol in a computing device according to an embodiment of the present invention. - Referring to
FIG. 7, an NDMA 622 uses a system memory block 700 mapped to a GPU memory 121. The system memory block 700 is a kernel buffer allocated in a CPU memory 131, and includes a memory-mapped register 710 and a GPU pinned memory space 720. The memory-mapped register 710 is a register managed by a disk controller (for example, an NVMe controller) for an SSD 110, and the GPU pinned memory space 720 is a space mapped to the GPU memory 121. - The memory-mapped register 710 includes I/O submission queues (an I/O submission region) 711 of the NVMe SSD 110, and a start offset of the memory-mapped register 710 may be indicated by a base address register (BAR) of the PCIe. A submission command 711a may be input to the I/O submission queue 711, and the submission command 711a may have various items. Each item may have two physical region pages (PRPs), a PRP1 entry and a PRP2 entry. - Each of the PRP1 entry and the PRP2 entry points to a physical page of the GPU memory 121 for the data movement between the SSD 110 and the GPU 120. In some embodiments, the NDMA 622 may map block addresses of the GPU pinned memory 720 to block addresses of the GPU memory 121 in the system memory block 700. In this case, each of the PRP1 entry and the PRP2 entry may point to a logical block address (LBA) mapped to a space (i.e., a memory block) with a predetermined size in the GPU pinned memory 720. The logical block address is a device-visible virtual address and indicates a predetermined space in the system memory block 700. Then, the address mapped to the logical block address, i.e., the physical block address (PBA) of the space with the predetermined size in the GPU memory 121, can be pointed to automatically. - In some embodiments, the PRP1 entry may directly point to a memory block of the system memory block 700, and the PRP2 entry may point to a PRP list. The PRP list may include one or more PRP entries, each pointing to a memory block. In this case, each PRP entry may point to a memory block with a predetermined size, for example a 4 KB memory block. In a case where the amount of data to be transferred between the SSD 110 and the GPU 120 is greater than 4 KB, the data may be referred to by the pointers on the PRP list which is indicated by the PRP2 entry. - Accordingly, when data are transferred from the GPU 120 to the SSD 110, the NDMA 622 generates the PRP1 entry pointing to the logical block address of the system memory block 700, which is mapped to the GPU memory 121 including the data to be transferred to the SSD 110. In a case where the amount of data to be transferred to the SSD 110 is greater than 4 KB, the NDMA 622 generates the PRP entries pointing to the logical block addresses of the system memory block 700, which are mapped to the GPU memory 121 including the remaining data, and generates the PRP2 entry pointing to the PRP list including these PRP entries. Since the NDMA 622 exports the allocated memory spaces to the UIL, it can directly move the data from the GPU memory 121 to the SSD 110. - Similarly, when data are transferred from the SSD 110 to the GPU 120, the NDMA 622 generates the PRP1 entry pointing to the logical block address of the system memory block 700, which is mapped to the GPU memory 121 for writing the data to be transferred to the GPU 120. In a case where the amount of data to be transferred to the GPU 120 is greater than 4 KB, the NDMA 622 generates the PRP entries pointing to the logical block addresses of the system memory block 700, which are mapped to the GPU memory 121 for writing the remaining data, and generates the PRP2 entry pointing to the PRP list including these PRP entries. Since the NDMA 622 exports the allocated memory spaces to the UIL, it can directly move the data from the SSD 110 to the GPU memory 121. - In some embodiments, the memory-mapped register 710 may further include a control register set above the I/O submission region 711. The control register set may start from the BAR. The control register set may be used for managing NVMe work such as updating a doorbell register and interrupt management. The memory-mapped register 710 may further include I/O completion queues (an I/O completion region) below the I/O submission region 711 and a data region below the I/O completion region. - In this case, the GPU application 600 notifies the disk controller of the submission command using the doorbell register of the control register set, and the disk controller fetches the submission command from the I/O submission queue and processes it. The submission command including the PRP entries may be transferred to the disk controller and be used for the read/write of the SSD 110. Accordingly, the disk controller can transfer the data of the SSD 110 to the GPU memory 121 pointed to by the PRP entries of the item in the submission command, or transfer the data of the GPU memory 121 pointed to by the PRP entries to the SSD 110. - Since the pre-allocated memory space is exported to the UIL 621, the NDMA 622 can directly upload or download the GPU data while letting the other kernel components serve file-related work such as LBA translation in an appropriate manner. Since the kernel buffers of the NDMA 622 are managed as a pre-allocated memory pool, they may not be released until all data movement activities involving the file data are over. To implement this, an interrupt service routine (ISR) registered at the driver's NVMe initialization time may be modified. - Next, an example of an AHCI SSD is described with reference to
FIG. 8. Compared with the NVMe, the AHCI has a different data management structure but employs a similar strategy for the data transfer between the GPU and the SSD. -
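In both the NVMe and the AHCI case, the NDMA's core bookkeeping is the mapping from device-visible logical block addresses in the system memory block to physical blocks of the GPU memory. The following C sketch illustrates that table; the array size, function names, and addresses are illustrative assumptions, not details from the patent:

```c
#include <stdint.h>
#include <assert.h>

#define NBLOCKS 8  /* illustrative table size, not from the patent */

/* Toy LBA -> GPU PBA map: pba_of[lba] holds the GPU-memory physical block
 * address that the NDMA associated with that device-visible LBA. */
static uint64_t pba_of[NBLOCKS];

static void ndma_map(uint32_t lba, uint64_t gpu_pba) { pba_of[lba] = gpu_pba; }

/* Resolving an LBA yields the mapped GPU physical block automatically,
 * which is what lets the disk controller DMA straight into GPU-mapped pages. */
static uint64_t ndma_resolve(uint32_t lba) { return pba_of[lba]; }
```

The point of the sketch is only that once the table is filled, a block address the device already understands resolves to GPU memory with no per-transfer CPU copy.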
FIG. 8 schematically shows a data movement between an SSD and a GPU through an AHCI protocol in a computing device according to an embodiment of the present invention. - Referring to
FIG. 8, an NDMA 622 uses a system memory block 800 mapped to a GPU memory 121. The system memory block 800 is a kernel buffer allocated in a CPU memory 131, and includes a memory-mapped register 810 and a GPU pinned memory space 820. The memory-mapped register 810 is a register managed by a disk controller (for example, an AHCI controller) for an SSD 110, and the GPU pinned memory space 820 is a space mapped to the GPU memory 121. - The memory-mapped register 810 includes a generic host control 811 and multiple port registers 812, and a start offset of the memory-mapped register 810 may be indicated by an AHCI base address register (ABAR). The multiple port registers 812 indicate a plurality of ports, and each port may represent an individual SSD in an SSD array. The multiple port registers 812 include two meta-data structures: a command list 812a and a received FIS (frame information structure) structure 812b. The command list 812a includes a plurality of command headers, for example 32 command headers. The received FIS 812b is used for handshaking control such as a device-to-host (D2H) acknowledge FIS, and each command header refers to a physical region descriptor table (PRDT). - There are a plurality of entries in the PRDT, for example 65536 entries, and each PRDT entry points to a system memory block managed by the NDMA 622. Each PRDT entry may point to a logical block address corresponding to addresses of the GPU pinned memory 820. - In the AHCI, a maximum buffer size of each PRDT entry may be 4 MB. In some embodiments, the buffer may be split into multiple physical pages with a predetermined size (for example, multiple 4 KB physical pages) to make them compatible with the PRP management policy employed by the GPU 120. As the direct media interface (DMI) of the AHCI shares the physical characteristics of the PCIe links, interrupts delivered by the FIS are converted to a PCIe interrupt packet, which allows the NDMA 622 to manage an interrupt service routine (ISR) in a similar fashion to what is done in the NVMe. -
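The 4 KB splitting described above is simple page accounting. A minimal C sketch, assuming the 4 MB per-entry limit and 4 KB pages mentioned in the text; the function name and the zero-return convention for invalid sizes are illustrative choices, not part of the AHCI specification:

```c
#include <assert.h>

#define PRD_MAX_BYTES (4u * 1024u * 1024u)  /* 4 MB maximum per PRDT entry */
#define PAGE_BYTES    4096u                 /* 4 KB physical page */

/* Carve one PRDT entry's buffer into 4 KB physical pages so it matches the
 * PRP-style management used for the GPU pinned memory. Returns the page
 * count, or 0 if the size is zero or exceeds the per-entry limit. */
static unsigned prdt_pages(unsigned buf_bytes)
{
    if (buf_bytes == 0 || buf_bytes > PRD_MAX_BYTES)
        return 0;
    return (buf_bytes + PAGE_BYTES - 1) / PAGE_BYTES;  /* round up */
}
```

A full 4 MB buffer thus splits into 1024 pages, and any partial trailing page still costs one full page of mapping.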
FIG. 9 schematically shows a GPU programming model on a software stack of a computing device according to an embodiment of the present invention. - Referring to
FIG. 9, a GPU application 200 creates a file descriptor for initializing a UIL 621 and an NDMA 622. The GPU application 200 may use, for example, an nvmmuBegin() function as the file descriptor to initialize the UIL 621 and the NDMA 622. - A thread ID (tid) of a requester and a file name (w_filename) to be moved may be input, as parameters, to the nvmmuBegin() function, like nvmmuBegin(tid, w_filename). The nvmmuBegin() function may keep the thread ID (tid) of the requester for internal resource management, and may send piggyback information about parity block pipelining before starting the movement of the file data.
- The
GPU application 200 allocates a GPU memory 121 for the read/write of data. For this, the GPU application 200 may use, for example, a cudaMalloc() function. In a case of the write, an address (&pGPUInP2P) of the GPU memory for writing the data and an amount (nImageDataSize) of the data to be written may be input, as parameters, to the cudaMalloc() function, like cudaMalloc(&pGPUInP2P, nImageDataSize). In a case of the read, an address (&pGPUOutP2P) of the GPU memory for reading the data and an amount (nImageDataSize) of the data to be read may be input, as parameters, to the cudaMalloc() function, like cudaMalloc(&pGPUOutP2P, nImageDataSize). - After allocating the GPU memory 121, the GPU application 200 moves data by specifying a file name, an offset, and a number of bytes (length) of the data to be transferred from the SSD 110 to the GPU 120. The GPU application 200 may call, for example, an nvmmuMove() function for the data movement. The nvmmuMove() function may create a data path between the SSD 110 and the GPU 120 based on the allocated addresses of the GPU memory 121 and the PRP entries pointing to the addresses of the GPU memory 121, and may move the data taking into account the file name, the offset, and the amount of data. The file name (r_filename) of the data, the address (pGPUInP2P) of the GPU memory 121 for writing the data, an offset of 0, the amount of data (nImageDataSize), and a data movement direction (D2H) may be input, as parameters, to the nvmmuMove() function, like nvmmuMove(r_filename, pGPUInP2P, 0, nImageDataSize, D2H). The D2H parameter indicates a device-to-host direction, i.e., the data movement from the SSD 110 to the GPU 120.
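The 4 KB PRP accounting that such a transfer relies on (described earlier for the NVMe path) can be sketched as follows. The sketch folds the case of a direct second-page PRP2 pointer into the list for simplicity, and the function name is an illustrative assumption, not an API from the patent:

```c
#include <assert.h>

#define PRP_PAGE_SIZE 4096ULL  /* 4 KB PRP memory block, as described above */

/* Simplified PRP accounting: the PRP1 entry covers the first 4 KB block, and
 * any remaining data is described by entries on a PRP list referenced by the
 * PRP2 entry. Returns how many PRP-list entries a transfer of nbytes needs
 * (0 means PRP1 alone suffices). */
static unsigned prp_list_entries(unsigned long long nbytes)
{
    if (nbytes <= PRP_PAGE_SIZE)
        return 0;
    unsigned long long remaining = nbytes - PRP_PAGE_SIZE;
    return (unsigned)((remaining + PRP_PAGE_SIZE - 1) / PRP_PAGE_SIZE);
}
```

So a 16 KB nvmmuMove() would need the PRP1 entry plus three list entries under this simplified scheme.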
- Next, when the
GPU application 200 needs to store a result generated by the GPU 120, the GPU application 200 moves the result data by specifying a file name, an offset, and the number of bytes (length) of the data to be transferred from the GPU 120 to the SSD 110. The GPU application 200 may call, for example, an nvmmuMove() function for the data movement. The file name (r_filename) of the data, the address (pGPUOutP2P) of the GPU memory 121 for reading the data, an offset of 0, the amount of data (nImageDataSize), and a data movement direction (H2D) may be input, as parameters, to the nvmmuMove() function, like nvmmuMove(r_filename, pGPUOutP2P, 0, nImageDataSize, H2D). The H2D parameter indicates a host-to-device direction, i.e., the data movement from the GPU 120 to the SSD 110. - After all of the processes are completed, the
GPU application 200 cleans up resources which the UIL 621 and the NDMA 622 use for the thread. The GPU application 200 may clean up the resources through, for example, an nvmmuEnd() function. The thread ID (tid) may be input, as a parameter, to the nvmmuEnd() function, like nvmmuEnd(tid). -
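The FIG. 9 programming model above can be summarized as a call sequence. In this sketch the UIL functions and cudaMalloc() are replaced by trivial stand-in stubs so that only the ordering is exercised; none of the bodies reflect the real drivers, and the trace strings are illustrative:

```c
#include <string.h>
#include <assert.h>

static char trace[256];
static void step(const char *s) { strcat(trace, s); strcat(trace, ";"); }

/* Stubs standing in for the patent's UIL calls and for cudaMalloc(). */
static void nvmmuBegin(int tid, const char *fname) { (void)tid; (void)fname; step("begin"); }
static void gpuAlloc(void) { step("alloc"); }          /* stands in for cudaMalloc() */
static void nvmmuMove(const char *dir) { step(dir); }  /* "D2H" or "H2D" */
static void runKernel(void) { step("kernel"); }
static void nvmmuEnd(int tid) { (void)tid; step("end"); }

/* The FIG. 9 flow: initialize, allocate, download, compute, upload, clean up. */
static const char *gpu_app(void)
{
    trace[0] = '\0';
    nvmmuBegin(1, "input.dat");
    gpuAlloc();        /* GPU memory for the input and the result */
    nvmmuMove("D2H");  /* file data: SSD -> GPU memory */
    runKernel();       /* GPU processes the data */
    nvmmuMove("H2D");  /* result: GPU memory -> SSD */
    nvmmuEnd(1);
    return trace;
}
```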
FIG. 10 schematically shows a data movement between a GPU and an SSD in a computing device according to an embodiment of the present invention. - Referring to
FIG. 10, a GPU application 200 creates, on a kernel, a file descriptor for a read and/or a write (S1010). The GPU application 200 then allocates a GPU memory 121 for writing data to the GPU 120 or reading data from the GPU 120 (S1020). Accordingly, physical block addresses of the allocated GPU memory 121 are mapped to logical block addresses of a system memory block associated with addresses of the SSD 110. - The GPU application 200 requests a file read for the SSD 110 (S1030). Then, the file data are transferred from the SSD 110 to the GPU memory 121 through the mappings of the system memory block (S1040). Consequently, the GPU 120 processes the file data. - In a case where the GPU application 200 needs to store a result that the GPU 120 has generated after processing the file data, the GPU application 200 requests a file write for the GPU 120 (S1050). Then, the file data are transferred from the GPU memory 121 to the SSD 110 through the mappings of the system memory block (S1060). - In FIG. 10, the steps S1010, S1020, S1030, and S1050 may be processes associated with the NVMMU. The step S1040 may be a response of the SSD 110, and the step S1060 may be a response of the GPU 120. - In some embodiments, the data transfer method described above may be applied to a redundant array of independent disks (RAID)-based SSD array. For the RAID-based SSD array, a software-based array controller driver may be modified to abstract multiple SSDs as a single virtual storage device. Since a GPU has neither an OS nor resource management capabilities, a host-side GPU application may in practice have all of the information regarding file data movement, such as a target data size, a file location, and the timing for data download, prior to beginning GPU-kernel execution. The nvmmuBegin() function may pass a file name to be downloaded from the SSD 110 to the UIL 621, and the UIL 621 may feed this information to an array controller driver, i.e., the NDMA 622. Then, the array controller driver may read an old version of the target file data and the corresponding parity blocks at an early stage of the GPU body code-segments using the information. Consequently, the array controller driver may load the old data and prepare new parity blocks while the GPU 120 and the CPU 130 prepare for a data movement and the execution of the GPU kernel. This parity block pipelining strategy can enable all parity block preparations to be done in parallel with performing a data movement between the GPU 120 and the CPU 130 and/or executing the GPU kernel. Accordingly, the performance degradation exhibited by conventional RAID systems can be eliminated. - As described above, according to an embodiment of the present invention, since data can be directly moved between the GPU and the SSD without significant CPU intervention, redundant data copies arising from the virtual memory allocation of the CPU memory can be reduced, and the overheads due to these copies and the user mode and kernel mode switching for the copies can be reduced. Accordingly, application execution times through the GPU can be reduced, and the data movement overheads can be reduced.
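The parity preparation that the pipelining strategy above overlaps with the GPU work is the standard RAID-5 read-modify-write update (new parity = old parity XOR old data XOR new data). This rule is general RAID background, not code from the patent:

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* RAID-5 small-write parity update: for each byte of the stripe unit,
 * parity becomes old_parity ^ old_data ^ new_data. Reading old_data and the
 * old parity early is exactly what the parity block pipelining prepares
 * while the GPU kernel is still running. */
static void raid5_update_parity(uint8_t *parity, const uint8_t *old_data,
                                const uint8_t *new_data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```

Because XOR is associative, the old-data and old-parity reads can be issued as soon as the target file is known, independently of when the new data arrive from the GPU.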
- Since file-associated GPU operations are implemented as a virtual file system extension, a UIL-assisted GPU application can be complied just like a normal GPU program and then no compiler modification is required. Further, the computing device can still use all functionality of I/O runtime and GPU runtime libraries, which means that the NVMMU is fully compatible with all existing GPU applications.
- Next, performance improvement of an NVMMU according to an embodiment of the present invention is described with reference to
FIG. 11 and FIG. 12. -
FIG. 11 shows latency values in transferring file data for a GPU application, and FIG. 12 shows execution times of a GPU application. - As shown in
FIG. 11, it is noted that an NVMMU using an NVMe protocol (hereinafter referred to as an "NVMe-NVMMU") reduces the latency of data movement, compared to an NVMe-IOMMU, by 202%, 70%, 112%, and 108% for the PolyBench, Mars, Rodinia, and Parboil benchmarks, respectively. The NVMe-IOMMU means a memory management unit that uses an NVMe protocol in a conventional computing device as described with reference to FIG. 2 to FIG. 5. As shown in FIG. 12, it is noted that the NVMe-NVMMU reduces the application execution times, compared to the NVMe-IOMMU, by 192%, 14%, 69%, and 37% for the PolyBench, Mars, Rodinia, and Parboil benchmarks, respectively. - These performance improvements can be provided because the NVMMU can reduce the redundant memory copies and the user mode and kernel mode switching overheads as described above.
- While it has been described in the above embodiments of the present invention that the GPU and the SSD are examples of the coprocessor and the non-volatile memory, respectively, a data transfer method (i.e., the NVMMU) according to an embodiment of the present invention may be applied to other coprocessors and/or other file I/O-based non-volatile memories.
- While this invention has been described in connection with what is presently considered to be practical embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/010,583 US10303597B2 (en) | 2016-02-15 | 2018-06-18 | Computing device, data transfer method between coprocessor and non-volatile memory, and computer-readable recording medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2016-0017233 | 2016-02-15 | ||
KR1020160017233A KR101936950B1 (en) | 2016-02-15 | 2016-02-15 | Computing device, data transfer method between coprocessor and non-volatile memory, and program including the same |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/010,583 Continuation US10303597B2 (en) | 2016-02-15 | 2018-06-18 | Computing device, data transfer method between coprocessor and non-volatile memory, and computer-readable recording medium |
Publications (2)
Publication Number | Publication Date |
---|---|
US20170235671A1 true US20170235671A1 (en) | 2017-08-17 |
US10013342B2 US10013342B2 (en) | 2018-07-03 |
Family
ID=59561531
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/168,423 Active 2036-11-17 US10013342B2 (en) | 2016-02-15 | 2016-05-31 | Computing device, data transfer method between coprocessor and non-volatile memory, and computer-readable recording medium |
US16/010,583 Active US10303597B2 (en) | 2016-02-15 | 2018-06-18 | Computing device, data transfer method between coprocessor and non-volatile memory, and computer-readable recording medium |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/010,583 Active US10303597B2 (en) | 2016-02-15 | 2018-06-18 | Computing device, data transfer method between coprocessor and non-volatile memory, and computer-readable recording medium |
Country Status (2)
Country | Link |
---|---|
US (2) | US10013342B2 (en) |
KR (1) | KR101936950B1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109062929A (en) * | 2018-06-11 | 2018-12-21 | Shanghai Jiao Tong University | Query task communication method and system
JP2019133662A (en) * | 2018-02-02 | 2019-08-08 | Samsung Electronics Co., Ltd. | System and method for machine learning including key value access
US10379745B2 (en) * | 2016-04-22 | 2019-08-13 | Samsung Electronics Co., Ltd. | Simultaneous kernel mode and user mode access to a device using the NVMe interface |
CN111158898A (en) * | 2019-11-25 | 2020-05-15 | 国网浙江省电力有限公司建设分公司 | BIM data processing method and device aiming at power transmission and transformation project site arrangement standardization |
CN112513887A (en) * | 2018-08-03 | 2021-03-16 | 西门子股份公司 | Neural logic controller |
US20210288793A1 (en) * | 2017-08-30 | 2021-09-16 | Intel Corporation | Technologies for providing streamlined provisioning of accelerated functions in a disaggregated architecture |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101936950B1 (en) * | 2016-02-15 | 2019-01-11 | MemRay Corporation | Computing device, data transfer method between coprocessor and non-volatile memory, and program including the same
KR101923661B1 (en) | 2016-04-04 | 2018-11-29 | MemRay Corporation | Flash-based accelerator and computing device including the same
KR101943312B1 (en) * | 2017-09-06 | 2019-01-29 | MemRay Corporation | Flash-based accelerator and computing device including the same
US10929059B2 (en) | 2016-07-26 | 2021-02-23 | MemRay Corporation | Resistance switching memory-based accelerator |
US11748418B2 (en) * | 2018-07-31 | 2023-09-05 | Marvell Asia Pte, Ltd. | Storage aggregator controller with metadata computation control |
KR102355374B1 (en) | 2019-09-27 | 2022-01-25 | 에스케이하이닉스 주식회사 | Memory management unit capable of managing address translation table using heterogeneous memory, and address management method thereof |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080184273A1 (en) * | 2007-01-30 | 2008-07-31 | Srinivasan Sekar | Input/output virtualization through offload techniques |
US20130141443A1 (en) * | 2011-12-01 | 2013-06-06 | Michael L. Schmit | Software libraries for heterogeneous parallel processing platforms |
US20140164732A1 (en) * | 2012-12-10 | 2014-06-12 | International Business Machines Corporation | Translation management instructions for updating address translation data structures in remote processing nodes |
US20160041917A1 (en) * | 2014-08-05 | 2016-02-11 | Diablo Technologies, Inc. | System and method for mirroring a volatile memory of a computer system |
US20160170849A1 (en) * | 2014-12-16 | 2016-06-16 | Intel Corporation | Leverage offload programming model for local checkpoints |
US20170206169A1 (en) * | 2016-01-15 | 2017-07-20 | Stmicroelectronics (Grenoble 2) Sas | Apparatus and methods implementing dispatch mechanisms for offloading executable functions |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060075164A1 (en) | 2004-09-22 | 2006-04-06 | Ooi Eng H | Method and apparatus for using advanced host controller interface to transfer data |
KR101936950B1 (en) * | 2016-02-15 | 2019-01-11 | 주식회사 맴레이 | Computing device, data transfer method between coprocessor and non-volatile memory, and program including the same |
- 2016
- 2016-02-15: KR application KR1020160017233A granted as KR101936950B1 (IP Right Grant, active)
- 2016-05-31: US application US15/168,423 granted as US10013342B2 (active)
- 2018
- 2018-06-18: US application US16/010,583 granted as US10303597B2 (active)
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10379745B2 (en) * | 2016-04-22 | 2019-08-13 | Samsung Electronics Co., Ltd. | Simultaneous kernel mode and user mode access to a device using the NVMe interface |
US20210288793A1 (en) * | 2017-08-30 | 2021-09-16 | Intel Corporation | Technologies for providing streamlined provisioning of accelerated functions in a disaggregated architecture |
US11522682B2 (en) * | 2017-08-30 | 2022-12-06 | Intel Corporation | Technologies for providing streamlined provisioning of accelerated functions in a disaggregated architecture |
JP2019133662A (en) * | 2018-02-02 | 2019-08-08 | Samsung Electronics Co., Ltd. | System and method for machine learning including key value access
US11907814B2 (en) | 2018-02-02 | 2024-02-20 | Samsung Electronics Co., Ltd. | Data path for GPU machine learning training with key value SSD |
CN109062929A (en) * | 2018-06-11 | 2018-12-21 | Shanghai Jiao Tong University | Query task communication method and system |
CN112513887A (en) * | 2018-08-03 | 2021-03-16 | 西门子股份公司 | Neural logic controller |
CN111158898A (en) * | 2019-11-25 | 2020-05-15 | Construction Branch of State Grid Zhejiang Electric Power Co., Ltd. | BIM data processing method and device for standardizing site arrangement in power transmission and transformation projects |
Also Published As
Publication number | Publication date |
---|---|
US20180300230A1 (en) | 2018-10-18 |
US10303597B2 (en) | 2019-05-28 |
KR20170095607A (en) | 2017-08-23 |
KR101936950B1 (en) | 2019-01-11 |
US10013342B2 (en) | 2018-07-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10303597B2 (en) | Computing device, data transfer method between coprocessor and non-volatile memory, and computer-readable recording medium |
US11550477B2 (en) | Processing host write transactions using a non-volatile memory express controller memory manager | |
Zhang et al. | NVMMU: A non-volatile memory management unit for heterogeneous GPU-SSD architectures |
US10831376B2 (en) | Flash-based accelerator and computing device including the same | |
US11379374B2 (en) | Systems and methods for streaming storage device content | |
US9092426B1 (en) | Zero-copy direct memory access (DMA) network-attached storage (NAS) file system block writing | |
KR20140001924A (en) | Controller and method for performing background operations | |
JP2008527496A (en) | Intelligent storage engine for disk drive operation with reduced local bus traffic | |
US9116809B2 (en) | Memory heaps in a memory model for a unified computing system | |
US8930596B2 (en) | Concurrent array-based queue | |
CN112416250A (en) | NVMe (network video Me) -based command processing method for solid state disk and related equipment | |
EP3270293B1 (en) | Two stage command buffers to overlap iommu map and second tier memory reads | |
US10459662B1 (en) | Write failure handling for a memory controller to non-volatile memory | |
CN114356598A (en) | Data interaction method and device for Linux kernel mode and user mode | |
US10831684B1 (en) | Kernel driver extension system and method |
US20180107619A1 (en) | Method for shared distributed memory management in multi-core solid state drive | |
KR102000721B1 (en) | Computing device, data transfer method between coprocessor and non-volatile memory, and program including the same | |
US20230359392A1 (en) | Non-volatile memory-based storage device, device controller and method thereof | |
TW202307672A (en) | Storage controller, computational storage device, and operational method of computational storage device | |
CN112486410A (en) | Method, system, device and storage medium for reading and writing persistent memory file | |
US11689621B2 (en) | Computing device and storage card | |
US10430220B1 (en) | Virtual devices as protocol neutral communications mediators | |
US20170031633A1 (en) | Method of operating object-oriented data storage device and method of operating system including the same | |
US20220027295A1 (en) | Non-volatile memory controller device and non-volatile memory device | |
Chan et al. | Rethinking network stack design with memory snapshots |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: MEMRAY CORPORATION, KOREA, REPUBLIC OF Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:JUNG, MYOUNGSOO;REEL/FRAME:038748/0335 Effective date: 20160524 Owner name: YONSEI UNIVERSITY, UNIVERSITY - INDUSTRY FOUNDATION, KOREA, REPUBLIC OF Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:JUNG, MYOUNGSOO;REEL/FRAME:038748/0335 Effective date: 20160524 |
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2551); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Year of fee payment: 4 |