US20200265543A9 - Unified memory systems and methods - Google Patents
Unified memory systems and methods Download PDFInfo
- Publication number
- US20200265543A9 US20200265543A9 US16/237,010 US201816237010A US2020265543A9 US 20200265543 A9 US20200265543 A9 US 20200265543A9 US 201816237010 A US201816237010 A US 201816237010A US 2020265543 A9 US2020265543 A9 US 2020265543A9
- Authority
- US
- United States
- Prior art keywords
- processing unit
- memory location
- pointer
- managed
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 230000015654 memory Effects 0.000 title claims abstract description 158
- 238000000034 method Methods 0.000 title claims abstract description 36
- 238000012545 processing Methods 0.000 claims description 63
- 230000004044 response Effects 0.000 claims 9
- 230000008569 process Effects 0.000 abstract description 16
- 230000000977 initiatory effect Effects 0.000 abstract description 3
- 238000013459 approach Methods 0.000 description 15
- 238000007726 management method Methods 0.000 description 15
- 238000010586 diagram Methods 0.000 description 11
- 238000004891 communication Methods 0.000 description 9
- 238000013507 mapping Methods 0.000 description 5
- 230000009471 action Effects 0.000 description 4
- 230000008901 benefit Effects 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 238000012986 modification Methods 0.000 description 4
- HPTJABJPZMULFH-UHFFFAOYSA-N 12-[(Cyclohexylcarbamoyl)amino]dodecanoic acid Chemical compound OC(=O)CCCCCCCCCCCNC(=O)NC1CCCCC1 HPTJABJPZMULFH-UHFFFAOYSA-N 0.000 description 3
- 239000000872 buffer Substances 0.000 description 3
- 238000013500 data storage Methods 0.000 description 3
- 239000011800 void material Substances 0.000 description 3
- 230000004888 barrier function Effects 0.000 description 2
- 230000001427 coherent effect Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 230000001960 triggered effect Effects 0.000 description 2
- 230000002411 adverse Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 230000008878 coupling Effects 0.000 description 1
- 238000010168 coupling process Methods 0.000 description 1
- 238000005859 coupling reaction Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 230000001902 propagating effect Effects 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 230000001360 synchronised effect Effects 0.000 description 1
- 239000002699 waste material Substances 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
- G06F12/109—Address translation for multiple virtual address spaces, e.g. segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
Definitions
- the present invention relates to the field of memory configuration.
- virtual addressing associated with unified memory is implemented with pointer coordination across multiple entities.
- Electronic systems and devices facilitate increased productivity and reduced costs in analyzing and communicating various types of data.
- These electronic systems e.g., digital computers, calculators, audio devices, video equipment, telephone systems, etc.
- These electronic systems typically include various components that need access to memory to implement their desired functionality or operations.
- Conventional attempts at utilizing virtual addresses and pointers across various components of a system are typically complicated and can have undesirable impacts.
- VA virtual address
- Some traditional approaches attempt to reserve a large CPU VA chunk from the OS and have the GPU driver allocate only in this VA range. However, this approach often has a number of drawbacks including possible waste of CPU VA space if a large chunk is reserved initially but the actual amount of space that is required or utilized is much less. In some systems (e.g., on 32 bit CPU, etc.) the VA space can be considered relatively small and reserving large chunks of CPU VA space for the GPU can result in lower system utilization and inadequate VA space remaining available for operations of the other components (e.g., CPU, etc.).
- Some programs e.g., a CUDA program, etc.
- Some programs often need to maintain two copies of data and need fast access to the data from both the CPU and the GPU.
- the user or programmer usually has to take explicit actions to ensure both copies of data associated with the pointers or addresses are consistent.
- This can become a very complicated and extensive task which increases the workload and effort required by a user and in turn can also increase the barrier to entry for novice users of the programs.
- These added burdens and difficulties increase the likelihood of programming mistakes that adversely impact system performance (e.g., increased faults, non-coherent data, etc.).
- Presented embodiments facilitate efficient and effective utilization of unified virtual addresses across multiple components.
- the presented new approach or solution uses Operating System (OS) allocation on the central processing unit (CPU) combined with graphics processing unit (GPU) driver mappings to provide a unified virtual address (VA) across both GPU and CPU.
- OS Operating System
- GPU graphics processing unit
- VA unified virtual address
- the new approach helps ensure that a GPU VA pointer does not collide with a CPU pointer provided by OS CPU allocation (e.g., like one returned by “malloc” C runtime API, etc.).
- an address allocation process comprises: establishing space for managed pointers across a plurality of memories, including allocating one of the managed pointers with a first portion of memory associated with a first one of a plurality of processors; and performing a process of automatically managing accesses to the managed pointers across the plurality of processors and corresponding memories.
- the automated management can include ensuring consistent information associated with the managed pointers is copied from the first portion of memory to a second portion of memory associated with a second one of the plurality of processors based upon initiation of an access to the managed pointers from the second one of the plurality of processors.
- Establishing space for managed pointers can include reserving a region from the first processor's virtual address space and reserving a region from the second processor's virtual address space, wherein the regions are reserved for allocations of the managed pointers. Data coherency and concurrency across the memories can be automatically maintained.
- the managed memory can be part of a unified memory.
- the second one of the plurality of processors is a central processing unit (CPU) and the first one of a plurality of processors is a graphics processing unit (GPU).
- CPU PA central processing unit physical addresses
- a system comprises: a first processor for processing information and a second processor for processing information, wherein accesses associated with a pointer are automatically managed across multiple memories associated with the first processor and the second processor. Accesses associated with the pointer can be automatically managed back and forth between the first processor and the second processor according to which processor is accessing the pointer.
- an API managed memory allocation call triggers the automatic management of the pointer and a driver manages the memories associated with the pointer.
- the pointer can be accessed and used across multiple different processors. Movement or copying of information between processors can be automated and transparent to the user utilizing a single managed pointer without having to be concerned about concurrency or coherency of data between the different processors or memories.
- the pointer is accessible from multiple entities.
- a tangible computer readable medium includes instructions for directing a processor in the implementation of an address allocation process.
- the address allocation process comprises: allocating a pointer to a first portion of memory associated with a first processor, wherein the pointer is also utilized by a second processor, and managing accesses to the pointer automatically.
- Managing the accesses includes making sure appropriate consistent information associated with the pointer is copied to a second portion of physical memory associated with the second processor, wherein the copying is done based on attempts to access the information by the second processor. The copying can be done based on accesses.
- Allocation of managed memory can include utilization of an API.
- a device variable can have the same restrictions as a returned allocation.
- allocation of managed memory includes utilization of a keyword that can be applied to device variables. There can be support for page faults to a pointer associated with accesses by the second processor.
- FIG. 1 is a flow chart of an exemplary automated unified memory management method in accordance with one embodiment of the present invention.
- FIG. 2 is a block diagram of exemplary memory space reservation n accordance with one embodiment.
- FIG. 3 is a block diagram of exemplary memory spaces associated with an API managed pointer memory allocation call in accordance with one embodiment.
- FIG. 4 is a block diagram of exemplary memory spaces associated with an access call from a different entity in accordance with one embodiment.
- FIG. 5 is a block diagram of exemplary memory space associated with a launch in accordance with one embodiment.
- FIG. 6 is a block diagram of an exemplary computer system, one embodiment of a computer system upon which embodiments of the present invention can be implemented.
- an automatically managed unified memory allows an application to use a single pointer to access data associated with the pointer from multiple locations.
- the “managed” pointer can be accessed or used across multiple different entities (e.g., a kernel, a processor, CPU, GPU, etc.).
- the single pointer can be associated with automatically managed memory.
- managed refers to a memory space that is automatically managed by a driver (e.g., graphics device driver, etc.).
- an automatically managed unified memory differs from a conventional unified memory by allowing virtual address spaces associated with different entities (e.g., different processors, GPU, CPU, etc.) to be treated as if it is one memory space.
- Treating multiple memories as single memory space relieves a user from having to explicitly direct many of the multiple memory management activities.
- a unified set of page tables is not necessarily used and there can actually be multiple different sets of page tables.
- memory space associated with a pointer is reserved and accesses by multiple entities to the pointer are automatically managed.
- an automatically managed unified memory creates a management memory space to be used in unified memory.
- management memory space is created by allocating unified memory space as managed memory. At times the management memory space can automatically be made local enabling “local” access to associated data.
- the managed address space can be in memory associated with a host (e.g., CPU) or memory associated with a device (e.g., GPU). Having data be present locally typically increases the performance of those accesses, as opposed to using remote memory access (e.g., over PCI, etc.).
- the automated management of the memory spaces enables the system to take care of putting the data where it is necessary or appropriate based on accesses.
- a page fault handler manages migration of pages belonging to the managed memory allocations, migrating them back and forth between CPU memory and GPU memory. Consistency is also automatically maintained across multiple memories (e.g., latest concurrence, etc.). Normally the address range representing an allocation for managed memory is not mapped in the CPU's virtual address space.
- the page fault handler upon CPU access of unified memory data, copies the appropriate page from GPU memory to CPU memory and maps it into the CPU's virtual address space, allowing the CPU to access the data.
- the managing can include various activities.
- a GPU when a GPU is accessing a pointer the automated management makes sure the appropriate consistent information or data associated with the pointer is put on or moved to the GPU
- the CPU when the CPU is accessing the pointer the automated management makes sure the appropriate consistent information or data associated with the pointer is put on or moved to the CPU.
- the movement or copying of information between the processors can be automated and transparent to the user by utilizing the single “managed” pointer.
- a user or programmer can utilize the single managed pointer without having to be concerned about the concurrency or coherency of data between the different processors or memories (e.g., CPU, GPU, etc.).
- the automatic managed memory approach can enable a CPU access to GPU data.
- CPU page faults to the same location or single pointer in the unified memory can also be automatically handled, even though there may be two distinct or discrete physical memories (e.g., the CPU and the GPU, etc.).
- FIG. 1 is a flow chart of an exemplary automated unified memory management method in accordance with one embodiment of the present invention.
- space for managed pointers is established across a plurality of memories.
- one of the managed pointers is allocated to a first portion of memory associated with a first one of a plurality of processors.
- the managed memory is part of a unified memory.
- establishing a managed memory includes a processor reserving one or more regions from the processors' virtual address space.
- a GPU physical address (GPU PA) can be mapped to an allocated central processing address (CPU VA).
- accesses associated with the single pointer are automatically managed across a plurality of processors and memories.
- the automatic management includes ensuring consistent information associated with the managed pointers is copied from the first portion of memory to a second portion of memory associated with a second one of the plurality of processors based upon initiation of an accesses to the managed pointers from the second one of the plurality of processors.
- a CPU attempts to access the pointer
- physical space in the CPU PA is allocated
- the portion of the GPU PA is automatically copied to the CPU PA, and the address in the CPU VA is mapped to the newly allocated CPU physical memory.
- a novel API managed memory allocation call triggers an automated unified memory management method.
- the API managed memory allocation call can instruct a driver (e.g., GPU driver, etc.) to automatically manage the memory.
- the novel API call includes a GPU cudaMallocManaged call.
- a cudaMallocManaged call returns pointers within a reserved VA range associated with managed memory. Reserving a certain VA range for use by a pointer in multiple VA spaces ensures the pointer can be used in multiple VA spaces (e.g., CPU and GPU memory spaces, etc.).
- FIGS. 2 through 5 are block diagrams of exemplary memory spaces associated with an automated unified memory management process in accordance with one embodiment.
- FIG. 2 is a block diagram of exemplary memory space reservation n accordance with one embodiment.
- Managed memory chunks or addresses 1511 in GPU VA 1510 and corresponding managed memory chunks or addresses 1591 in CPU VA 1590 are reserved.
- the reserved managed memory chunks or addresses 1511 and 1591 are the same size.
- a pointer managed by a particular driver can be used and accessed by multiple processors because accesses to the reserved managed memory space by other “non-managed” pointers (e.g., pointers not managed by the particular device) is prevented.
- code includes an allocation call associated with a non-managed pointer, (e.g., if GPU code calls cudaMalloc, if CPU code calls Malloc, etc.) the system will use or allocate a part of the VA space that has not been reserved for managed pointer memory (e.g., an address outside the reserved range is returned for allocation to the “non-managed” pointer, etc.).
- a non-managed pointer e.g., if GPU code calls cudaMalloc, if CPU code calls Malloc, etc.
- the reservation can be initiated by a GPU driver.
- the driver provides an opt-in allocator to the application to allocate out of these regions.
- the processor examines how much memory is in the system between the CPU and GPUs and a large enough total of managed memory is reserved.
- a matching range is reserved in the VA space of multiple GPUs.
- the reserved VA ranges do not initially map to any physical memory. Normally, the address range representing an allocation is not initially mapped in the GPU's or CPU's virtual address space. The physical pages backing the VA allocations are created or mapped in GPU and CPU memory.
- FIG. 3 is a block diagram of exemplary memory spaces associated with an API managed pointer memory allocation call in accordance with one embodiment.
- addresses or locations within the reserved VA range are returned and a chunk 1522 from the reserved range 1511 in the GPU VA space 1510 is allocated to the managed pointer Ptr.
- Pages or addresses “A” from the GPU PA 1530 are allocated and mapped to GPU VA 1522 in GPU page table 1520 map entry 1521 .
- a GPU kernel mode driver is also notified of the new managed allocation. Now the GPU side mapping is set up and a GPU kernel which accesses the allocation can use the physical memory mapped under it.
- FIG. 4 is a block diagram of exemplary memory spaces associated with an access call from a different entity in accordance with one embodiment.
- the kernel mode driver which was previously notified of the allocation handles the page fault.
- a physical page or address “B” is allocated from the CPU PA 1570 .
- the driver copies the data contents of the corresponding GPU physical page or address “A” into the CPU physical page or address “B”.
- the CPU virtual page or address 1592 is mapped to the physical page “B” by the mapping 1581 in the CPU page table 1580 . Control returns to the user code on the CPU which triggered the fault.
- the virtual address 1592 is now valid, and the access which faulted is retried and operations are directed to the CPU physical memory page or address “B”.
- FIG. 5 is a block diagram of exemplary memory space associated with a launch in accordance with one embodiment.
- any pages that were migrated to CPU memory are flushed back to GPU memory, and the CPU's virtual address mappings may be unmapped.
- data′ is flushed back from CPU PA 1570 to GPU PA 1530 and the 1592 previously mapped to B in map 1581 (shown in FIG. 4 ) is unmapped (in FIG. 5 ).
- Data′ may be the same as the data copied or moved to the CPU in FIG. 4 or data′ may be the result of modification of the data by the CPU. After this point, the CPU needs to synchronize on the pending GPU work before it can access the same data from the CPU again.
- the need for h_pointer is eliminated.
- memory otherwise associated with the h_pointer can be freed up for other use as compared to when the h-pointer is included.
- the need for including a specific copy instruction (e.g., cudaMemcpy call, etc.) in the code to copy data from host to device or device to host is eliminated, saving processing resources and time.
- the system automatically takes care of actually copying the data.
- the automated copying can offer subtle benefits. In the past, even if only part of a range needed to be copied, the conventional approaches copied the whole range (e.g., with an unconditional cudaMemcpy call, etc.).
- the copy is done based on accesses.
- a page fault handler e.g., as part of a kernel mode driver, etc.
- the ranges have already been resolved (e.g., with the kernel mode driver, etc.). It sees that the access is directed to a particular managed page and copies the data being accessed without excess data.
- it knows exactly what to copy. It can copy at a smaller granularity based on access (e.g., copies a limited or smaller amount of data associated with an access as opposed conventional approaches that copy a larger amount such as a whole allocation or array, etc.).
- a device variable has the same restrictions as an allocation returned by cudaMalloc. So a device variable cannot be accessed from the CPU.
- a user wishing to access the data from the CPU can use a special API such as cudaMemcpy to copy from the GPU memory to a separate CPU memory location.
- the managed memory space allows use of the keyword managed that can be applied to device variables. For example, one can directly reference a managed device variable in the CPU code without having to worry about copy operations, which are now done automatically for the user. Using managed memory, a user does not have to track or worry as much about coherence and copies between the two different pointers.
- the above code can use a qualified variable rather than a dynamic allocation:
- the CPU access actually copies data over from the GPU.
- CPU code may then modify the contents of this memory in the CPU physical copy.
- the kernel mode driver is first notified that a kernel launch is being performed. The driver examines information about managed memory that has been copied to the CPU physical memory, and copies the contents of certain CPU physical memory back to the GPU physical memory. Then the kernel is launched and the kernel can use the data because it is up to date. In one exemplary implementation, during the kernel launch is when there is a copy back to the GPU and the GPU can use it.
- a cudaDeviceSynchronize call is performed.
- the cudaDeviceSynchronize can be called before accessing data from the CPU again. If a synchronize call is not made the data may not be coherent and this can cause data corruption.
- the data programming model does not allow concurrent access to the data by both the GPU and CPU at the same time and that is why a cudaDeviceSynchronize is included, ensuring work on the GPU which may be accessing the data has completed.
- kernel launches are asynchronous and the only way to know a kernel has completed is by making a synchronize call.
- a device synchronize can be performed which means synchronize the work launched on the device or GPU.
- a subset of GPU work can also be synchronized such as a CUDA stream.
- CUDA stream approaches are set forth in later portions of the detailed description.
- the synchronize is performed before the data can be accessed from the CPU again. If the synchronize is not performed and an attempt to access a managed region from the CPU is made, the page fault handler is aware of the outstanding GPU work and the user process is signaled rather than handle the page fault, as the user code has violated the requirements of the programming model. It is appreciated that disallowing concurrent access is not the only approach to provide coherence.
- a kernel is running and using the data actively when there is an access to the managed data on the CPU. It will create a backup copy of the page contents at the time of the access, and then set up mappings to separate physical copies in both locations so the CPU and GPU code can continue and access the data concurrently. A three-way merge of the three copies is later performed and a new page that contains the merged data from the three pages is created.
- page merging is used and segmentation faults are not issued for concurrent access.
- Computer system 900 includes central processor unit 901 , main memory 902 (e.g., random access memory), chip set 903 with north bridge 909 and south bridge 905 , removable data storage device 904 , input device 907 , signal communications port 908 , and graphics subsystem 910 which is coupled to display 920 .
- Computer system 900 includes several busses for communicatively coupling the components of computer system 900 .
- Communication bus 991 e.g., a front side bus
- Communication bus 992 (e.g., a main memory bus) couples north bridge 909 of chipset 903 to main memory 902 .
- Communication bus 993 (e.g., the Advanced Graphics Port interface) couples north bridge of chipset 903 to graphic subsystem 910 .
- Communication buses 994 , 995 and 997 (e.g., a PCI bus) couple south bridge 905 of chip set 903 to removable data storage device 904 , input device 907 , signal communications port 908 respectively.
- Graphics subsystem 910 includes graphics processor 911 and frame buffer 915 .
- the components of computer system 900 cooperatively operate to provide versatile functionality and performance.
- the components of computer system 900 cooperatively operate to provide predetermined types of functionality.
- Communications bus 991 , 992 , 993 , 994 , 995 , and 997 communicate information.
- Central processor 901 processes information.
- Main memory 902 stores information and instructions for the central processor 901 .
- Removable data storage device 904 also stores information and instructions (e.g., functioning as a large information reservoir).
- Input device 907 provides a mechanism for inputting information and/or for pointing to or highlighting information on display 920 .
- Signal communication port 908 provides a communication interface to exterior devices (e.g., an interface with a network).
- Display device 920 displays information in accordance with data stored in frame buffer 915 .
- Graphics processor 911 processes graphics commands from central processor 901 and provides the resulting data to video buffers 915 for storage and retrieval by display monitor 920 .
- embodiments of the present invention can be compatible and implemented with a variety of different types of tangible memory or storage (e.g., RAM, DRAM, flash, hard drive, CD, DVD, etc.).
- the memory or storage while able to be changed or rewritten, can be considered a non-transitory storage medium.
- a non-transitory storage medium it is not intend to limit characteristics of the medium, and can include a variety of storage mediums (e.g., programmable, erasable, nonprogrammable, read/write, read only, etc.) and “non-transitory” computer-readable media comprises all computer-readable media, with the sole exception being a transitory, propagating signal.
Abstract
Description
- This application is a continuation of and claims the benefit of and priority to:
- non-provisional application Ser. No. 15/709,397 entitled “Unified Memory Systems and Methods” filed Sep. 19, 2017; which in turn claims priority to and benefit of:
- non-provisional application Ser. No. 14/601,223 (Attorney docket NVID-PBG-13-1649-US1.1) entitled “Unified Memory Systems and Methods” filed Jan. 20, 2015; which in turn claims priority to and benefit of:
- provisional application 61/929,496 (Attorney docket NVID-P-SC-1649US0A) entitled “Unified Memory Systems and Methods” filed Jan. 20, 2014;
- provisional application 61/965,089 (Attorney docket NVID-P-SC-1653RUS0) entitled “Unified Memory Systems and Methods” filed Jan. 21, 2014; and
- provisional application 61/929,913 (Attorney docket NVID-P-BG-13-1649US0C) entitled “Inline Parallelism and Re-targetable Parallel Algorithms” filed Jan. 21, 2014; which are all incorporated herein by reference.
- The present invention relates to the field of memory configuration. In one embodiment, virtual addressing associated with unified memory is implemented with pointer coordination across multiple entities.
- Electronic systems and devices facilitate increased productivity and reduced costs in analyzing and communicating various types of data. These electronic systems (e.g., digital computers, calculators, audio devices, video equipment, telephone systems, etc.) typically include various components that need access to memory to implement their desired functionality or operations. Conventional attempts at utilizing virtual addresses and pointers across various components of a system are typically complicated and can have undesirable impacts.
- Many computing systems often have multiple processors (e.g., a central processing unit (CPU), a graphics processing unit (GPU), etc.) and respective memories with their own respective memory management units (MMUs). This potentially leads to a scenario where there are two distinct address spaces, one that is setup by the OS for the CPU and the other that is setup by the GPU driver for the GPU. These are often distinct virtual address (VA) spaces setup by different software components and can potentially lead to pointer collision or overlap. The various conventional approaches that attempt to handle virtual addresses and pointer tracking typically have a number of problems. Some traditional attempts at resolving these issues are directed at having applications try to explicitly track which VA space a pointer belongs to. Some traditional approaches attempt to reserve a large CPU VA chunk from the OS and have the GPU driver allocate only in this VA range. However, this approach often has a number of drawbacks including possible waste of CPU VA space if a large chunk is reserved initially but the actual amount of space that is required or utilized is much less. In some systems (e.g., on 32 bit CPU, etc.) the VA space can be considered relatively small and reserving large chunks of CPU VA space for the GPU can result in lower system utilization and inadequate VA space remaining available for operations of the other components (e.g., CPU, etc.).
- Some programs (e.g., a CUDA program, etc.) often need to maintain two copies of data and need fast access to the data from both the CPU and the GPU. This traditionally puts a significant burden on a developer or user to maintain and keep two pointers. For example, the user or programmer usually has to take explicit actions to ensure both copies of data associated with the pointers or addresses are consistent. This can become a very complicated and extensive task which increases the workload and effort required by a user and in turn can also increase the barrier to entry for novice users of the programs. These added burdens and difficulties increase the likelihood of programming mistakes that adversely impact system performance (e.g., increased faults, non-coherent data, etc.). Traditional approaches can also make widespread adoption of associated components (e.g., CPUs, GPUs, etc.) harder, because it's more difficult to port existing code written for one processor (e.g., the CPU) over to a heterogeneous system that has multiple processors (e.g., both a CPU and a GPU).
- Presented embodiments facilitate efficient and effective utilization of unified virtual addresses across multiple components. In one embodiment, the presented new approach or solution uses Operating System (OS) allocation on the central processing unit (CPU) combined with graphics processing unit (GPU) driver mappings to provide a unified virtual address (VA) across both GPU and CPU. The new approach helps ensure that a GPU VA pointer does not collide with a CPU pointer provided by OS CPU allocation (e.g., like one returned by “malloc” C runtime API, etc.). In one exemplary implementation, an address allocation process comprises: establishing space for managed pointers across a plurality of memories, including allocating one of the managed pointers with a first portion of memory associated with a first one of a plurality of processors; and performing a process of automatically managing accesses to the managed pointers across the plurality of processors and corresponding memories. The automated management can include ensuring consistent information associated with the managed pointers is copied from the first portion of memory to a second portion of memory associated with a second one of the plurality of processors based upon initiation of an access to the managed pointers from the second one of the plurality of processors.
- Establishing space for managed pointers can include reserving a region from the first processor's virtual address space and reserving a region from the second processor's virtual address space, wherein the regions are reserved for allocations of the managed pointers. Data coherency and concurrency across the memories can be automatically maintained. In one embodiment, the managed memory can be part of a unified memory. In one exemplary implementation, the second one of the plurality of processors is a central processing unit (CPU) and the first one of a plurality of processors is a graphics processing unit (GPU). When the CPU attempts to access the pointer, space in the central processing unit physical addresses (CPU PA) is allocated, the portion of the GPU PA is automatically copied to the CPU PA, and the address in the CPU VA is mapped to the allocated CPU PA. The CPU PA is copied to the GPU PA when a kernel utilizing the managed pointers is launched in the GPU.
- In one embodiment, a system comprises: a first processor for processing information and a second processor for processing information, wherein accesses associated with a pointer are automatically managed across multiple memories associated with the first processor and the second processor. Accesses associated with the pointer can be automatically managed back and forth between the first processor and the second processor according to which processor is accessing the pointer. In one embodiment, an API managed memory allocation call triggers the automatic management of the pointer and a driver manages the memories associated with the pointer. The pointer can be accessed and used across multiple different processors. Movement or copying of information between processors can be automated and transparent to the user utilizing a single managed pointer without having to be concerned about concurrency or coherency of data between the different processors or memories. The pointer is accessible from multiple entities.
- In one embodiment, a tangible computer readable medium includes instructions for directing a processor in the implementation of an address allocation process. The address allocation process comprises: allocating a pointer to a first portion of memory associated with a first processor, wherein the pointer is also utilized by a second processor, and managing accesses to the pointer automatically. Managing the accesses includes making sure appropriate consistent information associated with the pointer is copied to a second portion of physical memory associated with the second processor, wherein the copying is done based on attempts to access the information by the second processor. The copying can be done based on accesses. Allocation of managed memory can include utilization of an API. A device variable can have the same restrictions as a returned allocation. In one exemplary implementation, allocation of managed memory includes utilization of a keyword that can be applied to device variables. There can be support for page faults to a pointer associated with accesses by the second processor.
- The accompanying drawings, which are incorporated in and form a part of this specification, are included for exemplary illustration of the principles of the present invention and not intended to limit the present invention to the particular implementations illustrated therein. The drawings are not to scale unless otherwise specifically indicated.
-
FIG. 1 is a flow chart of an exemplary automated unified memory management method in accordance with one embodiment of the present invention. -
FIG. 2 is a block diagram of exemplary memory space reservation n accordance with one embodiment. -
FIG. 3 is a block diagram of exemplary memory spaces associated with an API managed pointer memory allocation call in accordance with one embodiment. -
FIG. 4 is a block diagram of exemplary memory spaces associated with an access call from a different entity in accordance with one embodiment. -
FIG. 5 is a block diagram of exemplary memory space associated with a launch in accordance with one embodiment. -
FIG. 6 is a block diagram of an exemplary computer system, one embodiment of a computer system upon which embodiments of the present invention can be implemented. - Reference will now be made in detail to the preferred embodiments of the invention, examples of which are illustrated in the accompanying drawings. While the invention will be described in conjunction with the preferred embodiments, it will be understood that they are not intended to limit the invention to these embodiments. On the contrary, the invention is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be obvious to one ordinarily skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail as not to unnecessarily obscure aspects of the current invention.
- In one embodiment, an automatically managed unified memory allows an application to use a single pointer to access data associated with the pointer from multiple locations. The “managed” pointer can be accessed or used across multiple different entities (e.g., a kernel, a processor, CPU, GPU, etc.). The single pointer can be associated with automatically managed memory. In one exemplary implementation, managed refers to a memory space that is automatically managed by a driver (e.g., graphics device driver, etc.). In one embodiment, an automatically managed unified memory differs from a conventional unified memory by allowing virtual address spaces associated with different entities (e.g., different processors, GPU, CPU, etc.) to be treated as if it is one memory space. Treating multiple memories as single memory space relieves a user from having to explicitly direct many of the multiple memory management activities. In one exemplary implementation, a unified set of page tables is not necessarily used and there can actually be multiple different sets of page tables. In one embodiment, memory space associated with a pointer is reserved and accesses by multiple entities to the pointer are automatically managed.
- In one embodiment, an automatically managed unified memory creates a management memory space to be used in unified memory. In one exemplary implementation, management memory space is created by allocating unified memory space as managed memory. At times the management memory space can automatically be made local enabling “local” access to associated data. For example, the managed address space can be in memory associated with a host (e.g., CPU) or memory associated with a device (e.g., GPU). Having data be present locally typically increases the performance of those accesses, as opposed to using remote memory access (e.g., over PCI, etc.). The automated management of the memory spaces enables the system to take care of putting the data where it is necessary or appropriate based on accesses. In one embodiment, a page fault handler manages migration of pages belonging to the managed memory allocations, migrating them back and forth between CPU memory and GPU memory. Consistency is also automatically maintained across multiple memories (e.g., latest concurrence, etc.). Normally the address range representing an allocation for managed memory is not mapped in the CPU's virtual address space. In one exemplary implementation, upon CPU access of unified memory data, the page fault handler copies the appropriate page from GPU memory to CPU memory and maps it into the CPU's virtual address space, allowing the CPU to access the data.
- The managing can include various activities. In one exemplary implementation, when a GPU is accessing a pointer the automated management makes sure the appropriate consistent information or data associated with the pointer is put on or moved to the GPU, and when the CPU is accessing the pointer the automated management makes sure the appropriate consistent information or data associated with the pointer is put on or moved to the CPU. The movement or copying of information between the processors can be automated and transparent to the user by utilizing the single “managed” pointer. In one embodiment, a user or programmer can utilize the single managed pointer without having to be concerned about the concurrency or coherency of data between the different processors or memories (e.g., CPU, GPU, etc.). Thus, the automatic managed memory approach can enable a CPU access to GPU data. CPU page faults to the same location or single pointer in the unified memory can also be automatically handled, even though there may be two distinct or discrete physical memories (e.g., the CPU and the GPU, etc.).
-
FIG. 1 is a flow chart of an exemplary automated unified memory management method in accordance with one embodiment of the present invention. - In
block 1410, space for managed pointers is established across a plurality of memories. In one embodiment, one of the managed pointers is allocated to a first portion of memory associated with a first one of a plurality of processors. In one embodiment, the managed memory is part of a unified memory. In one exemplary implementation, establishing a managed memory includes a processor reserving one or more regions from the processors' virtual address space. In one exemplary implementation, a GPU physical address (GPU PA) can be mapped to an allocated central processing address (CPU VA). - In
block 1420, accesses associated with the single pointer are automatically managed across a plurality of processors and memories. In one embodiment, the automatic management includes ensuring consistent information associated with the managed pointers is copied from the first portion of memory to a second portion of memory associated with a second one of the plurality of processors based upon initiation of an accesses to the managed pointers from the second one of the plurality of processors. In one exemplary implementation, when a CPU attempts to access the pointer, physical space in the CPU PA is allocated, the portion of the GPU PA is automatically copied to the CPU PA, and the address in the CPU VA is mapped to the newly allocated CPU physical memory. - In one embodiment, a novel API managed memory allocation call triggers an automated unified memory management method. The API managed memory allocation call can instruct a driver (e.g., GPU driver, etc.) to automatically manage the memory. In one exemplary implementation, the novel API call includes a GPU cudaMallocManaged call. In one embodiment, a cudaMallocManaged call returns pointers within a reserved VA range associated with managed memory. Reserving a certain VA range for use by a pointer in multiple VA spaces ensures the pointer can be used in multiple VA spaces (e.g., CPU and GPU memory spaces, etc.).
FIGS. 2 through 5 are block diagrams of exemplary memory spaces associated with an automated unified memory management process in accordance with one embodiment. - In one embodiment, regions from a GPU's virtual address space are reserved and a similar set of regions are also reserved in a CPUs virtual address space.
FIG. 2 is a block diagram of exemplary memory space reservation n accordance with one embodiment. Managed memory chunks or addresses 1511 inGPU VA 1510 and corresponding managed memory chunks or addresses 1591 inCPU VA 1590 are reserved. In one embodiment, the reserved managed memory chunks or addresses 1511 and 1591 are the same size. A pointer managed by a particular driver can be used and accessed by multiple processors because accesses to the reserved managed memory space by other “non-managed” pointers (e.g., pointers not managed by the particular device) is prevented. In one exemplary implementation, if code includes an allocation call associated with a non-managed pointer, (e.g., if GPU code calls cudaMalloc, if CPU code calls Malloc, etc.) the system will use or allocate a part of the VA space that has not been reserved for managed pointer memory (e.g., an address outside the reserved range is returned for allocation to the “non-managed” pointer, etc.). - The reservation can be initiated by a GPU driver. The driver provides an opt-in allocator to the application to allocate out of these regions. In one embodiment, when initializing a CUDA driver the processor examines how much memory is in the system between the CPU and GPUs and a large enough total of managed memory is reserved. In one exemplary implementation, a matching range is reserved in the VA space of multiple GPUs.
- In one embodiment, the reserved VA ranges do not initially map to any physical memory. Normally, the address range representing an allocation is not initially mapped in the GPU's or CPU's virtual address space. The physical pages backing the VA allocations are created or mapped in GPU and CPU memory.
-
FIG. 3 is a block diagram of exemplary memory spaces associated with an API managed pointer memory allocation call in accordance with one embodiment. In one embodiment, when an API managed pointer memory allocation call is encountered, addresses or locations within the reserved VA range are returned and achunk 1522 from the reservedrange 1511 in theGPU VA space 1510 is allocated to the managed pointer Ptr. Pages or addresses “A” from theGPU PA 1530 are allocated and mapped toGPU VA 1522 in GPU page table 1520map entry 1521. A GPU kernel mode driver is also notified of the new managed allocation. Now the GPU side mapping is set up and a GPU kernel which accesses the allocation can use the physical memory mapped under it. -
FIG. 4 is a block diagram of exemplary memory spaces associated with an access call from a different entity in accordance with one embodiment. When there is an access to the same pointer Ptr from the CPU, initially there is not a CPU virtual address that maps to the pointer and a page fault is triggered. The kernel mode driver which was previously notified of the allocation handles the page fault. A physical page or address “B” is allocated from theCPU PA 1570. The driver copies the data contents of the corresponding GPU physical page or address “A” into the CPU physical page or address “B”. The CPU virtual page oraddress 1592 is mapped to the physical page “B” by themapping 1581 in the CPU page table 1580. Control returns to the user code on the CPU which triggered the fault. Thevirtual address 1592 is now valid, and the access which faulted is retried and operations are directed to the CPU physical memory page or address “B”. - If a later access from the CPU code happens to be in the same page, there is no fault because the page has already been paged in and it will be a relatively fast access. But if a later access crosses a page boundary, a new fault occurs. If a fault occurs within the reserved VA range but the address requested is not inside any allocation the kernel mode driver has been notified about, the fault is not handled and the user process receives a signal for the invalid access.
-
FIG. 5 is a block diagram of exemplary memory space associated with a launch in accordance with one embodiment. When work is launched on the GPU, any pages that were migrated to CPU memory are flushed back to GPU memory, and the CPU's virtual address mappings may be unmapped. In one exemplary implementation, data′ is flushed back fromCPU PA 1570 toGPU PA 1530 and the 1592 previously mapped to B in map 1581 (shown inFIG. 4 ) is unmapped (inFIG. 5 ). Data′ may be the same as the data copied or moved to the CPU inFIG. 4 or data′ may be the result of modification of the data by the CPU. After this point, the CPU needs to synchronize on the pending GPU work before it can access the same data from the CPU again. Otherwise the application could be accessing the same data from both the CPU and the GPU, violating the programming model and possibly resulting in data corruption. One way the page fault handler can prevent such coherency violations is by throwing a segmentation fault on CPU access to data that is potentially being used by the GPU. However, the programming model doesn't require this, and this is meant as a convenience to the developer to know when a concurrency violation occurred. There are other ways in which coherency violations can be prevented that may be part of the driver implementation. - The following is one exemplary utilization of two pointers and an explicit copy instruction:
-
global_k(int *ptr){ //use ptr } void ( ){ int *d_ptr, *h_ptr; size_t size=100; cudaMalloc (& d_ptr, size); k<<<1,1>>>(d_ptr); h_ptr=malloc(size); cudaMemcpy (h_ptr, d_ptr, size); //verify h_ptr on CPU printf(“%d\n”, h_ptr[0]); } - In one embodiment of a single pointer approach, the need for h_pointer is eliminated. In one exemplary implementation, memory otherwise associated with the h_pointer can be freed up for other use as compared to when the h-pointer is included. The need for including a specific copy instruction (e.g., cudaMemcpy call, etc.) in the code to copy data from host to device or device to host is eliminated, saving processing resources and time. The system automatically takes care of actually copying the data. The automated copying can offer subtle benefits. In the past, even if only part of a range needed to be copied, the conventional approaches copied the whole range (e.g., with an unconditional cudaMemcpy call, etc.). In contrast, in one embodiment of a single pointer automated managed memory approach the copy is done based on accesses. In one exemplary implementation, when the CPU accesses a pointer there is actually a page fault handler (e.g., as part of a kernel mode driver, etc.) and the ranges have already been resolved (e.g., with the kernel mode driver, etc.). It sees that the access is directed to a particular managed page and copies the data being accessed without excess data. In one embodiment, it knows exactly what to copy. It can copy at a smaller granularity based on access (e.g., copies a limited or smaller amount of data associated with an access as opposed conventional approaches that copy a larger amount such as a whole allocation or array, etc.).
- There are a variety of ways to create or allocate managed memory. One way is through an API call. Another way is an added keyword managed that can be applied to device variables. It can be part of the language itself. Prior to the novel managed API, users could only declare device variables here. In one embodiment, a device variable has the same restrictions as an allocation returned by cudaMalloc. So a device variable cannot be accessed from the CPU. A user wishing to access the data from the CPU can use a special API such as cudaMemcpy to copy from the GPU memory to a separate CPU memory location. The managed memory space allows use of the keyword managed that can be applied to device variables. For example, one can directly reference a managed device variable in the CPU code without having to worry about copy operations, which are now done automatically for the user. Using managed memory, a user does not have to track or worry as much about coherence and copies between the two different pointers.
- The following is one exemplary utilization of a single unified pointer:
-
global_k (int*ptr) { //use ptr } void main( ) { int *ptr; size_t size =100; cudaMallocManaged (&ptr, size); k<<<1,1>>>(ptr); cudaDeviceSynchronize ( ); printf (“%d\n”, ptr[0]); } - Alternatively, the above code can use a qualified variable rather than a dynamic allocation:
-
_device_ _managed_int foo[100]; global_k ( ){ //use foo } void main( ) { k<<<1,1>>>( ); cudaDeviceSynchronize ( ); printf (“%d\n”, foo[0]); } - The described approach significantly reduces the barrier to entry for novice users. It also makes porting of code and the use of GPUs easier.
- In one embodiment, on a CPU access the CPU access actually copies data over from the GPU. CPU code may then modify the contents of this memory in the CPU physical copy. When doing a kernel launch, the kernel mode driver is first notified that a kernel launch is being performed. The driver examines information about managed memory that has been copied to the CPU physical memory, and copies the contents of certain CPU physical memory back to the GPU physical memory. Then the kernel is launched and the kernel can use the data because it is up to date. In one exemplary implementation, during the kernel launch is when there is a copy back to the GPU and the GPU can use it.
- In one embodiment, a cudaDeviceSynchronize call is performed. The cudaDeviceSynchronize can be called before accessing data from the CPU again. If a synchronize call is not made the data may not be coherent and this can cause data corruption. In one exemplary implementation, the data programming model does not allow concurrent access to the data by both the GPU and CPU at the same time and that is why a cudaDeviceSynchronize is included, ensuring work on the GPU which may be accessing the data has completed. In one exemplary implementation, kernel launches are asynchronous and the only way to know a kernel has completed is by making a synchronize call.
- There are various ways to synchronize. A device synchronize can be performed which means synchronize the work launched on the device or GPU. A subset of GPU work can also be synchronized such as a CUDA stream.
- Additional explanation of CUDA stream approaches is set forth in later portions of the detailed description. The synchronize is performed before the data can be accessed from the CPU again. If the synchronize is not performed and an attempt to access a managed region from the CPU is made, the page fault handler is aware of the outstanding GPU work and the user process is signaled rather than handle the page fault, as the user code has violated the requirements of the programming model. It is appreciated that disallowing concurrent access is not the only approach to provide coherence.
- Another way to provide coherence is utilizing page merging. In one embodiment, a kernel is running and using the data actively when there is an access to the managed data on the CPU. It will create a backup copy of the page contents at the time of the access, and then set up mappings to separate physical copies in both locations so the CPU and GPU code can continue and access the data concurrently. A three-way merge of the three copies is later performed and a new page that contains the merged data from the three pages is created. In one exemplary implementation, page merging is used and segmentation faults are not issued for concurrent access.
- With reference to
FIG. 6 , a block diagram of anexemplary computer system 900 is shown, one embodiment of a computer system upon which embodiments of the present invention can be implemented.Computer system 900 includescentral processor unit 901, main memory 902 (e.g., random access memory), chip set 903 withnorth bridge 909 andsouth bridge 905, removabledata storage device 904, input device 907,signal communications port 908, and graphics subsystem 910 which is coupled todisplay 920.Computer system 900 includes several busses for communicatively coupling the components ofcomputer system 900. Communication bus 991 (e.g., a front side bus) couplesnorth bridge 909 ofchipset 903 tocentral processor unit 901. Communication bus 992 (e.g., a main memory bus) couplesnorth bridge 909 ofchipset 903 tomain memory 902. Communication bus 993 (e.g., the Advanced Graphics Port interface) couples north bridge ofchipset 903 tographic subsystem 910.Communication buses data storage device 904, input device 907,signal communications port 908 respectively. Graphics subsystem 910 includesgraphics processor 911 andframe buffer 915. - The components of
computer system 900 cooperatively operate to provide versatile functionality and performance. In one exemplary implementation, the components ofcomputer system 900 cooperatively operate to provide predetermined types of functionality.Communications bus Central processor 901 processes information.Main memory 902 stores information and instructions for thecentral processor 901. Removabledata storage device 904 also stores information and instructions (e.g., functioning as a large information reservoir). Input device 907 provides a mechanism for inputting information and/or for pointing to or highlighting information ondisplay 920.Signal communication port 908 provides a communication interface to exterior devices (e.g., an interface with a network).Display device 920 displays information in accordance with data stored inframe buffer 915.Graphics processor 911 processes graphics commands fromcentral processor 901 and provides the resulting data tovideo buffers 915 for storage and retrieval bydisplay monitor 920. - Some portions of the detailed descriptions are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means generally used by those skilled in data processing arts to effectively convey the substance of their work to others skilled in the art. A procedure, logic block, process, etc., is here, and generally, conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps include physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical, magnetic, optical, or quantum signals capable of being stored, transferred, combined, compared, and otherwise manipulated in a computer system. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
- It should be borne in mind, however, that all of these and similar terms are associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present application, discussions utilizing terms such as “processing”, “computing”, “calculating”, “determining”, “displaying” or the like, refer to the action and processes of a computer system, or similar processing device (e.g., an electrical, optical, or quantum, computing device), that manipulates and transforms data represented as physical (e.g., electronic) quantities. The terms refer to actions and processes of the processing devices that manipulate or transform physical quantities within a computer system's component (e.g., registers, memories, other such information storage, transmission or display devices, etc.) into other data similarly represented as physical quantities within other components.
- It is appreciated that embodiments of the present invention can be compatible and implemented with a variety of different types of tangible memory or storage (e.g., RAM, DRAM, flash, hard drive, CD, DVD, etc.). The memory or storage, while able to be changed or rewritten, can be considered a non-transitory storage medium. By indicating a non-transitory storage medium it is not intend to limit characteristics of the medium, and can include a variety of storage mediums (e.g., programmable, erasable, nonprogrammable, read/write, read only, etc.) and “non-transitory” computer-readable media comprises all computer-readable media, with the sole exception being a transitory, propagating signal.
- It is appreciated that the following is a listing of exemplary concepts or embodiments associated with the novel approach. It is also appreciated that the listing is not exhaustive and does not necessarily include all possible implementation. The following concepts and embodiments can be implemented in hardware. In one embodiment, the following methods or process describe operations performed by various processing components or units. In one exemplary implementation, instructions or directions associated with the methods, processes, operations etc. can be stored in a memory and cause a processor to implement the operations, functions, actions, etc.
- The foregoing descriptions of specific embodiments of the present invention have been presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed, and obviously many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the Claims appended hereto and their equivalents. The listing of steps within method claims do not imply any particular order to performing the steps, unless explicitly stated in the claim.
Claims (20)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/237,010 US10762593B2 (en) | 2014-01-20 | 2018-12-31 | Unified memory systems and methods |
US16/919,954 US20200364821A1 (en) | 2014-01-20 | 2020-07-02 | Unified memory systems and methods |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201461929496P | 2014-01-20 | 2014-01-20 | |
US201461965089P | 2014-01-21 | 2014-01-21 | |
US201461929913P | 2014-01-21 | 2014-01-21 | |
US14/601,223 US10319060B2 (en) | 2014-01-20 | 2015-01-20 | Unified memory systems and methods |
US15/709,397 US10546361B2 (en) | 2014-01-20 | 2017-09-19 | Unified memory systems and methods |
US16/237,010 US10762593B2 (en) | 2014-01-20 | 2018-12-31 | Unified memory systems and methods |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/709,397 Continuation US10546361B2 (en) | 2014-01-20 | 2017-09-19 | Unified memory systems and methods |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/919,954 Continuation US20200364821A1 (en) | 2014-01-20 | 2020-07-02 | Unified memory systems and methods |
Publications (3)
Publication Number | Publication Date |
---|---|
US20190147561A1 US20190147561A1 (en) | 2019-05-16 |
US20200265543A9 true US20200265543A9 (en) | 2020-08-20 |
US10762593B2 US10762593B2 (en) | 2020-09-01 |
Family
ID=53543561
Family Applications (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/481,802 Active US9886736B2 (en) | 2014-01-20 | 2014-09-09 | Selectively killing trapped multi-process service clients sharing the same hardware context |
US14/601,223 Active US10319060B2 (en) | 2014-01-20 | 2015-01-20 | Unified memory systems and methods |
US15/709,397 Active - Reinstated US10546361B2 (en) | 2014-01-20 | 2017-09-19 | Unified memory systems and methods |
US16/237,010 Active US10762593B2 (en) | 2014-01-20 | 2018-12-31 | Unified memory systems and methods |
US16/408,173 Active US11893653B2 (en) | 2014-01-20 | 2019-05-09 | Unified memory systems and methods |
US16/919,954 Pending US20200364821A1 (en) | 2014-01-20 | 2020-07-02 | Unified memory systems and methods |
Family Applications Before (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/481,802 Active US9886736B2 (en) | 2014-01-20 | 2014-09-09 | Selectively killing trapped multi-process service clients sharing the same hardware context |
US14/601,223 Active US10319060B2 (en) | 2014-01-20 | 2015-01-20 | Unified memory systems and methods |
US15/709,397 Active - Reinstated US10546361B2 (en) | 2014-01-20 | 2017-09-19 | Unified memory systems and methods |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/408,173 Active US11893653B2 (en) | 2014-01-20 | 2019-05-09 | Unified memory systems and methods |
US16/919,954 Pending US20200364821A1 (en) | 2014-01-20 | 2020-07-02 | Unified memory systems and methods |
Country Status (3)
Country | Link |
---|---|
US (6) | US9886736B2 (en) |
DE (1) | DE112015000430T5 (en) |
WO (2) | WO2015108708A2 (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9075647B2 (en) * | 2011-08-01 | 2015-07-07 | International Business Machines Corporation | Preemptive guest merging for virtualization hypervisors |
US9619364B2 (en) | 2013-03-14 | 2017-04-11 | Nvidia Corporation | Grouping and analysis of data access hazard reports |
US9886736B2 (en) | 2014-01-20 | 2018-02-06 | Nvidia Corporation | Selectively killing trapped multi-process service clients sharing the same hardware context |
US10152312B2 (en) | 2014-01-21 | 2018-12-11 | Nvidia Corporation | Dynamic compiler parallelism techniques |
WO2015130282A1 (en) * | 2014-02-27 | 2015-09-03 | Hewlett-Packard Development Company, L. P. | Communication between integrated graphics processing units |
US10120832B2 (en) * | 2014-05-27 | 2018-11-06 | Mellanox Technologies, Ltd. | Direct access to local memory in a PCI-E device |
WO2016205976A1 (en) * | 2015-06-26 | 2016-12-29 | Intel Corporation | Apparatus and method for efficient communication between virtual machines |
KR102491622B1 (en) * | 2015-11-17 | 2023-01-25 | 삼성전자주식회사 | Method for operating virtual address generator and method for operating system having the same |
US10733695B2 (en) * | 2016-09-16 | 2020-08-04 | Intel Corporation | Priming hierarchical depth logic within a graphics processor |
US11150943B2 (en) * | 2017-04-10 | 2021-10-19 | Intel Corporation | Enabling a single context hardware system to operate as a multi-context system |
US10489881B2 (en) * | 2017-06-30 | 2019-11-26 | H3 Platform Inc. | Direct memory access for co-processor memory |
US10922203B1 (en) * | 2018-09-21 | 2021-02-16 | Nvidia Corporation | Fault injection architecture for resilient GPU computing |
US11038749B2 (en) * | 2018-12-24 | 2021-06-15 | Intel Corporation | Memory resource allocation in an end-point device |
US11232533B2 (en) * | 2019-03-15 | 2022-01-25 | Intel Corporation | Memory prefetching in multiple GPU environment |
WO2020237460A1 (en) * | 2019-05-27 | 2020-12-03 | 华为技术有限公司 | Graphics processing method and apparatus |
US11321068B2 (en) | 2019-09-05 | 2022-05-03 | International Business Machines Corporation | Utilizing memory coherency to improve bandwidth performance |
US11593157B2 (en) | 2020-02-05 | 2023-02-28 | Nec Corporation | Full asynchronous execution queue for accelerator hardware |
EP3862874B1 (en) * | 2020-02-05 | 2023-08-16 | Nec Corporation | Full asynchronous execution queue for accelerator hardware |
US11080111B1 (en) | 2020-02-24 | 2021-08-03 | Nvidia Corporation | Technique for sharing context among multiple threads |
EP3961469A1 (en) * | 2020-08-27 | 2022-03-02 | Siemens Industry Software and Services B.V. | Computing platform for simulating an industrial system and method of managing the simulation |
Family Cites Families (82)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU3999595A (en) | 1994-10-06 | 1996-05-02 | Vrc, Inc. | Shared memory system |
US6324683B1 (en) | 1996-02-23 | 2001-11-27 | International Business Machines Corporation | System, method and program for debugging external programs in client/server-based relational database management systems |
US5970241A (en) | 1997-11-19 | 1999-10-19 | Texas Instruments Incorporated | Maintaining synchronism between a processor pipeline and subsystem pipelines during debugging of a data processing system |
WO1999059069A1 (en) | 1998-05-07 | 1999-11-18 | Infineon Technologies Ag | Cache memory for two-dimensional data fields |
US6343371B1 (en) | 1999-01-14 | 2002-01-29 | Compaq Computer Corporation | System and method for statically detecting potential race conditions in multi-threaded computer programs |
US7617240B2 (en) | 1999-05-04 | 2009-11-10 | Accenture Llp | Component based task handling during claim processing |
US6874039B2 (en) | 2000-09-08 | 2005-03-29 | Intel Corporation | Method and apparatus for distributed direct memory access for systems on chip |
US6851075B2 (en) | 2002-01-04 | 2005-02-01 | International Business Machines Corporation | Race detection for parallel software |
US6891543B2 (en) | 2002-05-08 | 2005-05-10 | Intel Corporation | Method and system for optimally sharing memory between a host processor and graphics processor |
US7516446B2 (en) | 2002-06-25 | 2009-04-07 | International Business Machines Corporation | Method and apparatus for efficient and precise datarace detection for multithreaded object-oriented programs |
US6947051B2 (en) * | 2003-02-18 | 2005-09-20 | Microsoft Corporation | Video memory management |
US20050015752A1 (en) | 2003-07-15 | 2005-01-20 | International Business Machines Corporation | Static analysis based error reduction for software applications |
US7065630B1 (en) | 2003-08-27 | 2006-06-20 | Nvidia Corporation | Dynamically creating or removing a physical-to-virtual address mapping in a memory of a peripheral device |
US7549150B2 (en) | 2004-03-24 | 2009-06-16 | Microsoft Corporation | Method and system for detecting potential races in multithreaded programs |
US7206915B2 (en) | 2004-06-03 | 2007-04-17 | Emc Corp | Virtual space manager for computer having a physical address extension feature |
US7366956B2 (en) | 2004-06-16 | 2008-04-29 | Hewlett-Packard Development Company, L.P. | Detecting data races in multithreaded computer programs |
US7757237B2 (en) | 2004-06-16 | 2010-07-13 | Hewlett-Packard Development Company, L.P. | Synchronization of threads in a multithreaded computer program |
US20060218553A1 (en) | 2005-03-25 | 2006-09-28 | Dolphin Software, Inc. | Potentially hazardous material request and approve methods and apparatuses |
US7661097B2 (en) | 2005-04-05 | 2010-02-09 | Cisco Technology, Inc. | Method and system for analyzing source code |
US7743233B2 (en) * | 2005-04-05 | 2010-06-22 | Intel Corporation | Sequencer address management |
US7764289B2 (en) | 2005-04-22 | 2010-07-27 | Apple Inc. | Methods and systems for processing objects in memory |
US7805708B2 (en) | 2005-05-13 | 2010-09-28 | Texas Instruments Incorporated | Automatic tool to eliminate conflict cache misses |
US7663635B2 (en) * | 2005-05-27 | 2010-02-16 | Ati Technologies, Inc. | Multiple video processor unit (VPU) memory mapping |
US7784035B2 (en) | 2005-07-05 | 2010-08-24 | Nec Laboratories America, Inc. | Method for the static analysis of concurrent multi-threaded software |
US7584332B2 (en) | 2006-02-17 | 2009-09-01 | University Of Notre Dame Du Lac | Computer systems with lightweight multi-threaded architectures |
JP5045666B2 (en) | 2006-02-20 | 2012-10-10 | 富士通株式会社 | Program analysis method, program analysis apparatus, and program analysis program |
US8028133B2 (en) | 2006-02-22 | 2011-09-27 | Oracle America, Inc. | Globally incremented variable or clock based methods and apparatus to implement parallel transactions |
US7752605B2 (en) | 2006-04-12 | 2010-07-06 | Microsoft Corporation | Precise data-race detection using locksets |
US7673181B1 (en) | 2006-06-07 | 2010-03-02 | Replay Solutions, Inc. | Detecting race conditions in computer programs |
US7814486B2 (en) | 2006-06-20 | 2010-10-12 | Google Inc. | Multi-thread runtime system |
US8108844B2 (en) | 2006-06-20 | 2012-01-31 | Google Inc. | Systems and methods for dynamically choosing a processing element for a compute kernel |
US8375368B2 (en) | 2006-06-20 | 2013-02-12 | Google Inc. | Systems and methods for profiling an application running on a parallel-processing computer system |
US8136102B2 (en) | 2006-06-20 | 2012-03-13 | Google Inc. | Systems and methods for compiling an application for a parallel-processing computer system |
US8146066B2 (en) | 2006-06-20 | 2012-03-27 | Google Inc. | Systems and methods for caching compute kernels for an application running on a parallel-processing computer system |
DE102006032832A1 (en) | 2006-07-14 | 2008-01-17 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Network system and method for controlling distributed memory |
US20080028181A1 (en) | 2006-07-31 | 2008-01-31 | Nvidia Corporation | Dedicated mechanism for page mapping in a gpu |
US20080109795A1 (en) | 2006-11-02 | 2008-05-08 | Nvidia Corporation | C/c++ language extensions for general-purpose graphics processing unit |
US7992146B2 (en) | 2006-11-22 | 2011-08-02 | International Business Machines Corporation | Method for detecting race conditions involving heap memory access |
US8860741B1 (en) * | 2006-12-08 | 2014-10-14 | Nvidia Corporation | Graphics processor with memory management unit and cache coherent link |
GB2459409A (en) | 2007-01-24 | 2009-10-28 | Inventanet Ltd | Method and system for searching for patterns in data |
US8286196B2 (en) | 2007-05-03 | 2012-10-09 | Apple Inc. | Parallel runtime execution on multiple processors |
US8095750B2 (en) | 2007-05-14 | 2012-01-10 | International Business Machines Corporation | Transactional memory system with fast processing of common conflicts |
US8321637B2 (en) | 2007-05-14 | 2012-11-27 | International Business Machines Corporation | Computing system with optimized support for transactional memory |
US8117403B2 (en) | 2007-05-14 | 2012-02-14 | International Business Machines Corporation | Transactional memory system which employs thread assists using address history tables |
US8839218B2 (en) | 2007-06-04 | 2014-09-16 | International Business Machines Corporation | Diagnosing alias violations in memory access commands in source code |
US8452541B2 (en) | 2007-06-18 | 2013-05-28 | Microsoft Corporation | Vaccine design methodology |
JP4937022B2 (en) | 2007-07-13 | 2012-05-23 | 株式会社東芝 | Order relation analysis apparatus, method and program |
US8296743B2 (en) | 2007-12-17 | 2012-10-23 | Intel Corporation | Compiler and runtime for heterogeneous multiprocessor systems |
US8397241B2 (en) | 2008-11-13 | 2013-03-12 | Intel Corporation | Language level support for shared virtual memory |
US20100153934A1 (en) | 2008-12-12 | 2010-06-17 | Peter Lachner | Prefetch for systems with heterogeneous architectures |
US20100156888A1 (en) | 2008-12-23 | 2010-06-24 | Intel Corporation | Adaptive mapping for heterogeneous processing systems |
US8392925B2 (en) | 2009-03-26 | 2013-03-05 | Apple Inc. | Synchronization mechanisms based on counters |
US9547535B1 (en) * | 2009-04-30 | 2017-01-17 | Nvidia Corporation | Method and system for providing shared memory access to graphics processing unit processes |
US8553040B2 (en) | 2009-06-30 | 2013-10-08 | Apple Inc. | Fingerprinting of fragment shaders and use of same to perform shader concatenation |
US8522000B2 (en) * | 2009-09-29 | 2013-08-27 | Nvidia Corporation | Trap handler architecture for a parallel processing unit |
CN102741828B (en) | 2009-10-30 | 2015-12-09 | 英特尔公司 | To the two-way communication support of the heterogeneous processor of computer platform |
US9098616B2 (en) | 2009-12-11 | 2015-08-04 | International Business Machines Corporation | Analyzing computer programs to identify errors |
US8719543B2 (en) | 2009-12-29 | 2014-05-06 | Advanced Micro Devices, Inc. | Systems and methods implementing non-shared page tables for sharing memory resources managed by a main operating system with accelerator devices |
US8769499B2 (en) | 2010-01-06 | 2014-07-01 | Nec Laboratories America, Inc. | Universal causality graphs for bug detection in concurrent programs |
US8364909B2 (en) | 2010-01-25 | 2013-01-29 | Hewlett-Packard Development Company, L.P. | Determining a conflict in accessing shared resources using a reduced number of cycles |
CN102262557B (en) * | 2010-05-25 | 2015-01-21 | 运软网络科技(上海)有限公司 | Method for constructing virtual machine monitor by bus architecture and performance service framework |
US8756590B2 (en) | 2010-06-22 | 2014-06-17 | Microsoft Corporation | Binding data parallel device source code |
US8639889B2 (en) | 2011-01-31 | 2014-01-28 | International Business Machines Corporation | Address-based hazard resolution for managing read/write operations in a memory cache |
US8566537B2 (en) | 2011-03-29 | 2013-10-22 | Intel Corporation | Method and apparatus to facilitate shared pointers in a heterogeneous platform |
US8972694B1 (en) * | 2012-03-26 | 2015-03-03 | Emc Corporation | Dynamic storage allocation with virtually provisioned devices |
US8789026B2 (en) | 2011-08-02 | 2014-07-22 | International Business Machines Corporation | Technique for compiling and running high-level programs on heterogeneous computers |
US20130086564A1 (en) | 2011-08-26 | 2013-04-04 | Cognitive Electronics, Inc. | Methods and systems for optimizing execution of a program in an environment having simultaneously parallel and serial processing capability |
US8719464B2 (en) * | 2011-11-30 | 2014-05-06 | Advanced Micro Device, Inc. | Efficient memory and resource management |
US9116809B2 (en) | 2012-03-29 | 2015-08-25 | Ati Technologies Ulc | Memory heaps in a memory model for a unified computing system |
US20130262736A1 (en) * | 2012-03-30 | 2013-10-03 | Ati Technologies Ulc | Memory types for caching policies |
US9038080B2 (en) | 2012-05-09 | 2015-05-19 | Nvidia Corporation | Method and system for heterogeneous filtering framework for shared memory data access hazard reports |
US20130304996A1 (en) | 2012-05-09 | 2013-11-14 | Nvidia Corporation | Method and system for run time detection of shared memory data access hazards |
US9378572B2 (en) | 2012-08-17 | 2016-06-28 | Intel Corporation | Shared virtual memory |
US9116738B2 (en) | 2012-11-13 | 2015-08-25 | International Business Machines Corporation | Method and apparatus for efficient execution of concurrent processes on a multithreaded message passing system |
US9582848B2 (en) | 2012-12-28 | 2017-02-28 | Apple Inc. | Sprite Graphics rendering system |
US8931108B2 (en) * | 2013-02-18 | 2015-01-06 | Qualcomm Incorporated | Hardware enforced content protection for graphics processing units |
US9619364B2 (en) | 2013-03-14 | 2017-04-11 | Nvidia Corporation | Grouping and analysis of data access hazard reports |
US9886736B2 (en) | 2014-01-20 | 2018-02-06 | Nvidia Corporation | Selectively killing trapped multi-process service clients sharing the same hardware context |
US10152312B2 (en) | 2014-01-21 | 2018-12-11 | Nvidia Corporation | Dynamic compiler parallelism techniques |
US9471289B2 (en) * | 2014-03-25 | 2016-10-18 | Nec Corporation | Compiler optimization for many integrated core processors |
US9563571B2 (en) | 2014-04-25 | 2017-02-07 | Apple Inc. | Intelligent GPU memory pre-fetching and GPU translation lookaside buffer management |
US20160188251A1 (en) | 2014-07-15 | 2016-06-30 | Nvidia Corporation | Techniques for Creating a Notion of Privileged Data Access in a Unified Virtual Memory System |
-
2014
- 2014-09-09 US US14/481,802 patent/US9886736B2/en active Active
-
2015
- 2015-01-20 US US14/601,223 patent/US10319060B2/en active Active
- 2015-01-20 DE DE112015000430.0T patent/DE112015000430T5/en active Pending
- 2015-01-20 WO PCT/US2015/000014 patent/WO2015108708A2/en active Application Filing
- 2015-01-21 WO PCT/US2015/012109 patent/WO2015109338A1/en active Application Filing
-
2017
- 2017-09-19 US US15/709,397 patent/US10546361B2/en active Active - Reinstated
-
2018
- 2018-12-31 US US16/237,010 patent/US10762593B2/en active Active
-
2019
- 2019-05-09 US US16/408,173 patent/US11893653B2/en active Active
-
2020
- 2020-07-02 US US16/919,954 patent/US20200364821A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
US20150206277A1 (en) | 2015-07-23 |
US20190147561A1 (en) | 2019-05-16 |
WO2015109338A1 (en) | 2015-07-23 |
US10762593B2 (en) | 2020-09-01 |
DE112015000430T5 (en) | 2016-10-06 |
US20180018750A1 (en) | 2018-01-18 |
US20200364821A1 (en) | 2020-11-19 |
US10546361B2 (en) | 2020-01-28 |
US10319060B2 (en) | 2019-06-11 |
US20190266695A1 (en) | 2019-08-29 |
US20150206272A1 (en) | 2015-07-23 |
WO2015108708A2 (en) | 2015-07-23 |
US11893653B2 (en) | 2024-02-06 |
WO2015108708A3 (en) | 2015-10-08 |
US9886736B2 (en) | 2018-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10762593B2 (en) | Unified memory systems and methods | |
US10157146B2 (en) | Local access DMA with shared memory pool | |
US9529611B2 (en) | Cooperative memory resource management via application-level balloon | |
US9286101B2 (en) | Free page hinting | |
US9280486B2 (en) | Managing memory pages based on free page hints | |
US9367478B2 (en) | Controlling direct memory access page mappings | |
US9875132B2 (en) | Input output memory management unit based zero copy virtual machine to virtual machine communication | |
US10599565B2 (en) | Hypervisor managing memory addressed above four gigabytes | |
US8458434B2 (en) | Unified virtual contiguous memory manager | |
KR20130079865A (en) | Shared virtual memory management apparatus for securing cache-coherent | |
US20210182191A1 (en) | Free memory page hinting by virtual machines | |
US11960410B2 (en) | Unified kernel virtual address space for heterogeneous computing | |
US9251100B2 (en) | Bitmap locking using a nodal lock | |
US20240119006A1 (en) | Dual personality memory for autonomous multi-tenant cloud environment | |
TW201317781A (en) | Method for sharing memory of virtual machine and computer system using the same | |
CN115756742A (en) | Performance optimization design method, system, medium and device for direct I/O virtualization | |
US20130262790A1 (en) | Method, computer program and device for managing memory access in a multiprocessor architecture of numa type | |
CN115221073A (en) | Memory management method and device for physical server for running cloud service instance | |
CN117971716A (en) | Cache management method, equipment, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: FINAL REJECTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: ADVISORY ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
FEPP | Fee payment procedure |
Free format text: PETITION RELATED TO MAINTENANCE FEES GRANTED (ORIGINAL EVENT CODE: PTGR); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT RECEIVED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FEPP | Fee payment procedure |
Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |