CN104049905A - Migrating pages of different sizes between heterogeneous processors - Google Patents
- Publication number: CN104049905A
- Authority: CN (China)
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
- G06F12/1027—Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/65—Details of virtual memory and virtual address translation
- G06F2212/652—Page size control
Abstract
One embodiment of the present invention provides a computer-implemented method for migrating a memory page from a first memory to a second memory. The method includes determining a first page size supported by the first memory. The method also includes determining a second page size supported by the second memory. The method further includes determining a usage history of the memory page based on an entry in a page state directory associated with the memory page. The method also includes migrating the memory page between the first memory and the second memory based on the first page size, the second page size, and the usage history.
Description
Technical Field
The present invention relates generally to computer science and, more specifically, to migrating pages of different sizes between heterogeneous processors.
Background
A typical computer system includes a central processing unit (CPU) and one or more graphics processing units (GPUs). Some advanced computer systems implement a unified virtual memory architecture common to both the CPU and the GPU. Among other things, the architecture enables the CPU and the GPU to access a physical memory location using a common (e.g., the same) virtual memory address, regardless of whether the physical memory location is within system memory or memory local to the GPU.
In such unified virtual memory architectures, memory pages may advantageously be sized differently depending on whether the memory page is stored in a memory unit associated with the CPU or with the GPU. One drawback of differently sized memory pages, however, is that migrating memory pages between those different memory units becomes more complicated. For example, one difficulty that arises is migrating a large memory page to a memory unit that stores only small memory pages. In such a situation, the unified virtual memory architecture must decide how to accommodate the difference in page size.
As the foregoing illustrates, what is needed in the art is a more effective approach for migrating memory pages of different sizes in systems that implement a unified virtual memory architecture.
Summary of the Invention
One embodiment of the present invention provides a computer-implemented method for migrating a memory page from a first memory to a second memory. The method includes determining a first page size supported by the first memory. The method also includes determining a second page size supported by the second memory. The method further includes determining a usage history of the memory page based on an entry in a page state directory associated with the memory page. The method also includes migrating the memory page between the first memory and the second memory based on the first page size, the second page size, and the usage history.
One advantage of the disclosed technique is that memory pages of different sizes can be migrated effectively back and forth between different memory units in a virtual memory architecture. The technique improves the flexibility of the unified virtual memory system by allowing that system to work with many different types of memory architectures. A related advantage is that, by allowing large memory pages to be split into smaller memory pages and small memory pages to be coalesced into larger memory pages, memory pages of different sizes can be stored in different memory units configured for different memory page sizes. This feature allows the unified virtual memory system to group pages together where possible in order to reduce the amount of space occupied in page tables and/or translation lookaside buffers (TLBs). The feature also allows memory pages to be split apart and migrated to different memory units whenever such splitting would improve memory locality and reduce memory access times.
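The split-and-coalesce behavior described above can be sketched in simplified form. This is an illustrative model only, not the patented implementation: the page sizes, function names, and the flat (address, size) representation of a page are assumptions made for the example.

```python
# Illustrative model only: sizes and structures are assumptions,
# not the patent's actual data layout.
LARGE_PAGE = 64 * 1024   # hypothetical large page size (e.g., PPU memory)
SMALL_PAGE = 4 * 1024    # hypothetical small page size (e.g., system memory)

def split_for_destination(addr, size, dst_page_size):
    """When the destination memory supports only pages smaller than the
    migrating page, split the page into destination-sized pieces."""
    if dst_page_size >= size:
        return [(addr, size)]              # fits as-is: no split needed
    return [(addr + off, dst_page_size)
            for off in range(0, size, dst_page_size)]

def try_merge(small_pages, large_page_size):
    """Coalesce contiguous, aligned small pages into one large page, so a
    single page-table/TLB entry can cover them; return None if the pages
    do not exactly tile an aligned large-page-sized region."""
    pages = sorted(small_pages)
    base = pages[0][0]
    if base % large_page_size != 0:
        return None                        # not aligned to a large page
    cursor = base
    for addr, size in pages:
        if addr != cursor:
            return None                    # hole: cannot merge
        cursor = addr + size
    if cursor - base != large_page_size:
        return None                        # does not cover a full large page
    return (base, large_page_size)
```

Splitting supports migrating only the actively used small pieces of a large page (improving locality), while coalescing reduces the number of page-table and TLB entries, mirroring the trade-off described above.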
Brief Description of the Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to exemplary embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
Figure 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention;
Figure 2 is a block diagram illustrating a unified virtual memory (UVM) system, according to one embodiment of the present invention;
Figure 3 illustrates operations for transferring a small memory page from system memory to PPU memory, according to one embodiment of the present invention;
Figure 4 illustrates operations for transferring a small memory page and an associated "sibling" memory page from system memory to PPU memory, according to one embodiment of the present invention;
Figure 5 illustrates operations for transferring a small memory page from PPU memory to system memory, according to one embodiment of the present invention;
Figure 6 illustrates operations for transferring a small memory page and a sibling memory page from PPU memory 204 to system memory 104, according to one embodiment of the present invention; and
Figure 7 is a flow diagram of method steps for migrating memory pages of different sizes between memory units in a virtual memory architecture, according to one embodiment of the present invention.
Detailed Description
In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. It will be apparent to one of skill in the art, however, that the present invention may be practiced without one or more of these specific details.
System Overview
Figure 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via an interconnection path that may include a memory bridge 105. Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via communication path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or second communication path 113 (e.g., a Peripheral Component Interconnect (PCI) Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110, which may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, or the like. A system disk 114 is also connected to I/O bridge 107 and may be configured to store content, applications, and data for use by CPU 102 and parallel processing subsystem 112. System disk 114 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only memory), DVD-ROM (digital versatile disc-ROM), Blu-ray, HD-DVD (high definition DVD), or other magnetic, optical, or solid state storage devices.
A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107. The various communication paths shown in Figure 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols, as is known in the art.
In one embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes one or more parallel processing units (PPUs) 202. In another embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for general purpose processing, while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC). As is well known, many graphics processing units (GPUs) are designed to perform parallel operations and computations and thus are considered a class of parallel processing unit (PPU).
Any number of PPUs 202 can be included in the parallel processing subsystem 112. For instance, multiple PPUs 202 can be provided on a single add-in card, or multiple add-in cards can be connected to the communication path 113, or one or more of the PPUs 202 can be integrated into a bridge chip. The PPUs 202 in a multi-PPU system may be identical to or different from one another. For instance, different PPUs 202 might have different numbers of processing cores, different amounts of local parallel processing memory, and so on. Where multiple PPUs 202 are present, those PPUs may be operated in parallel to process data at a higher throughput than is possible with a single PPU 202. Systems incorporating one or more PPUs 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.
The PPU 202 advantageously implements a highly parallel processing architecture. The PPU 202 includes a number of general processing clusters (GPCs). Each GPC is capable of executing a large number (e.g., hundreds or thousands) of threads concurrently, where each thread is an instance of a program. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each of the GPCs. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program.
The GPU includes a number of streaming multiprocessors (SMs), where each SM is configured to process one or more thread groups. A series of instructions transmitted to a particular GPC constitutes a thread, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with each thread of the group being assigned to a different processing engine within an SM. Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within an SM. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA") or "thread array."
In embodiments of the present invention, it is desirable to use the PPU 202 or other processor(s) of a computing system to execute general-purpose computations using thread arrays. Each thread in a thread array is assigned a unique thread identifier ("thread ID") that is accessible to the thread during the thread's execution. The thread ID, which can be defined as a one-dimensional or multi-dimensional numerical value, controls various aspects of the thread's processing behavior. For instance, a thread ID may be used to determine which portion of an input data set a thread is to process and/or to determine which portion of an output data set a thread is to produce or write.
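As a concrete illustration of a thread ID selecting a portion of the input data, consider the following sketch. The interleaved partitioning scheme and function name are hypothetical; the patent does not prescribe any particular partitioning.

```python
def thread_portion(thread_id, num_threads, data):
    """Select the slice of the input data this thread processes, using an
    interleaved partitioning (stride = thread count) keyed by thread ID."""
    return data[thread_id::num_threads]

# Four threads of a thread array cooperatively cover eight input elements;
# together the portions partition the input with no overlap and no gaps.
inputs = list(range(8))
portions = [thread_portion(t, 4, inputs) for t in range(4)]
```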
In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating the operations of other system components. In particular, CPU 102 issues commands that control the operation of PPU 202. In one embodiment, communication path 113 is a PCI Express link in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. PPU 202 advantageously implements a highly parallel processing architecture, and PPU 202 may be provided with any amount of local parallel processing memory (PPU memory).
In some embodiments, system memory 104 includes a unified virtual memory (UVM) driver 101. The UVM driver 101 includes instructions for performing various tasks related to a unified virtual memory (UVM) common to both the CPU 102 and the PPU 202. Among other things, the architecture enables the CPU 102 and the PPU 202 to access a physical memory location using a common virtual memory address, regardless of whether the physical memory location is within the system memory 104 or memory local to the PPU 202 (PPU memory).
It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices. Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.
Unified Virtual Memory System Architecture
Figure 2 is a block diagram illustrating a unified virtual memory (UVM) system 200, according to one embodiment of the present invention. As shown, the unified virtual memory system 200 includes, without limitation, the CPU 102, the system memory 104, and the parallel processing unit (PPU) 202 coupled to a parallel processing unit memory (PPU memory) 204. The CPU 102 and the system memory 104 are coupled to each other and to the PPU 202 via the memory bridge 105.
The CPU 102 executes threads that may request data stored in the system memory 104 or the PPU memory 204 via a virtual memory address. Virtual memory addresses shield threads executing in the CPU 102 from knowledge about the internal workings of the memory system. Thus, a thread may only have knowledge of virtual memory addresses, and may access data by requesting data via a virtual memory address.
The CPU 102 includes a CPU MMU 209, which processes requests from the CPU 102 for translating virtual memory addresses to physical memory addresses. The physical memory addresses are required to access data stored in a physical memory unit such as the system memory 104 or the PPU memory 204. The CPU 102 also includes a CPU fault handler 211, which executes steps in response to the CPU MMU 209 generating a page fault, in order to make requested data available to the CPU 102. The CPU fault handler 211 is generally software that resides in the system memory 104 and executes on the CPU 102, the software being invoked by an interrupt to the CPU 102.
The system memory 104 stores various memory pages (not shown) for use by threads executing on the CPU 102 or the PPU 202. As shown, the system memory 104 stores a CPU page table 206, which includes mappings between virtual memory addresses and physical memory addresses. The system memory 104 also stores a page state directory 210, which acts as a "master page table" for the UVM system 200, as is discussed in greater detail below. The system memory 104 stores a fault buffer 216, which includes entries written by the PPU 202 in order to inform the CPU 102 of a page fault generated by the PPU 202. In some embodiments, the system memory 104 includes the unified virtual memory (UVM) driver 101, which includes instructions that, when executed, cause the CPU 102 to, among other things, execute commands for remedying a page fault. In alternative embodiments, any combination of the page state directory 210 and one or more command queues 214 may be stored in the PPU memory 204. Further, a PPU page table 208 may be stored in the system memory 104.
In a manner analogous to the CPU 102, the PPU 202 executes instructions that may request data stored in the system memory 104 or the PPU memory 204 via a virtual memory address. The PPU 202 includes a PPU MMU 213, which processes requests from the PPU 202 for translating virtual memory addresses to physical memory addresses. The PPU 202 also includes a copy engine 212, which executes commands stored in a command queue 214 for copying memory pages, modifying data in the PPU page table 208, and other commands. A PPU fault handler 215 executes steps in response to a page fault on the PPU 202. The PPU fault handler 215 can be software running on a processor or a dedicated microcontroller in the PPU 202. Alternatively, the PPU fault handler 215 can be a combination of software running on the CPU 102 and software running on the dedicated microcontroller in the PPU 202, communicating with each other. In some embodiments, the CPU fault handler 211 and the PPU fault handler 215 can be invoked by a fault on either the CPU 102 or the PPU 202. The command queue 214 may be in either the PPU memory 204 or the system memory 104, but is preferentially located in the system memory 104.
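The producer/consumer relationship between the driver, which enqueues commands, and copy engine 212, which drains command queue 214, can be modeled roughly as follows. The command encoding, function names, and toy byte-array memories are invented for illustration; the actual queue format is hardware-specific.

```python
from collections import deque

command_queue = deque()   # stand-in for command queue 214

def enqueue_page_copy(src_off, dst_off, nbytes):
    """Driver side: append a page-copy command for the copy engine."""
    command_queue.append(("COPY_PAGE", src_off, dst_off, nbytes))

def copy_engine_drain(system_mem, ppu_mem):
    """Copy-engine side: execute queued commands in FIFO order."""
    while command_queue:
        op, src, dst, n = command_queue.popleft()
        if op == "COPY_PAGE":
            ppu_mem[dst:dst + n] = system_mem[src:src + n]

# Toy memories standing in for system memory 104 and PPU memory 204.
system_mem = bytearray(b"page-data!" + bytes(6))
ppu_mem = bytearray(16)
enqueue_page_copy(0, 0, 10)
copy_engine_drain(system_mem, ppu_mem)
```

Decoupling the enqueue from the execution in this way lets the CPU continue issuing work while the copy engine performs the actual page transfers asynchronously.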
In some embodiments, the CPU fault handler 211 and the UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in the system memory 104 and executes on the CPU 102. The PPU fault handler 215 may be a separate software program running on a processor or a dedicated microcontroller in the PPU 202, or the PPU fault handler 215 may be a separate software program running on the CPU 102.
In other embodiments, the PPU fault handler 215 and the UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in the system memory 104 and executes on the CPU 102. The CPU fault handler 211 may be software that resides in the system memory 104 and executes on the CPU 102.
In other embodiments, the CPU fault handler 211, the PPU fault handler 215, and the UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in the system memory 104 and executes on the CPU 102.
In some embodiments, as described above, the CPU fault handler 211, the PPU fault handler 215, and the UVM driver 101 may all reside in the system memory 104. As shown in Figure 2, the UVM driver 101 resides in the system memory 104, while the CPU fault handler 211 and the PPU fault handler 215 reside in the CPU 102.
The CPU fault handler 211 and the PPU fault handler 215 are responsive to hardware interrupts that may emanate from the CPU 102 or the PPU 202, such as interrupts resulting from page faults. As further described below, the UVM driver 101 includes instructions for performing various tasks related to management of the UVM system 200, including, without limitation, remedying page faults and accessing the CPU page table 206, the page state directory 210, and/or the fault buffer 216.
In some embodiments, the CPU page table 206 and the PPU page table 208 have different formats and contain different information; for example, the PPU page table 208 may contain the following while the CPU page table 206 does not: an atomic disable bit, a compression tag, and a memory swizzling type.
In a manner analogous to the system memory 104, the PPU memory 204 stores various memory pages (not shown). As shown, the PPU memory 204 also includes the PPU page table 208, which includes mappings between virtual memory addresses and physical memory addresses. Alternatively, the PPU page table 208 may be stored in the system memory 104.
Translating Virtual Memory Addresses
When a thread executing in the CPU 102 requests data via a virtual memory address, the CPU 102 requests translation of the virtual memory address to a physical memory address from the CPU memory management unit (CPU MMU) 209. In response, the CPU MMU 209 attempts to translate the virtual memory address into a physical memory address, which specifies a location in a memory unit, such as the system memory 104, that stores the data requested by the CPU 102.
To translate a virtual memory address to a physical memory address, the CPU MMU 209 performs a lookup operation to determine whether the CPU page table 206 includes a mapping associated with the virtual memory address. In addition to a virtual memory address, a request to access data may also indicate a virtual memory address space. The unified virtual memory system 200 may implement multiple virtual memory address spaces, each of which is assigned to one or more threads. Virtual memory addresses are unique within any given virtual memory address space. Further, virtual memory addresses within a given virtual memory address space are consistent across the CPU 102 and the PPU 202, thereby allowing the same virtual memory address to refer to the same data across the CPU 102 and the PPU 202. In some embodiments, two virtual memory addresses may refer to the same data, but may not map to the same physical memory address (e.g., the CPU 102 and the PPU 202 may each have a local read-only copy of the data).
For any given virtual memory address, the CPU page table 206 may or may not include a mapping between the virtual memory address and a physical memory address. If the CPU page table 206 includes a mapping, then the CPU MMU 209 reads that mapping to determine the physical memory address associated with the virtual memory address and provides that physical memory address to the CPU 102. However, if the CPU page table 206 does not include a mapping associated with the virtual memory address, then the CPU MMU 209 is unable to translate the virtual memory address into a physical memory address, and the CPU MMU 209 generates a page fault. To remedy the page fault and make the requested data available to the CPU 102, a "page fault sequence" is executed. More specifically, the CPU 102 reads the PSD 210 to find the current mapping state of the page and then determines the appropriate page fault sequence. The page fault sequence generally maps the memory page associated with the requested virtual memory address or changes the types of accesses permitted (e.g., read access, write access, atomic access). The different types of page fault sequences implemented in the UVM system 200 are discussed in greater detail below.
Within UVM system 200, data associated with a given virtual memory address may be stored in system memory 104, in PPU memory 204, or in both system memory 104 and PPU memory 204 as read-only copies of the same data. Further, for any such data, either or both of CPU page table 206 and PPU page table 208 may include a mapping associated with that data. Notably, some data exists for which a mapping exists in one page table but not in the other. However, PSD 210 includes all mappings stored in PPU page table 208, as well as the PPU-relevant mappings stored in CPU page table 206. PSD 210 thus functions as a "master" page table for the unified virtual memory system 200. Therefore, when CPU MMU 209 does not find a mapping in CPU page table 206 associated with a particular virtual memory address, CPU 102 reads PSD 210 to determine whether PSD 210 includes a mapping associated with that virtual memory address. Various embodiments of PSD 210 may include different types of information associated with virtual memory addresses in addition to mappings associated with virtual memory addresses.
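The fallback lookup, with PSD 210 acting as the "master" table when the local CPU page table misses, can be sketched as follows. Plain dictionaries stand in for the hardware structures, and all names are hypothetical:

```python
def resolve_mapping(virtual_address, cpu_page_table, psd):
    """Consult the local CPU page table first; on a miss, fall back to
    the page state directory (PSD), which holds all mappings plus extra
    per-page state. Returns a dict describing the result, or None if no
    mapping exists anywhere."""
    if virtual_address in cpu_page_table:
        return {"pa": cpu_page_table[virtual_address],
                "source": "cpu_page_table"}
    entry = psd.get(virtual_address)  # may hold ownership state, usage history, etc.
    if entry is not None:
        return {**entry, "source": "psd"}
    return None
```

A `None` result corresponds to an address with no backing page at all, which a real system would treat as an error rather than a migratable fault.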
When CPU MMU 209 generates a page fault, CPU fault handler 211 executes a sequence of operations for the appropriate page fault sequence to remedy the page fault. Again, during a page fault sequence, CPU 102 reads PSD 210 and executes additional operations in order to change the mappings or permissions within CPU page table 206 and PPU page table 208. Such operations may include reading and/or modifying CPU page table 206, reading and/or modifying page state directory 210, and/or migrating blocks of data, referred to as "memory pages," between memory units (e.g., system memory 104 and PPU memory 204).
To determine which operations to execute in a page fault sequence, CPU 102 identifies the memory page associated with the virtual memory address. CPU 102 then reads state information for the memory page from PSD 210 related to the virtual memory address associated with the memory access request that caused the page fault. Such state information may include, among other things, an ownership state for the memory page associated with the virtual memory address. For any given memory page, several ownership states are possible. For example, a memory page may be "CPU-owned," "PPU-owned," or "CPU-shared." A memory page is considered CPU-owned if CPU 102 can access the memory page via a virtual address without causing a page fault, and if PPU 202 cannot access the memory page via a virtual address without causing a page fault. Preferably, a CPU-owned page resides in system memory 104, but can also reside in PPU memory 204. A memory page is considered PPU-owned if PPU 202 can access the page via a virtual address, and if CPU 102 cannot access the page via a virtual address without causing a page fault. Preferably, a PPU-owned page resides in PPU memory 204, but can also reside in system memory 104 when migration from system memory 104 to PPU memory 204 is not performed. Finally, a memory page is considered CPU-shared if both CPU 102 and PPU 202 can access the memory page via a virtual address without causing a page fault. A CPU-shared page may reside in either system memory 104 or PPU memory 204.
CPU page table 206 may assign ownership states to memory pages based on a variety of factors, including the usage history of the memory page. The usage history may include information regarding whether CPU 102 or PPU 202 accessed the memory page recently, and how many times such accesses were made. For example, if, based on the usage history of a given memory page, UVM system 200 determines that the memory page is likely to be used mostly or only by CPU 102, then UVM system 200 may assign an ownership state of "CPU-owned" to the memory page and place the page in system memory 104. Similarly, if, based on the usage history of a given memory page, UVM system 200 determines that the memory page is likely to be used mostly or only by PPU 202, then UVM system 200 may assign an ownership state of "PPU-owned" to the memory page and place the page in PPU memory 204. Finally, if, based on the usage history of a given memory page, UVM system 200 determines that the memory page is likely to be used by both CPU 102 and PPU 202, and that migrating the memory page back and forth between system memory 104 and PPU memory 204 would consume too much time, then UVM system 200 may assign an ownership state of "CPU-shared" to the memory page.
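The three ownership decisions above can be summarized as a small decision function. The inputs and the tie-breaking rule below are illustrative assumptions for this sketch, not details taken from the patent:

```python
def choose_ownership(cpu_accesses, ppu_accesses, migration_too_costly):
    """Pick an ownership state from a page's usage history.
    cpu_accesses / ppu_accesses: recent access counts by each processor.
    migration_too_costly: True if migrating back and forth would consume
    too much time (as judged by some cost model, not shown here)."""
    if migration_too_costly and cpu_accesses > 0 and ppu_accesses > 0:
        return "cpu-shared"        # both processors use it; avoid ping-ponging
    if cpu_accesses >= ppu_accesses:
        return "cpu-owned"         # page would be placed in system memory
    return "ppu-owned"             # page would be placed in PPU memory
```

A real heuristic would weight recency as well as raw counts; this sketch only shows how usage history feeds the ownership assignment.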
As examples, fault handlers 211 and 215 may implement any or all of the following heuristics for migration:
(a) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has not been recently migrated, unmap the faulting page from PPU 202, migrate the page to CPU 102, and map the page to CPU 102;
(b) on a PPU 202 access to an unmapped page that is mapped to CPU 102 and has not been recently migrated, unmap the faulting page from CPU 102, migrate the page to PPU 202, and map the page to PPU 202;
(c) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has been recently migrated, migrate the faulting page to CPU 102 and map the page on both CPU 102 and PPU 202;
(d) on a PPU 202 access to an unmapped page that is mapped on CPU 102 and has been recently migrated, map the page to both CPU 102 and PPU 202;
(e) on a PPU 202 atomic access to a page that is mapped to both CPU 102 and PPU 202 but not enabled for atomic operations by PPU 202, unmap the page from CPU 102, and map the page to PPU 202 with atomic operations enabled;
(f) on a PPU 202 write access to a page that is mapped on CPU 102 and PPU 202 as copy-on-write (COW), copy the page to PPU 202, thereby making an independent copy of the page, map the new page as read-write on the PPU, and leave the current page as mapped on CPU 102;
(g) on a PPU 202 read access to a page that is mapped on CPU 102 and PPU 202 as zero-fill-on-demand (ZFOD), allocate a physical memory page on PPU 202, fill it with zeros, and map that page on the PPU, but change it to unmapped on CPU 102;
(h) on an access by a first PPU 202(1) to an unmapped page that is mapped on a second PPU 202(2) and has not been recently migrated, unmap the faulting page from the second PPU 202(2), migrate the page to the first PPU 202(1), and map the page to the first PPU 202(1); and
(i) on an access by a first PPU 202(1) to an unmapped page that is mapped on a second PPU 202(2) and has been recently migrated, map the faulting page to the first PPU 202(1), and keep the mapping of the page on the second PPU 202(2).
In sum, many heuristic rules are possible, and the scope of the present invention is not limited to these examples.
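Heuristics (a) and (c) above, which handle CPU-side faults, can be expressed as a small dispatch function. The action names below are invented shorthand for this example and are not taken from the patent:

```python
def cpu_fault_action(mapped_to, recently_migrated):
    """Choose the action list for a CPU access to an unmapped page.
    mapped_to: which processor currently maps the page ("ppu" or "cpu").
    recently_migrated: whether the page was migrated recently."""
    if mapped_to == "ppu" and not recently_migrated:
        # heuristic (a): take the page away from the PPU
        return ["unmap_from_ppu", "migrate_to_cpu", "map_on_cpu"]
    if mapped_to == "ppu" and recently_migrated:
        # heuristic (c): page is ping-ponging, so map it on both processors
        return ["migrate_to_cpu", "map_on_cpu", "map_on_ppu"]
    return []  # page already usable by the CPU; nothing to do
```

The PPU-side heuristics (b) and (d) mirror this dispatch with the roles of the processors exchanged.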
In addition, any migration heuristic can "round up" to include more pages or a larger page size, for example:
(j) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from PPU 202, migrate the pages to CPU 102, and map the pages to CPU 102 (in a more detailed example: for a 4kB faulted page, migrate the aligned 64kB region that includes the 4kB faulted page);
(k) on a PPU 202 access to an unmapped page that is mapped to CPU 102 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from CPU 102, migrate the pages to PPU 202, and map the pages to PPU 202 (in a more detailed example: for a 4kB faulted page, migrate the aligned 64kB region that includes the 4kB faulted page);
(l) on a CPU 102 access to an unmapped page that is mapped to PPU 202 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from PPU 202, migrate the pages to CPU 102, map the pages to CPU 102, and treat all the migrated pages as one or more larger pages on CPU 102 (in a more detailed example: for a 4kB faulted page, migrate the aligned 64kB region that includes the 4kB faulted page, and treat the aligned 64kB region as a 64kB page);
(m) on a PPU 202 access to an unmapped page that is mapped to CPU 102 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from CPU 102, migrate the pages to PPU 202, map the pages to PPU 202, and treat all the migrated pages as one or more larger pages on PPU 202 (in a more detailed example: for a 4kB faulted page, migrate the aligned 64kB region that includes the 4kB faulted page, and treat the aligned 64kB region as a 64kB page);
(n) on an access by a first PPU 202(1) to an unmapped page that is mapped to a second PPU 202(2) and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from the second PPU 202(2), migrate the pages to the first PPU 202(1), and map the pages to the first PPU 202(1); and
(o) on an access by a first PPU 202(1) to an unmapped page that is mapped on a second PPU 202(2) and has been recently migrated, map the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, to the first PPU 202(1), and keep the mappings of those pages on the second PPU 202(2).
Again, many heuristic rules are possible, and the scope of the present invention is not limited to these examples.
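In the round-up heuristics (j) through (o), the aligned 64kB region that contains a 4kB faulting page can be computed by masking the fault address down to a 64kB boundary. A minimal sketch (the sizes match the detailed examples above; the function name is invented):

```python
SMALL_PAGE = 4 * 1024    # 4kB faulting page size
BIG_REGION = 64 * 1024   # aligned 64kB migration region

def region_to_migrate(fault_va):
    """Return the start addresses of all 4kB pages inside the aligned
    64kB region that contains the faulting virtual address."""
    base = fault_va & ~(BIG_REGION - 1)   # round down to a 64kB boundary
    return [base + i * SMALL_PAGE for i in range(BIG_REGION // SMALL_PAGE)]
```

Under heuristics (l) and (m), the sixteen 4kB pages returned here would then be treated as a single 64kB page on the destination processor.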
In some embodiments, PSD entries may include transitional state information to ensure proper synchronization between various requests made by units within CPU 102 and PPU 202. For example, a PSD 210 entry may include transitional state information indicating that a particular page is in the process of transitioning from CPU-owned to PPU-owned. Various units in CPU 102 and PPU 202, such as CPU fault handler 211 and PPU fault handler 215, upon determining that a page is in such a transitional state, may forego portions of a page fault sequence to avoid repeating steps in a page fault sequence already triggered by a prior virtual memory access to the same virtual memory address. As a specific example, if a page fault results in a page being migrated from system memory 104 to PPU memory 204, then a different page fault that would cause the same migration is detected and does not cause another page migration. Further, various units in CPU 102 and PPU 202 may implement atomic operations for proper ordering of operations on PSD 210. For example, for modifications to PSD 210 entries, CPU fault handler 211 or PPU fault handler 215 may issue an atomic compare-and-swap operation to modify the page state of a particular entry in PSD 210. Consequently, the modification is done without interference by operations from other units.
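The compare-and-swap style of PSD update can be modeled as follows. A Python lock stands in for the hardware atomic primitive, and the state names are illustrative:

```python
import threading

class PsdEntry:
    """Models a single PSD entry whose page state is updated atomically."""
    def __init__(self, state):
        self.state = state
        self._lock = threading.Lock()  # stand-in for a hardware atomic op

    def compare_and_swap(self, expected, new):
        """Update the page state only if it still equals 'expected'.
        Returns True on success, False if another unit changed it first."""
        with self._lock:
            if self.state == expected:
                self.state = new
                return True
            return False
```

A second fault handler whose compare-and-swap fails knows another unit already began the transition and can forego the remainder of its own page fault sequence.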
Multiple PSDs 210 may be stored in system memory 104, one for each virtual memory address space. A memory access request generated by either CPU 102 or PPU 202 may therefore include a virtual memory address and also identify the virtual memory address space associated with that virtual memory address.
Just as CPU 102 may execute memory access requests that include virtual memory addresses (i.e., instructions that include requests to access data via a virtual memory address), PPU 202 may also execute similar types of memory access requests. More specifically, PPU 202 includes a plurality of execution units, such as GPCs and SMs, described above in conjunction with FIG. 1, that are configured to execute multiple threads and thread groups. In operation, those threads may request data from memory (e.g., system memory 104 or PPU memory 204) by specifying a virtual memory address. Just as with CPU 102 and CPU MMU 209, PPU 202 includes a PPU memory management unit (MMU) 213. PPU MMU 213 receives requests for translation of virtual memory addresses from PPU 202 and attempts to provide translations from PPU page table 208 for the virtual memory addresses.
Similar to CPU page table 206, PPU page table 208 includes mappings between virtual memory addresses and physical memory addresses. As is also the case with CPU page table 206, for any given virtual address, PPU page table 208 may not include a page table entry that maps the virtual memory address to a physical memory address. As with CPU MMU 209, when PPU MMU 213 requests a translation for a virtual memory address from PPU page table 208 and either no mapping exists in PPU page table 208 or the type of access is not allowed by PPU page table 208, PPU MMU 213 generates a page fault. Subsequently, PPU fault handler 215 triggers a page fault sequence. Again, the different types of page fault sequences implemented in UVM system 200 are described in greater detail below.
During a page fault sequence, CPU 102 or PPU 202 may write commands into command queue 214 for execution by copy engine 212. Such an approach frees up CPU 102 or PPU 202 to execute other tasks while copy engine 212 reads and executes the commands stored in command queue 214, and allows all the commands for a fault sequence to be queued at one time, thereby avoiding the need to monitor the progress of the fault sequence. Commands executed by copy engine 212 may include, among other things, deleting, creating, or modifying page table entries in PPU page table 208, reading or writing data from system memory 104, and reading or writing data to PPU memory 204.
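The queue-then-drain interaction between the faulting processor and the copy engine can be sketched as below. A plain deque stands in for command queue 214, and the command strings are placeholders:

```python
from collections import deque

command_queue = deque()  # stand-in for command queue 214

def enqueue(cmd):
    # The CPU or PPU queues all commands for a fault sequence at once
    # and is then free to execute other tasks.
    command_queue.append(cmd)

def copy_engine_run():
    # The copy engine drains and executes the queued commands in order.
    executed = []
    while command_queue:
        executed.append(command_queue.popleft())
    return executed
```

Because every command for the fault sequence is queued before the engine runs, neither processor needs to poll for the completion of each intermediate step.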
Fault buffer 216 stores fault buffer entries that indicate information related to page faults generated by PPU 202. Fault buffer entries may include, for example, the type of access that was attempted (e.g., read, write, or atomic), the virtual memory address for which an attempted access caused a page fault, the virtual address space, and an indication of the unit or thread that caused the page fault. In operation, when PPU 202 causes a page fault, PPU 202 may write a fault buffer entry into fault buffer 216 to inform PPU fault handler 215 about the faulting page and the type of access that caused the fault. PPU fault handler 215 then performs actions to remedy the page fault. Because PPU 202 executes multiple threads, fault buffer 216 can store multiple faults, where each thread can cause one or more faults due to the pipelined nature of the memory accesses of PPU 202.
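The fields listed for a fault buffer entry map naturally onto a small record type. This sketch uses a Python dataclass and a list as stand-ins for the hardware buffer; the field and function names are invented:

```python
from dataclasses import dataclass

@dataclass
class FaultBufferEntry:
    """One record in the fault buffer, per the fields described above."""
    access_type: str      # "read", "write", or "atomic"
    virtual_address: int  # address whose attempted access faulted
    address_space: int    # id of the virtual memory address space
    source_unit: str      # unit or thread that caused the fault

fault_buffer = []  # stand-in for fault buffer 216

def report_fault(entry):
    # The PPU appends entries; multiple threads may each contribute
    # several faults because PPU memory accesses are pipelined.
    fault_buffer.append(entry)
```

The fault handler would later drain this buffer, grouping entries by page so that duplicate faults on the same page trigger only one migration.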
Page Fault Sequence
As stated above, in response to receiving a request for translation of a virtual memory address, CPU MMU 209 generates a page fault if CPU page table 206 does not include a mapping associated with the requested virtual memory address or does not permit the type of access being requested. Similarly, in response to receiving a request for translation of a virtual memory address, PPU MMU 213 generates a page fault if PPU page table 208 does not include a mapping associated with the requested virtual memory address or does not permit the type of access being requested. When CPU MMU 209 or PPU MMU 213 generates a page fault, the thread that requested the data at the virtual memory address stalls, and a "local fault handler" (CPU fault handler 211 for CPU 102, or PPU fault handler 215 for PPU 202) attempts to remedy the page fault by executing a "page fault sequence." As indicated above, a page fault sequence includes a series of operations that enable the faulting unit (i.e., the unit that caused the page fault, either CPU 102 or PPU 202) to access the data associated with the virtual memory address. After the page fault sequence completes, the thread that requested the data via the virtual memory address resumes execution. In some embodiments, fault recovery is simplified by allowing the fault recovery logic to track faulting memory accesses as opposed to faulting instructions.
The operations executed during a page fault sequence depend on the change in ownership state or the change in access permissions, if any, that the memory page associated with the page fault has to undergo. The transition from a current ownership state to a new ownership state, or a change in access permissions, may be part of the page fault sequence. In some instances, migrating the memory page associated with the page fault from system memory 104 to PPU memory 204 is also part of the page fault sequence. In other instances, migrating the memory page associated with the page fault from PPU memory 204 to system memory 104 is also part of the page fault sequence. Various heuristics, described more fully herein, may be used to configure UVM system 200 to change memory page ownership states or to migrate memory pages under various sets of operating conditions and patterns. Described in greater detail below are page fault sequences for the following four memory page ownership state transitions: CPU-owned to CPU-shared, CPU-owned to PPU-owned, PPU-owned to CPU-owned, and PPU-owned to CPU-shared.
A fault by PPU 202 may initiate a transition from CPU-owned to CPU-shared. Before such a transition, a thread executing in PPU 202 attempts to access data at a virtual memory address that is not mapped in PPU page table 208. This access attempt causes a PPU-based page fault, which then causes a fault buffer entry to be written to fault buffer 216. In response, PPU fault handler 215 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, PPU fault handler 215 determines that the current ownership state for the memory page associated with the virtual memory address is CPU-owned. Based on the current ownership state, as well as other factors, such as usage characteristics for the memory page or the type of memory access, PPU fault handler 215 determines that a new ownership state for the page should be CPU-shared.
To change the ownership state, PPU fault handler 215 writes a new entry in PPU page table 208 corresponding to the virtual memory address and associating the virtual memory address with the memory page identified via the PSD 210 entry. PPU fault handler 215 also modifies the PSD 210 entry for that memory page to indicate that the ownership state is CPU-shared. In some embodiments, an entry in a translation look-aside buffer (TLB) in PPU 202 is invalidated to account for the case in which a translation to an invalid page is cached. At this point, the page fault sequence is complete. The ownership state for the memory page is CPU-shared, meaning that the memory page is accessible to both CPU 102 and PPU 202. Both CPU page table 206 and PPU page table 208 include entries that associate the virtual memory address with the memory page.
A fault by PPU 202 may initiate a transition from CPU-owned to PPU-owned. Before such a transition, an operation executing in PPU 202 attempts to access data at a virtual memory address that is not mapped in PPU page table 208. This memory access attempt causes a PPU-based page fault, which then causes a fault buffer entry to be written to fault buffer 216. In response, PPU fault handler 215 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, PPU fault handler 215 determines that the current ownership state for the memory page associated with the virtual memory address is CPU-owned. Based on the current ownership state, as well as other factors, such as usage characteristics for the page or the type of memory access, PPU fault handler 215 determines that a new ownership state for the page should be PPU-owned.
PPU 202 writes a fault buffer entry into fault buffer 216 that indicates that PPU 202 generated a page fault and indicates the virtual memory address associated with the page fault. PPU fault handler 215 executing on CPU 102 reads the fault buffer entry and, in response, CPU 102 removes the mapping in CPU page table 206 associated with the virtual memory address that caused the page fault. CPU 102 may flush caches before and/or after the mapping is removed. CPU 102 also writes commands into command queue 214 instructing PPU 202 to copy the page from system memory 104 into PPU memory 204. Copy engine 212 in PPU 202 reads the commands in command queue 214 and copies the page from system memory 104 to PPU memory 204. PPU 202 writes a page table entry into PPU page table 208 corresponding to the virtual memory address and associating the virtual memory address with the newly copied memory page in PPU memory 204. The writing to PPU page table 208 may be done via PPU 202. Alternatively, CPU 102 may update PPU page table 208. PPU fault handler 215 also modifies the PSD 210 entry for the memory page to indicate that the ownership state is PPU-owned. In some embodiments, entries in TLBs in PPU 202 or CPU 102 may be invalidated to account for the case in which translations were cached. At this point, the page fault sequence is complete. The ownership state for the memory page is PPU-owned, meaning that the memory page is accessible only to PPU 202. Only PPU page table 208 includes an entry that associates the virtual memory address with the memory page.
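The CPU-owned to PPU-owned sequence just described can be summarized as an ordered list of steps. The step names below are shorthand invented for this example:

```python
def cpu_owned_to_ppu_owned_sequence():
    """Return the ordered steps of the CPU-owned -> PPU-owned page
    fault sequence described above (names are illustrative only)."""
    return [
        "ppu_writes_fault_buffer_entry",
        "cpu_removes_mapping_from_cpu_page_table",
        "cpu_flushes_caches",
        "copy_engine_copies_page_to_ppu_memory",
        "write_entry_in_ppu_page_table",
        "psd_updated_to_ppu_owned",
        "invalidate_stale_tlb_entries",
    ]
```

The ordering matters: the CPU-side mapping must be removed before the copy so that no stale writes land in system memory while the page is in flight.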
A fault by CPU 102 may initiate a transition from PPU-owned to CPU-owned. Before such a transition, an operation executing in CPU 102 attempts to access data at a virtual memory address that is not mapped in CPU page table 206, which causes a CPU-based page fault. CPU fault handler 211 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, CPU fault handler 211 determines that the current ownership state for the memory page associated with the virtual memory address is PPU-owned. Based on the current ownership state, as well as other factors, such as usage characteristics for the page or the type of access, CPU fault handler 211 determines that a new ownership state for the page is CPU-owned.
CPU fault handler 211 changes the ownership state associated with the memory page to CPU-owned. CPU fault handler 211 writes a command into command queue 214 to cause copy engine 212 to remove the entry from PPU page table 208 that associates the virtual memory address with the memory page. Various TLB entries may be invalidated. CPU fault handler 211 also copies the memory page from PPU memory 204 into system memory 104, which may be done via command queue 214 and copy engine 212. CPU fault handler 211 writes a page table entry into CPU page table 206 that associates the virtual memory address with the memory page copied into system memory 104. CPU fault handler 211 also updates PSD 210 to associate the virtual memory address with the newly copied memory page. At this point, the page fault sequence is complete. The ownership state for the memory page is CPU-owned, meaning that the memory page is accessible only to CPU 102. Only CPU page table 206 includes an entry that associates the virtual memory address with the memory page.
A fault by CPU 102 may initiate a transition from PPU-owned to CPU-shared. Before such a transition, an operation executing in CPU 102 attempts to access data at a virtual memory address that is not mapped in CPU page table 206, which causes a CPU-based page fault. CPU fault handler 211 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading PSD 210, CPU fault handler 211 determines that the current ownership state for the memory page associated with the virtual memory address is PPU-owned. Based on the current ownership state, as well as other factors, such as usage characteristics for the page, CPU fault handler 211 determines that a new ownership state for the page is CPU-shared.
The CPU fault handler 211 changes the ownership state associated with the memory page to CPU-shared. The CPU fault handler 211 writes a command into the command queue 214 to cause the copy engine 212 to remove from the PPU page table 208 the entry that associates the virtual memory address with the memory page. Various TLB entries may be invalidated. The CPU fault handler 211 also copies the memory page from the PPU memory 204 into the system memory 104; this copy operation may be done via the command queue 214 and the copy engine 212. The CPU fault handler 211 then writes a command into the command queue 214 to cause the copy engine 212 to change the entry in the PPU page table 208 so that the virtual memory address is associated with the memory page in the system memory 104. The CPU fault handler 211 writes a page table entry into the CPU page table 206 to associate the virtual memory address with the memory page in the system memory 104. The CPU fault handler 211 also updates the PSD 210 to associate the virtual memory address with the memory page in the system memory 104. At this point, the page fault sequence is complete. The ownership state of the page is CPU-shared, and the memory page has been copied into the system memory 104. Because the CPU page table 206 contains an entry that associates the virtual memory address with the memory page in the system memory 104, the page is accessible to the CPU 102. Because the PPU page table 208 also contains such an entry, the page is accessible to the PPU 202 as well.
Detailed Example of a Page Fault Sequence
In this context, a detailed description is now provided of the page fault sequence executed by the PPU fault handler 215 for a transition from CPU-owned to CPU-shared, to illustrate how atomic operations and transition states can be used to manage the sequence more efficiently. The page fault sequence is triggered by a thread in the PPU 202 that attempts to access a virtual address for which no corresponding mapping exists in the PPU page table 208. When the thread attempts to access data via the virtual memory address, the PPU 202 (specifically, a user-level thread) requests a translation from the PPU page table 208. Because the PPU page table 208 contains no mapping associated with the requested virtual memory address, a PPU page fault occurs in response.
After the page fault occurs, the thread is trapped and stalled, and the PPU fault handler 215 executes the page fault sequence. The PPU fault handler 215 reads the PSD 210 to determine which memory page is associated with the virtual memory address and to determine the state of that address. The PPU fault handler 215 determines from the PSD 210 that the ownership state of the memory page is CPU-owned; consequently, the data requested by the PPU 202 is not yet accessible to the PPU 202 via the virtual memory address. The state information for the memory page also indicates that the requested data cannot be migrated to the PPU memory 204.
Based on the state information obtained from the PSD 210, the PPU fault handler 215 determines that the new state of the memory page should be CPU-shared. The PPU fault handler 215 changes the state to "transitioning to CPU-shared." This state indicates that the page is currently in the process of transitioning to CPU-shared. Because the PPU fault handler 215 runs on a microcontroller in the memory management unit, the two processors update the PSD 210 asynchronously, using an atomic compare-and-swap ("CAS") operation on the PSD 210 to change the state to "transitioning to GPU-visible" (CPU-shared).
The PPU 202 updates the PPU page table 208 to associate the virtual memory address with the memory page. The PPU 202 also invalidates the corresponding TLB cache entries. Next, the PPU 202 performs another atomic compare-and-swap on the PSD 210 to change the ownership state associated with the memory page to CPU-shared. Finally, the page fault sequence ends, and the thread that requested the data via the virtual memory address resumes execution.
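The two-step CAS transition described above (claim the entry with a transition state, do the page table work, then publish the final state) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the state encodings, the structure layout, and the function name are invented for the example, and the page table and TLB work is elided to a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical PSD state encodings -- the actual values are not given
 * in the text; these names follow the states it describes. */
enum psd_state {
    PSD_CPU_OWNED,
    PSD_TRANSITIONING_TO_CPU_SHARED,
    PSD_CPU_SHARED,
};

/* One PSD entry's state word, updated asynchronously by both processors. */
typedef struct {
    _Atomic int state;
} psd_entry;

/* Attempt the CPU-owned -> transitioning -> CPU-shared sequence.
 * Returns false if another processor changed the entry first, in which
 * case the fault handler would re-read the PSD and retry. */
bool psd_transition_to_cpu_shared(psd_entry *e)
{
    int expected = PSD_CPU_OWNED;
    /* First CAS: claim the entry by marking it "in transition". */
    if (!atomic_compare_exchange_strong(&e->state, &expected,
                                        PSD_TRANSITIONING_TO_CPU_SHARED))
        return false;

    /* ... here the handler would update the PPU page table 208 and
     * invalidate TLB entries ... */

    /* Second CAS: publish the final CPU-shared state. */
    expected = PSD_TRANSITIONING_TO_CPU_SHARED;
    return atomic_compare_exchange_strong(&e->state, &expected,
                                          PSD_CPU_SHARED);
}
```

The intermediate state is what lets a concurrent fault handler on the other processor see that the entry is already being serviced and back off rather than start a second, conflicting migration.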
Variations on the UVM System Architecture
Various modifications to the unified virtual memory system 200 are possible. For example, in some embodiments, upon writing a fault buffer entry into the fault buffer 216, the PPU 202 may trigger a CPU interrupt to cause the CPU 102 to read the fault buffer entry in the fault buffer 216 and perform whatever operations are appropriate in response to it. In other embodiments, the CPU 102 may periodically poll the fault buffer 216. If the CPU 102 finds a fault buffer entry in the fault buffer 216, the CPU 102 performs a sequence of operations in response to that entry.
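The interrupt-driven and polling variants above differ only in who initiates the read of the fault buffer 216. A minimal sketch of the polling side might look like the following; the entry layout, ring structure, and function name are assumptions for illustration, since the text does not specify a fault buffer format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FAULT_BUF_ENTRIES 16

/* Hypothetical fault buffer entry: written by the PPU, consumed by the CPU. */
struct fault_entry {
    uint64_t fault_va;   /* virtual address that faulted */
    bool     valid;      /* set by the PPU, cleared by the CPU */
};

struct fault_buffer {
    struct fault_entry entries[FAULT_BUF_ENTRIES];
    unsigned head;       /* next entry for the CPU to examine */
};

/* One polling step (the "CPU 102 periodically polls" variant):
 * returns true and stores the faulting address if an entry was pending,
 * after which the CPU would run the corresponding page fault sequence. */
bool poll_fault_buffer(struct fault_buffer *fb, uint64_t *fault_va)
{
    struct fault_entry *e = &fb->entries[fb->head];
    if (!e->valid)
        return false;
    *fault_va = e->fault_va;
    e->valid = false;                        /* hand the slot back to the PPU */
    fb->head = (fb->head + 1) % FAULT_BUF_ENTRIES;
    return true;
}
```

In the interrupt-driven variant, the same consume logic would run from the interrupt handler instead of a periodic timer.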
In some embodiments, the system memory 104, rather than the PPU memory 204, stores the PPU page table 208. In other embodiments, a single-level or multi-level cache hierarchy, such as a single-level or multi-level translation lookaside buffer (TLB) hierarchy (not shown), may be implemented to cache virtual address translations for the CPU page table 206 or the PPU page table 208.
In still other embodiments, the PPU 202 may take one or more actions when a thread executing in the PPU 202 causes a PPU fault (a "faulting thread"). These actions include stalling the entire PPU 202, stalling the SM that is executing the faulting thread, stalling the PPU MMU 213, stalling only the faulting thread, or stalling one or more levels of TLBs. In some embodiments, after a PPU page fault occurs and the unified virtual memory system 200 has executed the page fault sequence, the faulting thread resumes execution and retries the memory access request that caused the page fault. In some embodiments, stalling at the TLB is done in such a way that it appears to the faulting SM or faulting thread as a long-latency memory access, so that the SM need not take any special action in response to the fault.
Finally, in other alternative embodiments, the UVM driver 101 may contain instructions that cause the CPU 102 to perform one or more operations for managing the UVM system 200 and remedying page faults, such as accessing the CPU page table 206, the PSD 210, and/or the fault buffer 216. In other embodiments, an operating system kernel (not shown) may be configured to manage the UVM system 200 and remedy page faults by accessing the CPU page table 206, the PSD 210, and/or the fault buffer 216. In yet other embodiments, the operating system kernel may operate together with the UVM driver 101 to manage the UVM system 200 and remedy page faults by accessing the CPU page table 206, the PSD 210, and/or the fault buffer 216.
Migrating Memory Pages of Different Sizes
Memory pages stored in the system memory 104 are permitted to have a different size than memory pages stored in the PPU memory 204. For example, memory pages stored in the system memory 104 may have a size of 4 KB, while memory pages stored in the PPU memory 204 may have a size of 128 KB. As another example, memory pages stored in the system memory 104 may have a size of 4 KB, while the PPU memory 204 may hold a mix of 4 KB and 128 KB pages. As yet another example, the system memory 104 may hold a mix of 4 KB and 1 MB pages, while the PPU memory 204 may hold a mix of 4 KB and 128 KB pages. During a page fault sequence, the UVM system 200 may transfer a memory page from one memory unit to another (e.g., from the PPU memory 204 to the system memory 104). To accommodate differences in page size, the UVM system 200 may split a large memory page or combine multiple small memory pages when it transfers memory pages. The UVM system 200 may also transfer one or more additional "sibling" memory pages along with the memory page being transferred. In some embodiments, a sibling memory page is a memory page of the smaller size (e.g., a 4 KB page in a system that stores both 4 KB and 128 KB pages) that fits within the aligned address span of a larger page. An aligned address span is the start-to-end address range of a memory page of the larger size; smaller pages that lie within such an address span are considered sibling memory pages.
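The aligned-address-span relationship above is simple address arithmetic. The sketch below uses the 4 KB/128 KB sizes from the example to compute the span a small page belongs to and to enumerate its siblings; the function names and the fixed sizes are illustrative assumptions, not taken from the patent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMALL_PAGE_SIZE 0x1000u   /* 4 KB */
#define LARGE_PAGE_SIZE 0x20000u  /* 128 KB: 32 small pages per aligned span */

/* Start of the aligned large-page span containing the address. */
static uint64_t span_base(uint64_t addr)
{
    return addr & ~(uint64_t)(LARGE_PAGE_SIZE - 1);
}

/* Fill out[] with the base addresses of every small page in the span
 * of `addr` except `addr`'s own page -- i.e., its sibling pages.
 * Returns the number of siblings written (at most `max`). */
static size_t sibling_pages(uint64_t addr, uint64_t out[], size_t max)
{
    uint64_t base = span_base(addr);
    uint64_t own  = addr & ~(uint64_t)(SMALL_PAGE_SIZE - 1);
    size_t n = 0;
    for (uint64_t p = base; p < base + LARGE_PAGE_SIZE; p += SMALL_PAGE_SIZE)
        if (p != own && n < max)
            out[n++] = p;
    return n;
}
```

For a 128 KB span of 4 KB pages, every small page therefore has 31 siblings, all sharing the same span base.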
Several operations associated with splitting a large memory page into multiple smaller pages, or combining multiple small pages into a larger page, and transferring those pages between memory units are described below with reference to Figures 3-7. For example, operations are described that occur when: a small memory page is transferred from the system memory 104 to the PPU memory 204 (Figure 3); a small memory page and its sibling pages are transferred from the system memory 104 into a large memory page in the PPU memory 204 (Figure 4); a large memory page in the PPU memory 204 is split into small memory pages and one of those small pages is transferred from the PPU memory 204 to the system memory 104 (Figure 5); and a small memory page and its sibling pages are transferred from the PPU memory 204 into the system memory 104 (Figure 6).
Figure 3 illustrates an operation for transferring a small memory page from the system memory 104 to the PPU memory 204, according to one embodiment of the invention. In this operation, the system memory 104 stores small memory pages 302, and the PPU memory 204 stores both small memory pages 304 and large memory pages 306. In this operation, the UVM driver 101 determines that a particular small memory page 302(0) in the system memory 104 is to be migrated to the PPU memory 204. Because the PPU memory 204 can store memory pages of either the small or the large size, the migrated page is stored as a small memory page 304(1). The migrated small memory page 304(1) may be stored next to the other small memory pages 304 in the PPU memory 204; large memory pages 306 may also be stored in the PPU memory 204. After the migration, the space occupied in the system memory by the migrated small page is deallocated and becomes available for future allocations.
The UVM driver 101 performs the operation by modifying the PSD entries for the associated memory pages. Modifying a PSD entry may include setting the entry for the associated memory page to indicate an intermediate and/or locked state. More specifically, the UVM driver 101 sets the PSD entry associated with memory page 302(0) to indicate that the page is in transit and is read-only. The UVM driver 101 then sets the PSD entry associated with the target memory page 304(1) to indicate that the target page is in transit and cannot be accessed. The UVM driver 101 copies small memory page 302(0) to the target memory page, and then sets the PSD entry to indicate that the target memory page is accessible (readable and writable).
As described above with reference to Figure 2, a particular memory page may be migrated for a wide variety of reasons. The operation shown in Figure 3 may be performed when a single small memory page 302 stored in the system memory 104 is needed in the PPU memory 204, but the "sibling" memory pages surrounding that single page are still needed in the system memory 104. This situation arises when the single small memory page 302 is frequently accessed by, for example, the PPU 202, while one or more of its sibling pages are frequently accessed by, for example, the CPU 102. In general, when a group of memory pages comprising a single small page and its sibling pages is "strongly contended," meaning that different pages in the group are being accessed by different processing units such as the CPU 102 and the PPU 202, the UVM driver 101 may split the group of memory pages.
Figure 4 illustrates an operation for transferring a small memory page and its associated "sibling" pages from the system memory 104 to the PPU memory 204, according to one embodiment of the invention. In this operation, as in the operation described with reference to Figure 3, the system memory 104 stores small memory pages 402, and the PPU memory 204 stores both small memory pages 404 and large memory pages 406. In the operation shown in Figure 4, the UVM driver 101 determines that a particular small memory page 402(1), together with sibling memory pages 402(0), 402(2), and 402(3), is to be migrated from the system memory 104 to the PPU memory 204, and causes those pages to be migrated. Because the PPU memory 204 generally operates more efficiently with larger memory pages, the UVM driver 101 may coalesce the small memory pages 402 copied from the system memory 104 into one large coalesced memory page 405 containing all of the data from the small pages.
In addition, as described above with reference to Figures 2 and 3, the reasons for migrating a particular memory page can vary, such as usage history. The operation shown in Figure 4 may be performed when a particular memory page is needed in the PPU memory 204 and moving the sibling pages as well is deemed advantageous. In one example, such pages are coalesced based on a least-recently-used tracking scheme: small pages that are accessed infrequently are merged together.
As described above, the UVM driver 101 performs the operation by modifying the PSD entries for the associated memory pages. Modifying a PSD entry may include setting the entry for the associated memory page to indicate an intermediate and/or locked state. More specifically, the UVM driver 101 sets the PSD entries associated with memory pages 402(0), 402(1), 402(2), and 402(3) to indicate that these pages are in transit and are read-only. The UVM driver 101 then sets the PSD entry associated with the target large memory page 406 to indicate that the large page is in transit and cannot be accessed. The UVM driver 101 then copies the small memory pages into the target large memory page 406, and finally sets the PSD entry to indicate that the target large memory page 406 is accessible (readable and writable).
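The coalescing sequence just described (mark the source pages read-only, lock the target, copy, then unlock the target) can be reduced to a short sketch. This is an assumption-laden illustration: the PSD is modeled as a single state flag per page, four siblings stand in for a full span, and plain `memcpy` stands in for the command queue 214 and copy engine 212.

```c
#include <assert.h>
#include <string.h>

#define SMALL_PAGE   4096   /* 4 KB */
#define NUM_SIBLINGS 4      /* pages 402(0)..402(3) in the example */

enum page_state { ACCESSIBLE, IN_TRANSIT_READ_ONLY, IN_TRANSIT_NO_ACCESS };

struct small_page { enum page_state state; unsigned char data[SMALL_PAGE]; };
struct large_page { enum page_state state; unsigned char data[NUM_SIBLINGS * SMALL_PAGE]; };

/* Coalesce a group of sibling small pages into one large page, following
 * the PSD ordering described in the text. */
void coalesce(struct small_page src[NUM_SIBLINGS], struct large_page *dst)
{
    for (int i = 0; i < NUM_SIBLINGS; i++)
        src[i].state = IN_TRANSIT_READ_ONLY;   /* sources: in transit, read-only */
    dst->state = IN_TRANSIT_NO_ACCESS;         /* target: in transit, no access  */

    for (int i = 0; i < NUM_SIBLINGS; i++)     /* copy each sibling into its slot */
        memcpy(dst->data + (size_t)i * SMALL_PAGE, src[i].data, SMALL_PAGE);

    dst->state = ACCESSIBLE;                   /* target: readable and writable  */
}
```

The ordering matters: the sources stay readable throughout, while the target is unreadable until the copy completes, which matches the intermediate/locked states the PSD entries encode.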
Figure 5 illustrates an operation for transferring a small memory page from the PPU memory 204 to the system memory 104, according to one embodiment of the invention. In this operation, as in the operations described above with reference to Figures 3 and 4, the system memory 104 stores small memory pages 502, and the PPU memory 204 stores both large memory pages 504 and small memory pages 506. In this operation, the UVM driver 101 determines that a particular portion of a large memory page stored in the PPU memory 204 is to be migrated to the system memory 104. The UVM driver 101 causes the large memory page to be broken up into small memory pages, and then causes the small memory page associated with that portion to be migrated to the system memory 104. Again, as described above, the reasons for migrating a particular memory page can vary, and include usage history.
As described above, "strongly contended" memory pages may be split. In other words, for a particular large memory page, if the small memory pages within it are frequently accessed by multiple different processing units, such as the CPU 102 or the PPU 202, that large page may be split. This analysis is one kind of analysis based on usage history.
As described above, the UVM driver 101 modifies the PSD entries for the associated memory pages to perform the operation shown in Figure 5. More specifically, the UVM driver 101 sets the PSD entry associated with the large memory page to be split to indicate that the page is in transit and is read-only. The UVM driver 101 then sets the PSD entry associated with the target small memory page 502 to indicate that the small page is in transit and cannot be accessed. The UVM driver 101 then copies the relevant portion of the large memory page 504 into the target small memory page 502, and finally sets the PSD entry to indicate that the target small memory page 502 is accessible (readable and writable).
Figure 6 illustrates an operation for transferring a small memory page and its sibling pages from the PPU memory 204 to the system memory 104, according to one embodiment of the invention. In this operation, as in the operations described above with reference to Figures 3-5, the system memory 104 stores small memory pages 602, and the PPU memory 204 stores both large memory pages 606 and small memory pages 604. In this operation, the UVM driver 101 determines that a particular portion of a large memory page 606 is to be migrated from the PPU memory 204 to the system memory 104. The UVM driver 101 causes the large memory page to be split into small memory pages 604, and causes small memory page 604(1) and its sibling pages, namely small memory pages 604(0), 604(2), and 604(3), to be migrated to the system memory 104. As in Figures 3-5, a particular memory page such as small memory page 604(1) may be migrated from the PPU memory 204 to the system memory 104 for a wide variety of reasons, as described above with reference to Figures 1 and 2.
Also as described above, the UVM driver 101 performs the operation by modifying the PSD entries for the associated memory pages. More specifically, the UVM driver 101 sets the PSD entry associated with the large memory page to be split to indicate that the page is in transit and is read-only. The UVM driver 101 then sets the PSD entries associated with the target small memory pages 602 to indicate that those small pages are in transit and cannot be accessed. The UVM driver 101 then copies the large memory page 606, now broken up into small memory pages, into the target small memory pages 602, and finally sets the PSD entries to indicate that the target small memory pages 602 are accessible (readable and writable).
Figure 7 is a flow diagram of method steps for migrating memory pages of different sizes between memory units in a virtual memory architecture, according to one embodiment of the invention. Although the method steps are described in conjunction with Figures 1-6, persons skilled in the art will understand that any system configured to perform the method steps, in any order, falls within the scope of the present invention.
As shown, a method 700 begins at step 702, where the UVM driver 101 determines that a memory page is to be migrated. At step 704, the UVM driver 101 determines whether the memory page is a large memory page. If the memory page is a large memory page, the method 700 proceeds to step 706. At step 706, the UVM driver 101 determines whether to split the large memory page. If the UVM driver 101 determines that the large memory page should be split, the method proceeds to step 708. At step 708, the UVM driver 101 splits the large memory page and copies the small memory pages from the large memory page from one memory unit to another. If, at step 706, the UVM driver 101 determines not to split the large memory page, the method proceeds to step 710. At step 710, the UVM driver 101 copies the large memory page from one memory unit to another.
Returning to step 704, if the UVM driver 101 determines that the memory page is not a large memory page, then the memory page is a small memory page and the method proceeds to step 712. At step 712, the UVM driver 101 determines whether to coalesce the small memory page with its sibling pages. If the UVM driver 101 determines that the small memory pages should be coalesced, the method proceeds to step 714. At step 714, the UVM driver 101 copies the small memory page and its sibling pages from one memory unit to another. If, at step 712, the UVM driver 101 determines not to coalesce the memory page with its sibling pages, the method proceeds to step 716. At step 716, the UVM driver 101 copies the small memory page from one memory unit to another.
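The branching of method 700 across steps 704-716 can be summarized in a small decision routine. The enum of outcomes and the boolean predicates are invented for the sketch; the predicates stand in for the driver's usage-history heuristics, such as whether a large page is strongly contended or whether a group of small pages is cold enough to coalesce.

```c
#include <assert.h>
#include <stdbool.h>

/* Possible outcomes of method 700, one per leaf step in Figure 7. */
enum migrate_action {
    SPLIT_AND_COPY_SMALL,   /* step 708 */
    COPY_LARGE,             /* step 710 */
    COPY_WITH_SIBLINGS,     /* step 714 */
    COPY_SMALL,             /* step 716 */
};

/* Decide how a page should be migrated (steps 704-716). */
enum migrate_action decide_migration(bool is_large_page,
                                     bool should_split,
                                     bool should_coalesce)
{
    if (is_large_page)                               /* step 704 */
        return should_split ? SPLIT_AND_COPY_SMALL   /* 706 -> 708 */
                            : COPY_LARGE;            /* 706 -> 710 */
    return should_coalesce ? COPY_WITH_SIBLINGS      /* 712 -> 714 */
                           : COPY_SMALL;             /* 712 -> 716 */
}
```

Note that the split question is only asked for large pages and the coalesce question only for small pages, mirroring the two disjoint branches of the flow diagram.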
In sum, a method is provided by which memory pages residing in memory units that store memory pages of different sizes can be migrated between those memory units. The UVM driver 101 determines which memory pages are to be migrated. If a memory page is a small memory page, the UVM driver 101 determines whether sibling memory pages should also be migrated. If a memory page is a large memory page, the UVM driver 101 determines whether to split the large page or to migrate the entire page. During migration, the UVM driver 101 blocks access to the memory pages involved in the migration.
One advantage of the disclosed approach is that memory pages of different sizes can be migrated efficiently back and forth between different memory units in a virtual memory architecture. The approach improves the flexibility of the unified virtual memory system by allowing it to work with many different types of memory architectures. Another related advantage is that, by allowing large memory pages to be split into smaller pages and small memory pages to be coalesced into larger pages, memory pages of different sizes can be stored in different memory units configured to store different page sizes. This feature allows the unified virtual memory system to group pages together where possible, in order to reduce the amount of space occupied in page tables and/or translation lookaside buffers. It also allows memory pages to be split apart and migrated to different memory units whenever such splitting would improve memory locality and reduce memory access time.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software, or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer, such as compact disc read-only memory (CD-ROM) disks readable by a CD-ROM drive, flash memory, read-only memory (ROM) chips, or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-disk drive, or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.
The invention has been described above with reference to specific embodiments. Persons of ordinary skill in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Therefore, the scope of the present invention is determined by the claims that follow.
Claims (10)
Applications Claiming Priority (6)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201361785428P | 2013-03-14 | 2013-03-14 | |
| US61/785,428 | 2013-03-14 | ||
| US201361800004P | 2013-03-15 | 2013-03-15 | |
| US61/800,004 | 2013-03-15 | ||
| US14/134,142 | 2013-12-19 | ||
| US14/134,142 US9424201B2 (en) | 2013-03-14 | 2013-12-19 | Migrating pages of different sizes between heterogeneous processors |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN104049905A true CN104049905A (en) | 2014-09-17 |
| CN104049905B CN104049905B (en) | 2018-03-09 |
Family
ID=51418512
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201310752862.5A Active CN104049905B (en) | 2013-03-14 | 2013-12-31 | Various sizes of page is migrated between heterogeneous processor |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN104049905B (en) |
| DE (1) | DE102013021997A1 (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20110055232A1 (en) * | 2009-08-26 | 2011-03-03 | Goetz Graefe | Data restructuring in multi-level memory hierarchies |
| CN102597958A (en) * | 2009-11-16 | 2012-07-18 | 国际商业机器公司 | Symmetric live migration of virtual machines |
Worldwide Applications (2013)
- 2013-12-30: DE application DE102013021997.3A, published as DE102013021997A1 (active, pending)
- 2013-12-31: CN application CN201310752862.5A, granted as CN104049905B (active)
Non-Patent Citations (1)
| Title |
|---|
| Talluri et al.: "Tradeoffs in Supporting Two Page Sizes", Proceedings of the 19th Annual International Symposium on Computer Architecture * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN106528436A (en) * | 2015-09-15 | 2017-03-22 | 慧荣科技股份有限公司 | Data storage device and data maintenance method thereof |
| US10528263B2 (en) | 2015-09-15 | 2020-01-07 | Silicon Motion, Inc. | Data storage device and data maintenance method thereof |
| CN106528436B (en) * | 2015-09-15 | 2020-03-10 | 慧荣科技股份有限公司 | Data storage device and data maintenance method thereof |
Also Published As
| Publication number | Publication date |
|---|---|
| DE102013021997A1 (en) | 2014-09-18 |
| CN104049905B (en) | 2018-03-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10133677B2 (en) | Opportunistic migration of memory pages in a unified virtual memory system | |
| US9798487B2 (en) | Migrating pages of different sizes between heterogeneous processors | |
| US10303616B2 (en) | Migration scheme for unified virtual memory system | |
| US9830210B2 (en) | CPU-to-GPU and GPU-to-GPU atomics | |
| US9792220B2 (en) | Microcontroller for memory management unit | |
| US9430400B2 (en) | Migration directives in a unified virtual memory system architecture | |
| US9575892B2 (en) | Replaying memory transactions while resolving memory access faults | |
| US10216413B2 (en) | Migration of peer-mapped memory pages | |
| US11741015B2 (en) | Fault buffer for tracking page faults in unified virtual memory system | |
| US10114758B2 (en) | Techniques for supporting for demand paging | |
| TWI515564B (en) | Page state directory for managing unified virtual memory | |
| CN104049904B (en) | System and method for managing a page state directory for unified virtual memory | |
| CN104049951A (en) | Replaying memory transactions while resolving memory access faults | |
| CN104049905B (en) | Migrating pages of different sizes between heterogeneous processors | |
| CN104049903A (en) | Migration scheme for unified virtual memory system |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||