CN104049905A - Migrating pages of different sizes between heterogeneous processors - Google Patents


Info

Publication number: CN104049905A
Authority: CN (China)
Prior art keywords: page, storage, PPU, memory
Legal status: Granted
Application number: CN201310752862.5A
Other languages: Chinese (zh)
Other versions: CN104049905B (en)
Inventors
Jerome F. Duluk, Jr.
Cameron Buschardt
James Leroy Deming
Lucien Dunning
Brian Fahs
Mark Hairgrove
Chenghuan Jia
John Mashey
James M. Van Dyke
Current Assignee: Nvidia Corp
Original Assignee: Nvidia Corp
Priority claimed from US 14/134,142 (US9424201B2)
Application filed by Nvidia Corp
Publication of CN104049905A
Application granted
Publication of CN104049905B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; Relocation
    • G06F 12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/10: Address translation
    • G06F 12/1027: Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
    • G06F 2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/65: Details of virtual memory and virtual address translation
    • G06F 2212/652: Page size control

Abstract

One embodiment of the present invention sets forth a computer-implemented method for migrating a memory page from a first memory to a second memory. The method includes determining a first page size supported by the first memory. The method also includes determining a second page size supported by the second memory. The method further includes determining a use history of the memory page based on an entry in a page state directory associated with the memory page. The method also includes migrating the memory page between the first memory and the second memory based on the first page size, the second page size, and the use history.

Description

Migrating Pages of Different Sizes Between Heterogeneous Processors
Technical field
The present invention relates generally to computer science and, more specifically, to migrating pages of different sizes between heterogeneous processors.
Background
A typical computer system includes a central processing unit (CPU) and one or more graphics processing units (GPUs). Some advanced computer systems implement a unified virtual memory architecture that is shared by the CPU and the GPUs. Among other things, this architecture enables the CPU and the GPUs to access a physical memory location using a shared (e.g., the same) virtual memory address, regardless of whether that physical memory location resides in system memory or in memory local to a GPU.
In such a unified virtual memory architecture, memory pages may advantageously be stored with different sizes depending on the memory unit, associated with either the CPU or the GPU, in which they reside. One drawback of having memory pages of different sizes is that migrating memory pages between those different memory units becomes more complicated. For example, a difficulty arises when a large memory page must be migrated to a memory unit that stores only small memory pages. In that case, the unified virtual memory architecture must determine how to accommodate the difference in page sizes.
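To make the size-mismatch problem concrete, the sketch below splits one large page into the set of small pages that cover the same virtual range. This is an illustrative toy, not the patented mechanism; the page sizes and the `split_page` helper are assumptions made for the example.

```python
# Hypothetical sketch: splitting a large page when the destination memory
# only supports a smaller page size. Sizes below are illustrative.

LARGE_PAGE_SIZE = 64 * 1024   # e.g., a 64 KB GPU-side page (assumed)
SMALL_PAGE_SIZE = 4 * 1024    # e.g., a 4 KB CPU-side page (assumed)

def split_page(virtual_base, src_size, dst_size):
    """Return (virtual_base, size) tuples for the destination pages that
    together cover one source page."""
    if src_size % dst_size != 0:
        raise ValueError("source page size must be a multiple of destination size")
    return [(virtual_base + off, dst_size) for off in range(0, src_size, dst_size)]

pieces = split_page(0x10000, LARGE_PAGE_SIZE, SMALL_PAGE_SIZE)
print(len(pieces))         # 16 small pages cover one large page
print(hex(pieces[1][0]))   # 0x11000
```

The reverse operation, coalescing, would merge sixteen contiguous small pages back into one large page, which is what lets a system reduce page-table and TLB footprint when the pages return to a large-page memory.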
As the foregoing illustrates, what is needed in the art is a more effective approach to migrating memory pages of different sizes in systems that implement a unified virtual memory architecture.
Summary of the invention
One embodiment of the present invention sets forth a computer-implemented method for migrating a memory page from a first memory to a second memory. The method includes determining a first page size supported by the first memory. The method also includes determining a second page size supported by the second memory. The method further includes determining a use history of the memory page based on an entry in a page state directory associated with the memory page. The method also includes migrating the memory page between the first memory and the second memory based on the first page size, the second page size, and the use history.
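The claimed steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the `migrate` function, the `psd` dictionary layout, and the action strings are all invented for the example.

```python
# Illustrative sketch of the claimed method: choose a migration strategy from
# the page sizes of source and destination and the page's use history, read
# from a page state directory (PSD). All names here are assumptions.

def migrate(page, src_page_size, dst_page_size, psd):
    history = psd[page]["use_history"]          # read use history from the PSD entry
    if dst_page_size < src_page_size:
        action = "split-and-migrate"            # destination holds only smaller pages
    elif dst_page_size > src_page_size:
        action = "migrate-and-maybe-coalesce"   # destination prefers larger pages
    else:
        action = "migrate"                      # sizes match: plain migration
    return action if history["recently_used"] else "defer"

psd = {0x1000: {"use_history": {"recently_used": True}}}
print(migrate(0x1000, 64 * 1024, 4 * 1024, psd))  # split-and-migrate
```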
One advantage of the disclosed approach is that memory pages of different sizes can be migrated efficiently back and forth between the different memory units in a virtual memory architecture. The approach improves the flexibility of a unified virtual memory system by allowing the system to work with many different types of memory architectures. A related advantage is that, by allowing large memory pages to be split into smaller memory pages and smaller memory pages to be coalesced into larger memory pages, memory pages of different sizes can be stored in memory units configured to store different memory page sizes. This feature allows the unified virtual memory system to group pages together where possible, reducing the amount of space occupied in page tables and/or translation look-aside buffers (TLBs). The feature also allows memory pages to be split apart and migrated to different memory units whenever such splitting would improve memory locality and reduce memory access times.
Brief Description of the Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to exemplary embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
FIG. 1 is a block diagram of a computer system configured to implement one or more aspects of the present invention;
FIG. 2 is a block diagram illustrating a unified virtual memory (UVM) system, according to one embodiment of the present invention;
FIG. 3 illustrates operations for transferring a small memory page from system memory to PPU memory, according to one embodiment of the present invention;
FIG. 4 illustrates operations for transferring a small memory page and an associated "peer" memory page from system memory to PPU memory, according to one embodiment of the present invention;
FIG. 5 illustrates operations for transferring a small memory page from PPU memory to system memory, according to one embodiment of the present invention;
FIG. 6 illustrates operations for transferring a small memory page and a peer memory page from PPU memory 204 to system memory 104, according to one embodiment of the present invention; and
FIG. 7 is a flow diagram of method steps for migrating memory pages of different sizes between the memory units of a virtual memory architecture, according to one embodiment of the present invention.
Detailed Description
In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details.
System Overview
FIG. 1 is a block diagram of a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via an interconnection path that may include a memory bridge 105. Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via communication path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or second communication path 113 (e.g., a Peripheral Component Interconnect (PCI) Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110, which may be any conventional cathode ray tube, liquid crystal display, light-emitting diode display, or the like. A system disk 114 is also connected to I/O bridge 107 and may be configured to store applications, data, and content for use by CPU 102 and parallel processing subsystem 112. System disk 114 provides non-volatile storage for applications and data and may include fixed or removable hard disk drives, flash memory devices, and CD-ROM (compact disc read-only memory), DVD-ROM (digital versatile disc ROM), Blu-ray, HD-DVD (high-definition DVD), or other magnetic, optical, or solid-state storage devices.
A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including universal serial bus (USB) or other port connections, compact disc (CD) drives, digital versatile disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107. The various communication paths shown in FIG. 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols as is known in the art.
In one embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes one or more parallel processing units (PPUs) 202. In another embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for general-purpose processing while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC). As is well known, many graphics processing units (GPUs) are designed to perform parallel operations and computations, and thus are considered a class of parallel processing unit (PPU).
Any number of PPUs 202 can be included in a parallel processing subsystem 112. For instance, multiple PPUs 202 can be provided on a single add-in card, multiple add-in cards can be connected to communication path 113, or one or more PPUs 202 can be integrated into a bridge chip. PPUs 202 in a multi-PPU system may be identical to or different from one another. For instance, different PPUs 202 might have different numbers of processing cores, different amounts of local parallel processing memory, and so on. Where multiple PPUs 202 are present, those PPUs may be operated in parallel to process data at a higher throughput than is possible with a single PPU 202. Systems incorporating one or more PPUs 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.
PPU 202 advantageously implements a highly parallel processing architecture. PPU 202 includes a number of general processing clusters (GPCs). Each GPC is capable of executing a large number (e.g., hundreds or thousands) of threads concurrently, where each thread is an instance of a program. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each of the GPCs 208. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program.
A GPC includes a number of streaming multiprocessors (SMs), where each SM is configured to process one or more thread groups. A series of instructions transmitted to a particular GPC 208 constitutes a thread, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with one thread of the group assigned to a different processing engine within an SM. Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within an SM. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA") or "thread array."
In embodiments of the present invention, it is desirable to use PPU 202 or other processor(s) of a computing system to execute general-purpose computations using thread arrays. Each thread in a thread array is assigned a unique thread identifier ("thread ID") that is accessible to the thread during the thread's execution. The thread ID, which can be defined as a one-dimensional or multi-dimensional numerical value, controls various aspects of the thread's processing behavior. For instance, a thread ID may be used to determine which portion of an input data set a thread is to process and/or to determine which portion of an output data set a thread is to produce or write.
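The thread-ID-based data partitioning described above can be illustrated with a serial stand-in: each "thread" uses its ID to select a contiguous slice of the input. This is a hedged analogy in plain Python, not GPU code; the function name and the even-split scheme are assumptions for the example.

```python
# Serial illustration of thread-ID-based work partitioning: thread i of n
# processes the i-th slice of the input data set (here, doubling each value).

def process_chunk(thread_id, num_threads, data):
    # Each "thread" handles a contiguous slice chosen by its thread ID.
    chunk = len(data) // num_threads
    start = thread_id * chunk
    return [x * 2 for x in data[start:start + chunk]]

data = list(range(8))
out = [process_chunk(tid, 4, data) for tid in range(4)]
print(out)  # [[0, 2], [4, 6], [8, 10], [12, 14]]
```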
In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating operations of other system components. In particular, CPU 102 issues commands that control the operation of PPUs 202. In one embodiment, communication path 113 is a PCI Express link in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. PPU 202 advantageously implements a highly parallel processing architecture, and may be provided with any amount of local parallel processing memory (PPU memory).
In some embodiments, system memory 104 includes a unified virtual memory (UVM) driver 101. The UVM driver 101 includes instructions for performing various tasks related to management of a unified virtual memory (UVM) system common to both the CPU 102 and the PPUs 202. Among other things, the architecture enables the CPU 102 and the PPU 202 to access a physical memory location via a common virtual memory address, regardless of whether the physical memory location is within the system memory 104 or memory local to the PPU 202 (PPU memory).
It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 might be integrated into a single chip instead of existing as one or more discrete devices. Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.
Unified Virtual Memory System Architecture
FIG. 2 is a block diagram illustrating a unified virtual memory (UVM) system 200, according to one embodiment of the present invention. As shown, the unified virtual memory system 200 includes, without limitation, a CPU 102, a system memory 104, and a parallel processing unit (PPU) 202 coupled to a parallel processing unit memory (PPU memory) 204. The CPU 102 and the system memory 104 are coupled to each other and to the PPU 202 via a memory bridge 105.
The CPU 102 executes threads that may request data stored in the system memory 104 or the PPU memory 204 via a virtual memory address. Virtual memory addresses shield threads executing in the CPU 102 from knowledge of the internal workings of the memory system. Thus, a thread may only have knowledge of virtual memory addresses, and may access data by issuing requests via a virtual memory address.
The CPU 102 includes a CPU MMU 209, which processes requests from the CPU 102 for translating virtual memory addresses to physical memory addresses. Physical memory addresses are required to access data stored in a physical memory unit such as the system memory 104 or the PPU memory 204. The CPU 102 includes a CPU fault handler 211, which executes steps in response to the CPU MMU 209 generating a page fault, to make the requested data available to the CPU 102. The CPU fault handler 211 is generally software that resides in the system memory 104 and executes on the CPU 102, the software being invoked by an interrupt to the CPU 102.
The system memory 104 stores various memory pages (not shown) that include data for use by threads executing on the CPU 102 or the PPU 202. As shown, the system memory 104 stores a CPU page table 206, which includes mappings between virtual memory addresses and physical memory addresses. The system memory 104 also stores a page state directory 210, which acts as a "master page table" for the UVM system 200, as is discussed in greater detail below. The system memory 104 stores a fault buffer 216, which includes entries written by the PPU 202 in order to notify the CPU 102 of a page fault generated by the PPU 202. In some embodiments, the system memory 104 includes the unified virtual memory (UVM) driver 101, which includes instructions that, when executed, cause the CPU 102 to execute commands for, among other things, remedying a page fault. In alternative embodiments, any combination of the page state directory 210 and one or more command queues 214 may be stored in the PPU memory 204. Further, a PPU page table 208 may be stored in the system memory 104.
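The bookkeeping structures just described can be sketched minimally: a page state directory mapping virtual pages to state, and a fault buffer the PPU writes so the CPU can service its faults. The dictionary layouts and field names below are assumptions made for illustration, not the patent's data formats.

```python
# Minimal sketch of the UVM bookkeeping described above: a page state
# directory (PSD) holding per-page state, and a fault buffer the PPU appends
# to in order to notify the CPU of a page fault. All field names assumed.

from collections import deque

page_state_directory = {
    0x4000: {"owner": "cpu", "location": "sysmem", "use_history": []},
}
fault_buffer = deque()

def ppu_report_fault(vaddr, access):
    # The PPU writes a fault entry; the CPU fault handler later drains it.
    fault_buffer.append({"vaddr": vaddr, "access": access})

ppu_report_fault(0x4000, "write")
print(len(fault_buffer))  # 1
```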
In a similar manner as the CPU 102, the PPU 202 executes instructions that may request data stored in the system memory 104 or the PPU memory 204 via a virtual memory address. The PPU 202 includes a PPU MMU 213, which processes requests from the PPU 202 for translating virtual memory addresses to physical memory addresses. The PPU 202 also includes a copy engine 212, which executes commands stored in the command queue 214 for copying memory pages, modifying data in the PPU page table 208, and other commands. A PPU fault handler 215 executes steps in response to a page fault on the PPU 202. The PPU fault handler 215 can be software running on a processor or dedicated microcontroller in the PPU 202. Alternatively, the PPU fault handler 215 can be a combination of software running on the CPU 102 and software running on the dedicated microcontroller in the PPU 202, communicating with each other. In some embodiments, the CPU fault handler 211 and the PPU fault handler 215 can be invoked by the CPU 102 or the PPU 202 in response to a fault on either. The command queue 214 may reside in either the PPU memory 204 or the system memory 104, but is preferentially located in the system memory 104.
In some embodiments, the CPU fault handler 211 and the UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in the system memory 104 and executes on the CPU 102. The PPU fault handler 215 may be a separate software program running on a processor or dedicated microcontroller in the PPU 202, or the PPU fault handler 215 may be a separate software program running on the CPU 102.
In other embodiments, the PPU fault handler 215 and the UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in the system memory 104 and executes on the CPU 102. The CPU fault handler 211 may be a separate software program that resides in the system memory 104 and executes on the CPU 102.
In other embodiments, the CPU fault handler 211, the PPU fault handler 215, and the UVM driver 101 may be a unified software program. In such cases, the unified software program may be software that resides in the system memory 104 and executes on the CPU 102.
In some embodiments, as described above, the CPU fault handler 211, the PPU fault handler 215, and the UVM driver 101 may all reside in the system memory 104. As shown in FIG. 2, the UVM driver 101 resides in the system memory 104, while the CPU fault handler 211 and the PPU fault handler 215 reside in the CPU 102.
The CPU fault handler 211 and the PPU fault handler 215 are responsive to hardware interrupts that may emanate from the CPU 102 or the PPU 202, such as interrupts resulting from a page fault. As further described below, the UVM driver 101 includes instructions for performing various tasks related to management of the UVM system 200, including, without limitation, remedying a page fault and accessing the CPU page table 206, the page state directory 210, and/or the fault buffer 216.
In some embodiments, the CPU page table 206 and the PPU page table 208 have different formats and contain different information; for example, the PPU page table 208 may contain the following, while the CPU page table 206 does not: an atomic disable bit; a compression tag; and a memory swizzling type.
In a similar manner as the system memory 104, the PPU memory 204 stores various memory pages (not shown). As shown, the PPU memory 204 also includes the PPU page table 208, which includes mappings between virtual memory addresses and physical memory addresses. Alternatively, the PPU page table 208 may be stored in the system memory 104.
Translating Virtual Memory Addresses
When a thread executing in the CPU 102 requests data via a virtual memory address, the CPU 102 requests translation of the virtual memory address to a physical memory address from the CPU memory management unit (CPU MMU) 209. In response, the CPU MMU 209 attempts to translate the virtual memory address into a physical memory address, which specifies a location in a memory unit, such as the system memory 104, that stores the data requested by the CPU 102.
To translate a virtual memory address to a physical memory address, the CPU MMU 209 performs a lookup operation to determine whether the CPU page table 206 includes a mapping associated with the virtual memory address. In addition to the virtual memory address, a request to access data may also indicate a virtual memory address space. The unified virtual memory system 200 may implement multiple virtual memory address spaces, each of which is assigned to one or more threads. Virtual memory addresses are unique within any given virtual memory address space. Further, virtual memory addresses within a given virtual memory address space are consistent across the CPU 102 and the PPU 202, thereby allowing the same virtual memory address to refer to the same data across the CPU 102 and the PPU 202. In some embodiments, two virtual memory addresses may refer to the same data, but may not map to the same physical memory address (e.g., the CPU 102 and the PPU 202 may each have a local read-only copy of the data).
For any given virtual memory address, the CPU page table 206 may or may not include a mapping between the virtual memory address and a physical memory address. If the CPU page table 206 includes such a mapping, the CPU MMU 209 reads the mapping to determine the physical memory address associated with the virtual memory address and provides that physical memory address to the CPU 102. However, if the CPU page table 206 does not include a mapping associated with the virtual memory address, the CPU MMU 209 is unable to translate the virtual memory address into a physical memory address, and the CPU MMU 209 generates a page fault. To remedy the page fault and make the requested data available to the CPU 102, a "page fault sequence" is executed. More specifically, the CPU 102 reads the PSD 210 to determine the current mapping state of the page and then determines the appropriate page fault sequence. The page fault sequence generally maps the memory page associated with the requested virtual memory address or changes the types of accesses permitted (e.g., read access, write access, atomic access). The different types of page fault sequences implemented in the UVM system 200 are discussed in greater detail below.
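The translation path just described, lookup, fault on a miss, repair from the PSD, can be walked through in a toy serial model. This is a simplification under stated assumptions: the hardware MMU is modeled as a function, the fault handler runs inline, and all names and addresses are invented.

```python
# Toy model of the translation path: look up the CPU page table; on a miss,
# raise a page fault that the fault handler repairs by consulting the PSD,
# the "master" page table. Serial stand-in for a hardware/software flow.

cpu_page_table = {0x2000: 0x9000}               # virtual page -> physical page
psd = {0x3000: {"phys": 0xA000, "owner": "ppu"}}

class PageFault(Exception):
    pass

def cpu_mmu_translate(vpage):
    if vpage in cpu_page_table:
        return cpu_page_table[vpage]
    raise PageFault(vpage)

def cpu_fault_handler(vpage):
    entry = psd[vpage]                           # read the master mapping
    cpu_page_table[vpage] = entry["phys"]        # repair the CPU page table
    return entry["phys"]

try:
    phys = cpu_mmu_translate(0x3000)             # miss: not in CPU page table
except PageFault as fault:
    phys = cpu_fault_handler(fault.args[0])      # page fault sequence repairs it
print(hex(phys))  # 0xa000
```

After the fault sequence, the mapping is present in the CPU page table, so a repeated access translates without faulting, mirroring how a real page fault sequence leaves the page mapped.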
Within the UVM system 200, data associated with a given virtual memory address may be stored in the system memory 104, in the PPU memory 204, or in both the system memory 104 and the PPU memory 204 as read-only copies of the same data. Further, for any such data, either or both of the CPU page table 206 and the PPU page table 208 may include a mapping associated with that data. Notably, some data exists for which a mapping exists in one page table but not in the other. However, the PSD 210 includes all mappings stored in the PPU page table 208 as well as the PPU-relevant mappings stored in the CPU page table 206. The PSD 210 thus functions as a "master" page table for the unified virtual memory system 200. Therefore, when the CPU MMU 209 does not find a mapping in the CPU page table 206 associated with a particular virtual memory address, the CPU 102 reads the PSD 210 to determine whether the PSD 210 includes a mapping associated with that virtual memory address. Various embodiments of the PSD 210 may include different types of information associated with virtual memory addresses in addition to the mappings themselves.
When the CPU MMU 209 generates a page fault, the CPU fault handler 211 executes a sequence of operations for the appropriate page fault sequence to remedy the page fault. Again, during a page fault sequence, the CPU 102 reads the PSD 210 and executes additional operations in order to change the mappings or permissions within the CPU page table 206 and the PPU page table 208. Such operations may include reading and/or modifying the CPU page table 206, reading and/or modifying page state directory 210 entries, and/or migrating blocks of data referred to as "memory pages" between memory units (e.g., the system memory 104 and the PPU memory 204).
To determine which operations to execute in a page fault sequence, the CPU 102 identifies the memory page associated with the virtual memory address. The CPU 102 then reads state information for that memory page from the PSD 210 entry related to the virtual memory address associated with the memory access request that caused the page fault. Such state information may include, among other things, an ownership state for the memory page associated with the virtual memory address. For any given memory page, several ownership states are possible. For example, a memory page may be "CPU-owned," "PPU-owned," or "CPU-shared." A memory page is considered CPU-owned if the CPU 102 can access the memory page via a virtual address without causing a page fault, and if the PPU 202 cannot access the memory page via a virtual address without causing a page fault. Preferably, a CPU-owned page resides in the system memory 104, but may reside in the PPU memory 204. A memory page is considered PPU-owned if the PPU 202 can access the page via a virtual address without causing a page fault, and if the CPU 102 cannot access the page via a virtual address without causing a page fault. Preferably, a PPU-owned page resides in the PPU memory 204, but may reside in the system memory 104 when migration from the system memory 104 to the PPU memory 204 has not been performed. Finally, a memory page is considered CPU-shared if both the CPU 102 and the PPU 202 can access the page via a virtual address without causing a page fault. A CPU-shared page may reside in either the system memory 104 or the PPU memory 204.
The CPU page table 206 may include a use history for a memory page, and an ownership state is assigned to the memory page based on various factors. The use history may include information regarding whether the CPU 102 or the PPU 202 accessed the memory page recently, and how many such accesses were performed. For example, if based on the use history of a given memory page the UVM system 200 determines that the memory page is likely to be used mostly or only by the CPU 102, then the UVM system 200 may assign an ownership state of "CPU-owned" to the memory page and place the page in the system memory 104. Similarly, if based on the use history of a given memory page the UVM system 200 determines that the memory page is likely to be used mostly or only by the PPU 202, then the UVM system 200 may assign an ownership state of "PPU-owned" to the memory page and place the page in the PPU memory 204. Finally, if based on the use history of a given memory page the UVM system 200 determines that the memory page is likely to be used by both the CPU 102 and the PPU 202, and that migrating the memory page back and forth between the system memory 104 and the PPU memory 204 would consume too much time, then the UVM system 200 may assign an ownership state of "CPU-shared" to the memory page.
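The ownership policy above can be rendered as a small decision function. The thresholds and tie-breaking rule here are invented for illustration, not prescribed by the text; only the three ownership states and their preferred placements come from the description.

```python
# Hypothetical rendering of the ownership policy described above: assign
# "cpu-owned", "ppu-owned", or "cpu-shared" from recent access counts, and
# place the page accordingly. The tie-breaking rule is an assumption.

def assign_ownership(cpu_accesses, ppu_accesses, migration_cost_high=False):
    if ppu_accesses == 0:
        return "cpu-owned", "sysmem"          # used only by the CPU
    if cpu_accesses == 0:
        return "ppu-owned", "ppumem"          # used only by the PPU
    if migration_cost_high:
        return "cpu-shared", "sysmem"         # both use it; migration too costly
    # Both touch it but migration is cheap: favor the heavier user (assumption).
    if ppu_accesses > cpu_accesses:
        return "ppu-owned", "ppumem"
    return "cpu-owned", "sysmem"

print(assign_ownership(10, 0))                             # ('cpu-owned', 'sysmem')
print(assign_ownership(5, 7, migration_cost_high=True))    # ('cpu-shared', 'sysmem')
```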
As example, exception handles 211 and 215 can be implemented following any or all of for the heuristics (heuristics) of moving:
(a) about to mapping to PPU202 and the not CPU102 access of the page that is cancelled mapping (unmap) of migration recently, out of order page is cancelled to mapping from PPU202, this page migration is arrived to CPU102, and this page is mapped to CPU102;
(b) on a PPU 202 access to an unmapped page that is mapped to the CPU 102 and has not been recently migrated, unmap the faulting page from the CPU 102, migrate the page to the PPU 202, and map the page to the PPU 202;
(c) on a CPU 102 access to an unmapped page that is mapped to the PPU 202 and has been recently migrated, migrate the faulting page to the CPU 102 and map the page on both the CPU 102 and the PPU 202;
(d) on a PPU 202 access to an unmapped page that is mapped on the CPU 102 and has been recently migrated, map the page to both the CPU 102 and the PPU 202;
(e) on a PPU 202 atomic access to a page that is mapped to both the CPU 102 and the PPU 202 but not enabled for atomic operations by the PPU 202, unmap the page from the CPU 102, and map the page to the PPU 202 with atomic operations enabled;
(f) on a PPU 202 write access to a page that is mapped on the CPU 102 and the PPU 202 as copy-on-write (COW), copy the page to the PPU 202, thereby making independent copies of the page, map the new page as read-write on the PPU 202, and leave the current page mapped on the CPU 102;
(g) on a PPU 202 read access to a page that is mapped on the CPU 102 and the PPU 202 as zero-fill-on-demand (ZFOD), allocate a page of physical memory on the PPU 202 and fill it with zeros, map that page on the PPU 202, but change it to unmapped on the CPU 102;
(h) on an access by a first PPU 202(1) to an unmapped page that is mapped to a second PPU 202(2) and has not been recently migrated, unmap the faulting page from the second PPU 202(2), migrate the page to the first PPU 202(1), and map the page to the first PPU 202(1); and
(i) on an access by a first PPU 202(1) to an unmapped page that is mapped to a second PPU 202(2) and has been recently migrated, map the faulting page to the first PPU 202(1), and keep the mapping of the page on the second PPU 202(2).
In sum, many heuristic rules are possible, and the scope of the present invention is not limited to these examples.
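For instance, heuristics (a) and (c) differ only in whether the page was recently migrated; that branch might be sketched as follows (hypothetical Python; the action names are invented for illustration):

```python
def cpu_fault_on_ppu_mapped_page(recently_migrated):
    """Heuristic (a) vs. (c): CPU 102 faults on a page mapped to PPU 202."""
    if not recently_migrated:
        # (a): move the page wholesale to the CPU side
        return ["unmap-from-ppu", "migrate-to-cpu", "map-on-cpu"]
    # (c): the page appears to ping-pong, so map it on both processors
    return ["migrate-to-cpu", "map-on-cpu", "map-on-ppu"]
```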
In addition, any migration heuristic can "round up" to include more pages or a larger page size, for example:
(j) on a CPU 102 access to an unmapped page that is mapped to the PPU 202 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from the PPU 202, migrate the pages to the CPU 102, and map the pages to the CPU 102 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page);
(k) on a PPU 202 access to an unmapped page that is mapped to the CPU 102 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from the CPU 102, migrate the pages to the PPU 202, and map the pages to the PPU 202 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page);
(l) on a CPU 102 access to an unmapped page that is mapped to the PPU 202 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from the PPU 202, migrate the pages to the CPU 102, map the pages to the CPU 102, and treat all the migrated pages as one or more larger pages on the CPU 102 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page, and treat the aligned 64 kB region as a 64 kB page);
(m) on a PPU 202 access to an unmapped page that is mapped to the CPU 102 and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from the CPU 102, migrate the pages to the PPU 202, map the pages to the PPU 202, and treat all the migrated pages as one or more larger pages on the PPU 202 (in a more detailed example: for a 4 kB faulting page, migrate the aligned 64 kB region that includes the 4 kB faulting page, and treat the aligned 64 kB region as a 64 kB page);
(n) on an access by a first PPU 202(1) to an unmapped page that is mapped to a second PPU 202(2) and has not been recently migrated, unmap the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, from the second PPU 202(2), migrate the pages to the first PPU 202(1), and map the pages to the first PPU 202(1); and
(o) on an access by a first PPU 202(1) to an unmapped page that is mapped to a second PPU 202(2) and has been recently migrated, map the faulting page, plus additional pages that are adjacent to the faulting page in the virtual address space, to the first PPU 202(1), and keep the mappings of those pages on the second PPU 202(2).
In sum, many heuristic rules are possible, and the scope of the present invention is not limited to these examples.
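The "rounding up" in heuristics (j)-(o) amounts to address arithmetic on the aligned region that contains the faulting page. A sketch, assuming the 4 kB/64 kB sizes used in the detailed examples above (hypothetical Python):

```python
SMALL_PAGE = 4 * 1024       # 4 kB faulting-page size
LARGE_REGION = 64 * 1024    # aligned 64 kB migration region

def aligned_region(fault_addr, region_size=LARGE_REGION):
    """Return (start, end) of the aligned region containing fault_addr."""
    start = fault_addr & ~(region_size - 1)  # clear the low-order bits
    return start, start + region_size

def pages_to_migrate(fault_addr, page_size=SMALL_PAGE,
                     region_size=LARGE_REGION):
    """Base addresses of every small page 'rounded up' with the fault."""
    start, end = aligned_region(fault_addr, region_size)
    return list(range(start, end, page_size))
```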
In some embodiments, PSD entries may include transitional state information to ensure proper synchronization between various requests made by units within the CPU 102 and the PPU 202. For example, a PSD 210 entry may include transitional state information indicating that a particular page is in the process of transitioning from CPU-owned to PPU-owned. Various units in the CPU 102 and the PPU 202, such as the CPU fault handler 211 and the PPU fault handler 215, upon determining that a page is in such a transitional state, may forego portions of a page fault sequence to avoid repeating steps of a page fault sequence that was already triggered by a previous virtual memory access to the same virtual memory address. As a specific example, if a page fault results in a page being migrated from the system memory 104 to the PPU memory 204, a different page fault that would cause the same migration is detected and does not cause another migration of the page. Further, various units in the CPU 102 and the PPU 202 may implement atomic operations for proper ordering of operations on the PSD 210. For example, when modifying PSD 210 entries, the CPU fault handler 211 or the PPU fault handler 215 may issue an atomic compare-and-swap operation to change the page state of a particular entry in the PSD 210. Consequently, the modification can be done without interference from operations of other units.
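The compare-and-swap discipline on a PSD entry might be sketched as follows (a toy Python model; a real implementation would use a hardware atomic instruction rather than a lock, and the state names are illustrative):

```python
import threading

class PsdEntry:
    """Toy PSD entry whose state changes only via compare-and-swap."""
    def __init__(self, state):
        self.state = state
        self._lock = threading.Lock()  # stands in for a hardware atomic

    def compare_and_swap(self, expected, new):
        """Atomically set state to `new` iff it still equals `expected`."""
        with self._lock:
            if self.state == expected:
                self.state = new
                return True
            return False  # another unit changed the entry first

entry = PsdEntry("CPU-owned")
# The first fault handler wins the transition...
won = entry.compare_and_swap("CPU-owned", "transitioning-to-CPU-shared")
# ...a second handler seeing the stale state loses and can forego its
# own page fault sequence.
lost = entry.compare_and_swap("CPU-owned", "transitioning-to-CPU-shared")
```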
Multiple PSDs 210 may be stored in the system memory 104, one for each virtual address space. A memory access request generated by either the CPU 102 or the PPU 202 may therefore include a virtual memory address and also identify the virtual address space associated with that virtual memory address.
Just as the CPU 102 may execute memory access requests that include virtual memory addresses (i.e., instructions that include requests to access data via a virtual memory address), the PPU 202 may also execute similar types of memory access requests. More specifically, the PPU 202 includes a plurality of execution units, such as the GPCs and SMs described above in conjunction with Figure 1, that are configured to execute multiple threads and thread groups. In operation, those threads may request data from memory (e.g., the system memory 104 or the PPU memory 204) by specifying virtual memory addresses. Just as with the CPU 102 and the CPU MMU 209, the PPU 202 includes a PPU memory management unit (MMU) 213. The PPU MMU 213 receives requests for translation of virtual memory addresses from the PPU 202, and attempts to provide translations from the PPU page table 208 for the virtual memory addresses.
Similar to the CPU page table 206, the PPU page table 208 includes mappings between virtual memory addresses and physical memory addresses. As is also the case with the CPU page table 206, for any given virtual address, the PPU page table 208 may not include a page table entry that maps the virtual memory address to a physical memory address. As with the CPU MMU 209, when the PPU MMU 213 requests a translation for a virtual memory address from the PPU page table 208, and either no mapping exists in the PPU page table 208 or the type of access is not allowed by the PPU page table 208, the PPU MMU 213 generates a page fault. Subsequently, the PPU fault handler 215 triggers a page fault sequence. The different types of page fault sequences implemented in the UVM system 200 are described in greater detail below.
During a page fault sequence, the CPU 102 or the PPU 202 may write commands into the command queue 214 for execution by the copy engine 212. Such an approach frees up the CPU 102 or the PPU 202 to execute other tasks while the copy engine 212 reads and executes the commands stored in the command queue 214, and allows all the commands for a fault sequence to be queued at one time, thereby avoiding the need to monitor the progress of the fault sequence. Commands executed by the copy engine 212 may include, among other things, deleting, creating, or modifying page table entries in the PPU page table 208, reading or writing data from the system memory 104, and reading or writing data to the PPU memory 204.
The fault buffer 216 stores fault buffer entries that indicate information related to page faults generated by the PPU 202. Fault buffer entries may include, for example, the type of access that was attempted (e.g., read, write, or atomic), the virtual memory address for which the attempted access caused a page fault, the virtual address space, and an indication of the unit or thread that caused the page fault. In operation, when the PPU 202 causes a page fault, the PPU 202 may write a fault buffer entry into the fault buffer 216 to inform the PPU fault handler 215 about the faulting page and the type of access that caused the fault. The PPU fault handler 215 then performs actions to remedy the page fault. Because the PPU 202 executes many threads, the fault buffer 216 can store multiple faults, where each thread can cause one or more faults due to the pipelined nature of the memory accesses of the PPU 202.
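The fields listed above for a fault buffer entry might be modeled as follows (hypothetical Python; the field names are invented for illustration):

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class FaultBufferEntry:
    """One record in the fault buffer, per the fields listed above."""
    access_type: str      # "read", "write", or "atomic"
    virtual_address: int  # address whose attempted access faulted
    address_space: int    # identifies which PSD / page tables apply
    client_id: int        # unit or thread that caused the fault

fault_buffer = deque()    # the PPU appends; the fault handler drains

# A faulting PPU thread records its fault, which the handler later reads.
fault_buffer.append(FaultBufferEntry("write", 0x7F001000, 0, 42))
entry = fault_buffer.popleft()
```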
Page fault sequence
As stated above, in response to receiving a request for translation of a virtual memory address, the CPU MMU 209 generates a page fault if the CPU page table 206 does not include a mapping associated with the requested virtual memory address or does not permit the type of access being requested. Similarly, in response to receiving a request for translation of a virtual memory address, the PPU MMU 213 generates a page fault if the PPU page table 208 does not include a mapping associated with the requested virtual memory address or does not permit the type of access being requested. When the CPU MMU 209 or the PPU MMU 213 generates a page fault, the thread that requested the data at the virtual memory address stalls, and a "local fault handler" (the CPU fault handler 211 for the CPU 102, or the PPU fault handler 215 for the PPU 202) attempts to remedy the page fault by executing a "page fault sequence." As indicated above, a page fault sequence includes a series of operations that enable the faulting unit (i.e., the unit, either the CPU 102 or the PPU 202, that caused the page fault) to access the data associated with the virtual memory address. After the page fault sequence completes, the thread that requested the data via the virtual memory address resumes execution. In some embodiments, fault recovery is simplified by allowing the fault recovery logic to track faulting memory accesses as opposed to faulting instructions.
The operations executed during a page fault sequence depend on the change in ownership state or change in access permissions, if any, that the memory page associated with the page fault has to undergo. A transition from a current ownership state to a new ownership state, or a change in access permissions, may be part of the page fault sequence. In some instances, migrating the memory page associated with the page fault from the system memory 104 to the PPU memory 204 is also part of the page fault sequence. In other instances, migrating the memory page associated with the page fault from the PPU memory 204 to the system memory 104 is also part of the page fault sequence. Various heuristics, more fully described herein, may be used to configure the UVM system 200 to change memory page ownership states or to migrate memory pages under various sets of operating conditions and patterns. Page fault sequences for the following four ownership state transitions are described in greater detail below: CPU-owned to CPU-shared, CPU-owned to PPU-owned, PPU-owned to CPU-owned, and PPU-owned to CPU-shared.
A fault generated by the PPU 202 may initiate a transition from CPU-owned to CPU-shared. Prior to such a transition, a thread executing in the PPU 202 attempts to access data at a virtual memory address that is not mapped in the PPU page table 208. This access attempt causes a PPU-based page fault, which then causes a fault buffer entry to be written to the fault buffer 216. In response, the PPU fault handler 215 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading the PSD 210, the PPU fault handler 215 determines that the current ownership state for the memory page associated with the virtual memory address is CPU-owned. Based on the current ownership state, as well as on other factors, such as usage characteristics for the memory page or the type of memory access, the PPU fault handler 215 determines that a new ownership state for the page should be CPU-shared.
To change the ownership state, the PPU fault handler 215 writes a new entry into the PPU page table 208 corresponding to the virtual memory address and associating the virtual memory address with the memory page identified via the PSD 210 entry. The PPU fault handler 215 also modifies the PSD 210 entry for that memory page to indicate that the ownership state is CPU-shared. In some embodiments, an entry in a translation look-aside buffer (TLB) in the PPU 202 is invalidated to account for the case where the translation for the invalidated page is cached. At this point, the page fault sequence is complete. The ownership state for the memory page is CPU-shared, meaning that the memory page is accessible to both the CPU 102 and the PPU 202. Both the CPU page table 206 and the PPU page table 208 include entries that associate the virtual memory address with the memory page.
A fault generated by the PPU 202 may initiate a transition from CPU-owned to PPU-owned. Prior to such a transition, an operation executing in the PPU 202 attempts to access data at a virtual memory address that is not mapped in the PPU page table 208. This memory access attempt causes a PPU-based page fault, which then causes a fault buffer entry to be written to the fault buffer 216. In response, the PPU fault handler 215 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading the PSD 210, the PPU fault handler 215 determines that the current ownership state for the memory page associated with the virtual memory address is CPU-owned. Based on the current ownership state, as well as on other factors, such as usage characteristics for the page or the type of memory access, the PPU fault handler 215 determines that a new ownership state for the page should be PPU-owned.
The PPU 202 writes a fault buffer entry into the fault buffer 216 that indicates that the PPU 202 generated a page fault, and indicates the virtual memory address associated with the page fault. The PPU fault handler 215 executing on the CPU 102 reads the fault buffer entry and, in response, the CPU 102 removes the mapping in the CPU page table 206 associated with the virtual memory address that caused the page fault. The CPU 102 may flush caches before and/or after the mapping is removed. The CPU 102 also writes commands into the command queue 214 instructing the PPU 202 to copy the page from the system memory 104 into the PPU memory 204. The copy engine 212 in the PPU 202 reads the commands in the command queue 214 and copies the page from the system memory 104 to the PPU memory 204. The PPU 202 writes a page table entry into the PPU page table 208 corresponding to the virtual memory address and associating the virtual memory address with the newly copied memory page in the PPU memory 204. The writing to the PPU page table 208 may be accomplished via the PPU 202. Alternatively, the CPU 102 may update the PPU page table 208. The PPU fault handler 215 also modifies the PSD 210 for the memory page to indicate that the ownership state is PPU-owned. In some embodiments, entries in TLBs in the PPU 202 or the CPU 102 may be invalidated to account for the case where a translation was cached. At this point, the page fault sequence is complete. The ownership state for the memory page is PPU-owned, meaning that the memory page is accessible only to the PPU 202. Only the PPU page table 208 includes an entry that associates the virtual memory address with the memory page.
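The ordered steps of the CPU-owned to PPU-owned sequence just described can be sketched as follows (a hypothetical Python sketch using plain dicts and lists to stand in for the page tables, the command queue 214, and the PSD 210; not the patent's implementation):

```python
def cpu_owned_to_ppu_owned(vaddr, cpu_page_table, ppu_page_table,
                           command_queue, psd):
    """Sketch of the CPU-owned -> PPU-owned fault sequence steps."""
    cpu_page_table.pop(vaddr, None)           # 1. unmap on the CPU side
    command_queue.append(("copy", vaddr))     # 2. queue copy for copy engine
    ppu_page_table[vaddr] = ("ppu-memory", vaddr)  # 3. map the PPU copy
    psd[vaddr] = "PPU-owned"                  # 4. publish the new ownership

cpu_pt = {0x1000: ("system-memory", 0x1000)}
ppu_pt, queue, psd = {}, [], {0x1000: "CPU-owned"}
cpu_owned_to_ppu_owned(0x1000, cpu_pt, ppu_pt, queue, psd)
```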
A fault generated by the CPU 102 may initiate a transition from PPU-owned to CPU-owned. Prior to such a transition, an operation executing in the CPU 102 attempts to access data at a virtual memory address that is not mapped in the CPU page table 206, which causes a CPU-based page fault. The CPU fault handler 211 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading the PSD 210, the CPU fault handler 211 determines that the current ownership state for the memory page associated with the virtual memory address is PPU-owned. Based on the current ownership state, as well as on other factors, such as usage characteristics for the page or the type of access, the CPU fault handler 211 determines that a new ownership state for the page is CPU-owned.
The CPU fault handler 211 changes the ownership state associated with the memory page to CPU-owned. The CPU fault handler 211 writes a command into the command queue 214 to cause the copy engine 212 to remove the entry from the PPU page table 208 that associates the virtual memory address with the memory page. Various TLB entries may be invalidated. The CPU fault handler 211 also copies the memory page from the PPU memory 204 into the system memory 104, which may be done via the command queue 214 and the copy engine 212. The CPU fault handler 211 writes a page table entry into the CPU page table 206 that associates the virtual memory address with the memory page copied into the system memory 104. The CPU fault handler 211 also updates the PSD 210 to associate the virtual memory address with the newly copied memory page. At this point, the page fault sequence is complete. The ownership state for the memory page is CPU-owned, meaning that the memory page is accessible only to the CPU 102. Only the CPU page table 206 includes an entry that associates the virtual memory address with the memory page.
A fault generated by the CPU 102 may initiate a transition from PPU-owned to CPU-shared. Prior to such a transition, an operation executing in the CPU 102 attempts to access data at a virtual memory address that is not mapped in the CPU page table 206, which causes a CPU-based page fault. The CPU fault handler 211 reads the PSD 210 entry corresponding to the virtual memory address and identifies the memory page associated with the virtual memory address. After reading the PSD 210, the CPU fault handler 211 determines that the current ownership state for the memory page associated with the virtual memory address is PPU-owned. Based on the current ownership state, as well as on other factors, such as usage characteristics for the page, the CPU fault handler 211 determines that a new ownership state for the page is CPU-shared.
The CPU fault handler 211 changes the ownership state associated with the memory page to CPU-shared. The CPU fault handler 211 writes a command into the command queue 214 to cause the copy engine 212 to remove the entry from the PPU page table 208 that associates the virtual memory address with the memory page. Various TLB entries may be invalidated. The CPU fault handler 211 also copies the memory page from the PPU memory 204 into the system memory 104. This copy operation may be done via the command queue 214 and the copy engine 212. The CPU fault handler 211 then writes a command into the command queue 214 to cause the copy engine 212 to change the entry in the PPU page table 208 such that it associates the virtual memory address with the memory page in the system memory 104. The CPU fault handler 211 writes a page table entry into the CPU page table 206 to associate the virtual memory address with the memory page in the system memory 104. The CPU fault handler 211 also updates the PSD 210 to associate the virtual memory address with the memory page in the system memory 104. At this point, the page fault sequence is complete. The ownership state for the page is CPU-shared, and the memory page has been copied into the system memory 104. The page is accessible to the CPU 102 because the CPU page table 206 includes an entry that associates the virtual memory address with the memory page in the system memory 104. The page is also accessible to the PPU 202 because the PPU page table 208 includes an entry that associates the virtual memory address with the memory page in the system memory 104.
Detailed example of a page fault sequence
In this context, a detailed description of a page fault sequence executed by the PPU fault handler 215 for a transition from CPU-owned to CPU-shared is now provided to show how atomic operations and transition states may be used to more effectively manage the sequence. The page fault sequence is triggered by a PPU 202 thread attempting to access a virtual address for which a mapping does not exist in the PPU page table 208. When the thread attempts to access data via the virtual memory address, the PPU 202 (specifically, a user-level thread) requests a translation from the PPU page table 208. A PPU page fault occurs in response because the PPU page table 208 does not include a mapping associated with the requested virtual memory address.
After the page fault occurs, the thread stalls, and the PPU fault handler 215 executes a page fault sequence. The PPU fault handler 215 reads the PSD 210 to determine which memory page is associated with the virtual memory address and to determine the state for the virtual memory address. The PPU fault handler 215 determines from the PSD 210 that the ownership state for the memory page is CPU-owned. Consequently, the data requested by the PPU 202 is inaccessible to the PPU 202 via a virtual memory address. The state information for the memory page also indicates that the requested data cannot be migrated to the PPU memory 204.
Based on the state information obtained from the PSD 210, the PPU fault handler 215 determines that a new state for the memory page should be CPU-shared. The PPU fault handler 215 changes the state to "transitioning to CPU-shared." This state indicates that the page is currently in the process of being transitioned to CPU-shared. When the PPU fault handler 215 runs on a microcontroller in the memory management unit, two processors will update the PSD 210 asynchronously, using an atomic compare-and-swap ("CAS") operation on the PSD 210 to change the state to "transitioning to GPU visible" (CPU-shared).
The PPU 202 updates the PPU page table 208 to associate the virtual memory address with the memory page. The PPU 202 also invalidates the TLB cache entries. Next, the PPU 202 performs another atomic compare-and-swap operation on the PSD 210 to change the ownership state associated with the memory page to CPU-shared. Finally, the page fault sequence ends, and the thread that requested the data via the virtual memory address resumes execution.
UVM system architecture variations
Various modifications to the unified virtual memory system 200 are possible. For example, in some embodiments, upon writing a fault buffer entry into the fault buffer 216, the PPU 202 may trigger a CPU interrupt to cause the CPU 102 to read fault buffer entries in the fault buffer 216 and perform whatever operations are appropriate in response to the fault buffer entries. In other embodiments, the CPU 102 may periodically poll the fault buffer 216. In the event that the CPU 102 finds a fault buffer entry in the fault buffer 216, the CPU 102 executes a series of operations in response to the fault buffer entry.
In some embodiments, the system memory 104, rather than the PPU memory 204, stores the PPU page table 208. In other embodiments, a single- or multiple-level cache hierarchy, such as a single- or multiple-level translation look-aside buffer (TLB) hierarchy (not shown), may be implemented to cache virtual address translations for either the CPU page table 206 or the PPU page table 208.
In yet other embodiments, in the event that a thread executing in the PPU 202 causes a PPU fault (a "faulting thread"), the PPU 202 may take one or more actions. These actions include: stalling the entire PPU 202, stalling the SM executing the faulting thread, stalling the PPU MMU 213, stalling only the faulting thread, or stalling one or more levels of TLBs. In some embodiments, after a PPU page fault occurs, and the page fault sequence has been executed by the unified virtual memory system 200, execution of the faulting thread resumes, and the faulting thread attempts, again, the memory access request that caused the page fault. In some embodiments, stalling at a TLB is done in such a way as to appear as a long-latency memory access to the faulting SM or faulting thread, thereby not requiring the SM to do any special operation for a fault.
Finally, in other alternative embodiments, the UVM driver 101 may include instructions that cause the CPU 102 to execute one or more operations for managing the UVM system 200 and remedying a page fault, such as accessing the CPU page table 206, the PSD 210, and/or the fault buffer 216. In other embodiments, an operating system kernel (not shown) may be configured to manage the UVM system 200 and remedy a page fault by accessing the CPU page table 206, the PSD 210, and/or the fault buffer 216. In yet other embodiments, an operating system kernel may operate in conjunction with the UVM driver 101 to manage the UVM system 200 and remedy a page fault by accessing the CPU page table 206, the PSD 210, and/or the fault buffer 216.
Migrating memory pages of different sizes
Memory pages stored in the system memory 104 are permitted to have a different size than memory pages stored in the PPU memory 204. For example, memory pages stored in the system memory 104 may have a size of 4 KB, while memory pages stored in the PPU memory 204 may have a size of 128 KB. As another example, memory pages stored in the system memory 104 may have a size of 4 KB, while memory pages stored in the PPU memory 204 may be a mixture of 4 KB and 128 KB pages. As yet another example, memory pages stored in the system memory 104 may be a mixture of 4 KB and 1 MB pages, while memory pages stored in the PPU memory 204 may be a mixture of 4 KB and 128 KB pages. During a page fault sequence, the UVM system 200 may transmit memory pages from one memory unit to another (e.g., from the PPU memory 204 to the system memory 104). To allow for differences in memory page sizes, the UVM system 200 may split large memory pages apart or combine multiple small memory pages together when transmitting memory pages. The UVM system 200 may also transmit one or more other "sibling" memory pages together with the memory page to be transmitted. In some embodiments, sibling memory pages are memory pages of the smaller size (e.g., memory pages having a 4 KB size in a system that stores 4 KB and 128 KB memory pages) that can be contained within the aligned address span of a larger page. The aligned address span refers to the range of addresses, from beginning to end, of a memory page of the larger size. The smaller pages that lie within such an address span are considered sibling memory pages.
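The sibling relationship is purely an address-range property. A sketch, assuming 4 KB small pages and a 128 KB large-page span as in the example above (hypothetical Python):

```python
SMALL = 4 * 1024     # small page size (e.g., in system memory 104)
LARGE = 128 * 1024   # large page size (e.g., in PPU memory 204)

def sibling_pages(addr, small=SMALL, large=LARGE):
    """Base addresses of every small page inside the aligned address
    span of the large page containing `addr` (i.e., the page at `addr`
    together with all of its siblings)."""
    span_start = addr & ~(large - 1)  # align down to the large-page span
    return list(range(span_start, span_start + large, small))
```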
Several operations associated with splitting a large memory page into multiple smaller memory pages, or combining multiple smaller memory pages into a larger memory page, and transmitting those memory pages between memory units are described below with reference to Figures 3-7. For example, operations that occur in the following situations are described: transmitting a small memory page from the system memory 104 to the PPU memory 204 (Figure 3); transmitting a small memory page and sibling memory pages from the system memory 104 into a large memory page in the PPU memory 204 (Figure 4); splitting a large memory page in the PPU memory 204 into small memory pages and transmitting one of those small memory pages from the PPU memory 204 to the system memory 104 (Figure 5); and transmitting a small memory page and the sibling memory pages of that small memory page from the PPU memory 204 to the system memory 104 (Figure 6).
Figure 3 illustrates operations for transmitting a small memory page from the system memory 104 to the PPU memory 204, according to one embodiment of the present invention. In these operations, the system memory 104 stores small memory pages 302, and the PPU memory 204 stores both small memory pages 304 and large memory pages 306. Pursuant to the operations, the UVM driver 101 determines that a particular small memory page 302(0) in the system memory 104 is to be migrated to the PPU memory 204. Because the PPU memory 204 can store memory pages of either the small size or the large size, the migrated memory page is stored as a small memory page 304(1). The migrated small memory page 304(1) may be stored adjacent to other small memory pages 304 in the PPU memory 204. Large memory pages 306 may also be stored in the PPU memory 204. After the migration, the space occupied by the migrated small memory page in the system memory is deallocated and becomes available for future allocations.
To perform the operations, the UVM driver 101 manipulates the PSD entries for the associated memory pages. Manipulating the PSD entries may include setting PSD entries for the associated memory pages to indicate intermediate and/or locked states. More specifically, the UVM driver 101 sets the PSD entry associated with memory page 302(0) to indicate that memory page 302(0) is in transit and can only be read. Subsequently, the UVM driver 101 sets the PSD entry associated with the target large memory page 306 to indicate that the large memory page 306 is in transit and cannot be accessed. The UVM driver 101 copies the small memory page 302(0) to the target large memory page 306. Subsequently, the UVM driver 101 sets the PSD entries to indicate that the target large memory page 306 is accessible (read and write).
As described above with reference to Figure 2, particular memory pages may be migrated for a variety of reasons. The operations depicted in Figure 3 may be performed when a single small memory page 302 stored in the system memory 104 is needed in the PPU memory 204, but the "sibling" memory pages that surround the single needed memory page are also needed in the system memory 104. This situation may arise when, for example, the single small memory page 302 is frequently accessed by the PPU 202 while one or more of the sibling memory pages are frequently accessed by the CPU 102. In general, the UVM driver 101 may split a memory page group that includes the single small memory page and the sibling memory pages when the memory page group is "strongly contended," meaning that various memory pages within the memory page group are accessed by different processing units, such as the CPU 102 and the PPU 202.
Figure 4 illustrates an operation, according to one embodiment of the invention, for transferring a small memory page and the related "peer" memory pages from the system memory 104 to the PPU memory 204. In this operation, as in the operation described with reference to Fig. 3, the system memory 104 stores small memory pages 402, and the PPU memory 204 stores both small memory pages 404 and large memory pages 406. In the operation shown in Fig. 4, the UVM driver 101 determines that a particular small memory page 402(1), along with peer memory pages 402(0), 402(2), and 402(3), is to migrate from the system memory 104 to the PPU memory 204. The UVM driver 101 causes these memory pages to migrate from the system memory 104 to the PPU memory 204. Because the PPU memory 204 generally operates more efficiently with large memory pages, the UVM driver 101 may coalesce the small memory pages 402 copied from the system memory 104 into one large merged memory page 405 that contains all of the data from the small memory pages.
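The coalescing step can be sketched as copying each peer small page into its slot within one contiguous large-page buffer. The sizes (4 KB small pages, four peers per large page) are assumptions chosen for illustration; the patent does not fix particular page sizes.

```python
SMALL = 4096          # assumed small-page size (4 KB); illustrative only
LARGE = 4 * SMALL     # assumed large-page size built from four peer pages

def coalesce(small_pages):
    """Merge contiguous peer small pages into one large-page buffer."""
    assert len(small_pages) * SMALL == LARGE
    large = bytearray(LARGE)
    for i, page in enumerate(small_pages):
        # Each peer lands at its natural offset within the large page,
        # so virtual-to-physical contiguity is preserved.
        large[i * SMALL:(i + 1) * SMALL] = page
    return bytes(large)
```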
In addition, as described above with reference to Fig. 2 and Fig. 3, a particular memory page may be migrated for a variety of reasons, such as usage history. The operation shown in Fig. 4 may be performed when a particular memory page is needed in the PPU memory 204 and moving the peer memory pages as well is considered advantageous. In one example, such page grouping merges pages based on a least-recently-used (LRU) tracking scheme: small pages that have not been accessed recently are combined.
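An LRU tracking scheme of the kind mentioned above can be sketched with an ordered map in which every access moves a page to the most-recently-used end; the coldest pages are then candidates for coalescing. The class and method names are hypothetical, introduced only to illustrate the idea.

```python
from collections import OrderedDict

class LruTracker:
    """Tracks page accesses; the least-recently-used pages are
    candidates for being coalesced into a large page."""
    def __init__(self):
        self._order = OrderedDict()

    def touch(self, page_id):
        # Move the accessed page to the most-recently-used position.
        self._order.pop(page_id, None)
        self._order[page_id] = True

    def coldest(self, n):
        # Return the n least-recently-used page ids, coldest first.
        return list(self._order)[:n]
```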
As described above, the UVM driver 101 alters PSD entries for the associated memory pages in order to perform the operation. Altering a PSD entry may include setting the PSD entry for the associated memory page to indicate an intermediate and/or locked state. More specifically, the UVM driver 101 sets the PSD entries associated with memory pages 402(0), 402(1), 402(2), and 402(3) to indicate that these memory pages are in transit and are read-only. Next, the UVM driver 101 sets the PSD entry associated with the target large memory page 406 to indicate that the large memory page 406 is in transit and cannot be accessed. The UVM driver 101 then copies these small memory pages to the target large memory page 406. Finally, the UVM driver 101 sets the PSD entries to indicate that the target large memory page 406 is accessible (read and write).
Figure 5 illustrates an operation, according to one embodiment of the invention, for transferring small memory pages from the PPU memory 204 to the system memory 104. In this operation, as in the operations described with reference to Fig. 3 and Fig. 4 above, the system memory 104 stores small memory pages 502, and the PPU memory 204 stores both large memory pages 504 and small memory pages 506. In this operation, the UVM driver 101 determines that a particular portion of a large memory page stored in the PPU memory 204 is to migrate to the system memory 104. The UVM driver 101 causes the large memory page to be broken up into small memory pages, and then causes the small memory pages associated with that portion to migrate to the system memory 104. Again, as described above, a particular memory page may be migrated for a variety of reasons, including usage history.
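The break-up step is the inverse of the coalescing sketched earlier: a large page is sliced into equal small pages, and only the needed slices migrate. The function name and the 4 KB small-page size are illustrative assumptions, not specifics from the patent.

```python
def split(large_page, small_size=4096):
    """Break a large page into equal small pages; only the portion
    that is needed then migrates to system memory."""
    return [bytes(large_page[i:i + small_size])
            for i in range(0, len(large_page), small_size)]
```

Because the slices partition the large page exactly, splitting and re-joining is lossless, which is what lets the UVM system later regroup the pages to reclaim page-table and TLB space.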
As described above, memory pages that are "strongly contended" may be split. In other words, for a particular large memory page, if the small memory pages within that large memory page are frequently accessed by multiple different processing units, for example the CPU 102 or the PPU 202, then that large memory page may be split. This determination is based on an analysis of usage history.
As described above, the UVM driver 101 alters PSD entries for the associated memory pages in order to perform the operation shown in Fig. 5. More specifically, the UVM driver 101 sets the PSD entry associated with the large memory page to be split to indicate that this memory page is in transit and is read-only. Next, the UVM driver 101 sets the PSD entries associated with the target small memory pages 502 to indicate that the small memory pages 502 are in transit and cannot be accessed. The UVM driver 101 then copies the relevant portion of large memory page 506 to the target small memory pages 502. Finally, the UVM driver 101 sets the PSD entries to indicate that the target small memory pages 502 are accessible (read and write).
Figure 6 illustrates an operation, according to one embodiment of the invention, for transferring a small memory page and peer memory pages from the PPU memory 204 to the system memory 104. In this operation, as in the operations described with reference to Figs. 3-5 above, the system memory 104 stores small memory pages 602, and the PPU memory 204 stores both large memory pages 604 and small memory pages 606. In this operation, the UVM driver 101 determines that a particular portion of large memory page 606 is to migrate from the PPU memory 204 to the system memory 104. The UVM driver 101 causes the large memory page to be split into small memory pages 604, and causes small memory page 604(1) and the peer memory pages (small memory pages 604(0), 604(2), and 604(3)) to migrate to the system memory 104. As in Figs. 3-5, a particular memory page such as small memory page 604(1) may migrate from the PPU memory 204 to the system memory 104 for a variety of reasons, as described above with reference to Figs. 1 and 2.
Similarly, as described above, the UVM driver 101 alters PSD entries for the associated memory pages in order to perform the operation. More specifically, the UVM driver 101 sets the PSD entry associated with the large memory page to be split to indicate that this memory page is in transit and is read-only. Next, the UVM driver 101 sets the PSD entries associated with the target small memory pages 602 to indicate that the small memory pages 602 are in transit and cannot be accessed. The UVM driver 101 then copies the large memory page 606, now broken into small memory pages, to the target small memory pages 602. Finally, the UVM driver 101 sets the PSD entries to indicate that the target small memory pages 602 are accessible (read and write).
Figure 7 is a flow diagram of method steps, according to one embodiment of the invention, for migrating memory pages of different sizes between the memory units of a virtual memory architecture. Although the method steps are described in conjunction with Figs. 1-6, persons skilled in the art will understand that any system configured to perform the method steps, in any order, falls within the scope of the present invention.
As shown, a method 700 begins at step 702, where the UVM driver 101 determines that a memory page is to be migrated. At step 704, the UVM driver 101 determines whether the memory page is a large memory page. If the memory page is a large memory page, the method 700 proceeds to step 706. At step 706, the UVM driver 101 determines whether to split the large memory page. If the UVM driver 101 determines that the large memory page should be split, the method proceeds to step 708. At step 708, the UVM driver 101 splits the large memory page and copies the small memory pages from the large memory page from one memory unit to another memory unit. If, at step 706, the UVM driver 101 determines not to split the large memory page, the method proceeds to step 710. At step 710, the UVM driver 101 copies the large memory page from one memory unit to another memory unit.
Returning to step 704, if the UVM driver 101 determines that the memory page is not a large memory page, then the memory page is a small memory page and the method proceeds to step 712. At step 712, the UVM driver 101 determines whether to merge the small memory page with peer memory pages. If the UVM driver 101 determines that the small memory pages should be merged, the method proceeds to step 714. At step 714, the UVM driver 101 copies the small memory page and the peer memory pages from one memory unit to another memory unit. If, at step 712, the UVM driver 101 determines not to merge the memory page with the peer memory pages, the method proceeds to step 716. At step 716, the UVM driver 101 copies the small memory page from one memory unit to another memory unit.
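The branching of method 700 can be sketched as a single decision function. The function and helper names are hypothetical; the step-number comments map each branch back to the flow diagram of Fig. 7.

```python
def split_into_small(page, small=4096):
    # Hypothetical helper: break a large page into small pages (step 708).
    return [page[i:i + small] for i in range(0, len(page), small)]

def migrate(page, is_large, should_split, should_coalesce, peers=()):
    """Sketch of the decision flow of method 700 (steps 704-716)."""
    if is_large:                                               # step 704
        if should_split:                                       # step 706
            return ("split-and-copy", split_into_small(page))  # step 708
        return ("copy-large", [page])                          # step 710
    if should_coalesce:                                        # step 712
        return ("copy-with-peers", [page, *peers])             # step 714
    return ("copy-small", [page])                              # step 716
```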
In sum, a method is provided by which memory pages residing in memory units that store memory pages of different sizes can be migrated between different memory units. The UVM driver 101 determines which memory page is to be migrated. If the memory page is a small memory page, the UVM driver 101 determines whether peer memory pages should also be migrated. If the memory page is a large memory page, the UVM driver 101 determines whether to split that large memory page or to migrate the entire memory page. During the migration, the UVM driver 101 blocks access to the memory pages involved in the migration.
One advantage of the disclosed technique is that memory pages of different sizes can be migrated efficiently back and forth between the different memory units of a virtual memory architecture. The technique improves the flexibility of a unified virtual memory system by allowing the unified virtual memory system to work together with many different types of memory architectures. A related advantage is that, by allowing large memory pages to be split into smaller memory pages and allowing small memory pages to be merged into larger memory pages, memory pages of different sizes can be stored in different memory units configured to store different memory page sizes. This feature allows the unified virtual memory system to group pages back together when possible, in order to reduce the amount of space occupied in page tables and/or translation look-aside buffers. This feature also allows memory pages to be split apart and migrated to different memory units whenever such splitting would improve memory locality and reduce memory access times.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. For example, aspects of the present invention may be implemented in hardware or software or in a combination of hardware and software. One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define the functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer, such as compact disc read-only memory (CD-ROM) disks readable by a CD-ROM drive, flash memory, read-only memory (ROM) chips, or any type of solid-state non-volatile semiconductor memory) on which permanent information is stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or a hard-disk drive, or any type of solid-state random-access semiconductor memory) on which alterable information is stored. Such computer-readable storage media, when carrying computer-readable instructions that direct the functions of the present invention, are embodiments of the present invention.
The invention has been described above with reference to specific embodiments. Persons skilled in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Therefore, the scope of the present invention is determined by the claims that follow.

Claims (10)

1. A computer-implemented method for migrating a memory page from a first memory to a second memory, the method comprising:
determining a first page size supported by the first memory;
determining a second page size supported by the second memory;
determining a usage history of the memory page based on an entry in a page state directory (PSD) associated with the memory page; and
migrating the memory page between the first memory and the second memory based on the first page size, the second page size, and the usage history.
2. The method of claim 1, wherein the first page size is smaller than the second page size, and migrating the memory page comprises transferring the memory page from a system memory to a memory local to a parallel processing unit (PPU).
3. The method of claim 2, further comprising transferring at least one peer memory page from the system memory to the memory local to the PPU, wherein the at least one peer memory page is to be combined with the memory page to generate at least a portion of a larger memory page in the memory local to the PPU.
4. The method of claim 1, wherein the first page size is larger than the second page size, and migrating the memory page comprises transferring a first memory page from a memory local to a parallel processing unit (PPU) to a system memory.
5. The method of claim 4, wherein migrating the memory page further comprises splitting a memory page in the PPU memory into a plurality of smaller memory pages that includes a second memory page, and transferring the second memory page from the memory local to the PPU to the system memory.
6. A computing device for migrating memory pages, the computing device comprising:
a first memory;
a second memory;
a page state directory (PSD); and
a unified virtual memory (UVM) driver configured to:
determine a first page size supported by the first memory;
determine a second page size supported by the second memory;
determine a usage history of the memory page based on an entry in the page state directory associated with the memory page; and
migrate the memory page between the first memory and the second memory based on the first page size, the second page size, and the usage history.
7. The computing device of claim 6, wherein the first page size is smaller than the second page size, and migrating the memory page comprises transferring the memory page from a system memory to a memory local to a parallel processing unit (PPU).
8. The computing device of claim 7, wherein the UVM driver is further configured to transfer at least one peer memory page from the system memory to the memory local to the PPU, wherein the at least one peer memory page is to be combined with the memory page to generate at least a portion of a larger memory page in the memory local to the PPU.
9. The computing device of claim 6, wherein the first page size is larger than the second page size, and migrating the memory page comprises transferring a first memory page from a memory local to a parallel processing unit (PPU) to a system memory.
10. The computing device of claim 9, wherein migrating the memory page further comprises splitting a memory page in the PPU memory into a plurality of smaller memory pages that includes a second memory page, and transferring the second memory page from the memory local to the PPU to the system memory.
CN201310752862.5A 2013-03-14 2013-12-31 Migrating pages of different sizes between heterogeneous processors Active CN104049905B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201361785428P 2013-03-14 2013-03-14
US61/785,428 2013-03-14
US201361800004P 2013-03-15 2013-03-15
US61/800,004 2013-03-15
US14/134,142 2013-12-19
US14/134,142 US9424201B2 (en) 2013-03-14 2013-12-19 Migrating pages of different sizes between heterogeneous processors

Publications (2)

Publication Number Publication Date
CN104049905A true CN104049905A (en) 2014-09-17
CN104049905B CN104049905B (en) 2018-03-09

Family

ID=51418512

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310752862.5A Active CN104049905B (en) 2013-03-14 2013-12-31 Migrating pages of different sizes between heterogeneous processors

Country Status (2)

Country Link
CN (1) CN104049905B (en)
DE (1) DE102013021997A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528436A (en) * 2015-09-15 2017-03-22 慧荣科技股份有限公司 Data storage device and data maintenance method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110055232A1 (en) * 2009-08-26 2011-03-03 Goetz Graefe Data restructuring in multi-level memory hierarchies
CN102597958A (en) * 2009-11-16 2012-07-18 国际商业机器公司 Symmetric live migration of virtual machines

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110055232A1 (en) * 2009-08-26 2011-03-03 Goetz Graefe Data restructuring in multi-level memory hierarchies
CN102597958A (en) * 2009-11-16 2012-07-18 国际商业机器公司 Symmetric live migration of virtual machines

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Talluri et al., "Tradeoffs in Supporting Two Page Sizes," Proceedings of the 19th Annual International Symposium on Computer Architecture *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528436A (en) * 2015-09-15 2017-03-22 慧荣科技股份有限公司 Data storage device and data maintenance method thereof
US10528263B2 (en) 2015-09-15 2020-01-07 Silicon Motion, Inc. Data storage device and data maintenance method thereof
CN106528436B (en) * 2015-09-15 2020-03-10 慧荣科技股份有限公司 Data storage device and data maintenance method thereof

Also Published As

Publication number Publication date
CN104049905B (en) 2018-03-09
DE102013021997A1 (en) 2014-09-18

Similar Documents

Publication Publication Date Title
US11487673B2 (en) Fault buffer for tracking page faults in unified virtual memory system
CN102023932B (en) Providing hardware support for shared virtual memory between local and remote physical memory
US9798487B2 (en) Migrating pages of different sizes between heterogeneous processors
CN102460400B (en) Hypervisor-based management of local and remote virtual memory pages
US10133677B2 (en) Opportunistic migration of memory pages in a unified virtual memory system
US10216413B2 (en) Migration of peer-mapped memory pages
US20140267334A1 (en) Cpu-to-gpu and gpu-to-gpu atomics
CN101606130A (en) Enable the method and apparatus of resource allocation identification in the instruction-level of processor system
US10423354B2 (en) Selective data copying between memory modules
US11741015B2 (en) Fault buffer for tracking page faults in unified virtual memory system
TWI515564B (en) Page state directory for managing unified virtual memory
CN104049904B (en) For the system and method for the page status catalogue for managing unified virtual memory
CN103870247A (en) Technique for saving and restoring thread group operating state
CN104049951A (en) Replaying memory transactions while resolving memory access faults
CN104049905A (en) Migrating pages of different sizes between heterogeneous processors
CN104049903A (en) Migration scheme for unified virtual memory system
KR20200088391A (en) Rinsing of cache lines from a common memory page to memory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant