CN100543770C - Dedicated mechanism for page-mapping in a GPU - Google Patents

Dedicated mechanism for page-mapping in a GPU

Info

Publication number
CN100543770C
CN100543770C CNB2007101376430A CN200710137643A
Authority
CN
China
Prior art keywords
address
page table
graphics processor
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB2007101376430A
Other languages
Chinese (zh)
Other versions
CN101118646A (en)
Inventor
Peter C. Tong
Sonny S. Yang
Kevin J. Kranzusch
Gary D. Lorensen
Karmen Wu
Ashish K. Kaul
Colyn S. Case
Stefan A. Gottschalk
Dennis K. Ma
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nvidia Corp
Original Assignee
Nvidia Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nvidia Corp
Publication of CN101118646A
Application granted
Publication of CN100543770C
Legal status: Active

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides circuits, methods, and apparatus that reduce or eliminate the system memory accesses needed to retrieve address translation information. In one example, these accesses are reduced or eliminated by pre-filling a graphics TLB with entries that translate the virtual addresses used by a GPU into the physical addresses used by a system memory. The address translation information is retained by locking or otherwise restricting the entries in the graphics TLB that are needed for display access. This can be done by restricting access to certain locations in the graphics TLB, by storing flags or other identifying information in the graphics TLB, or by other appropriate methods. In another example, a system BIOS allocates a memory space for the GPU, and a base address and an address range for that space are stored. Virtual addresses within the address range are translated by adding the virtual address to the base address.

Description

Dedicated mechanism for page-mapping in a GPU
Cross-reference to related applications
This application claims the benefit of U.S. Provisional Application No. 60/820,952, filed July 31, 2006, by Tong et al., and U.S. Provisional Application No. 60/821,127, filed August 1, 2006, both entitled "DEDICATED MECHANISM FOR PAGE-MAPPING IN A GPU," both of which are incorporated herein by reference.
This application is related to the following commonly owned, co-pending U.S. patent applications: No. 11/253,438, filed October 18, 2005, entitled "Zero Frame Buffer"; and No. 11/077,662, filed March 10, 2005, entitled "Memory Management for Virtual Address Space with Translation Units of Variable Range Size," both of which are incorporated herein by reference.
Technical field
The present invention relates to eliminating or reducing the system memory accesses needed to retrieve the address translation information required for accessing display data in a system memory.
Background
Graphics processing units (GPUs) are included as part of computers, video games, car navigation systems, and other electronic systems in order to generate graphics images on a monitor or other display device. The first GPUs to be developed stored pixel values, that is, the colors actually displayed, in a local memory referred to as a frame buffer.
Since then, the complexity of GPUs, in particular the GPUs designed and developed by NVIDIA Corporation of Santa Clara, California, has increased tremendously. The size and complexity of the data stored in frame buffers have increased as well. This graphics data now includes not only pixel values but also textures, texture descriptors, shader program instructions, and other data and commands. These frame buffers are now often referred to as graphics memories, in recognition of their expanded roles.
Until recently, GPUs communicated with central processing units and other devices in computer systems over an advanced graphics port, or AGP, bus. While faster versions of this bus were developed, it could not supply enough graphics data to the GPU. Accordingly, graphics data was stored in a local memory that was available to the GPU without going through the AGP port. Fortunately, a new bus has been developed, an enhanced version of the peripheral component interconnect (PCI) standard known as PCIE (PCI express). NVIDIA Corporation has significantly improved and refined this bus protocol and the resulting implementations. This in turn has allowed the local memory to be eliminated in favor of system memory accessed via the PCIE bus.
This change in the location of the graphics memory gives rise to various complications. One complication is that the GPU tracks data storage locations using virtual addresses, while the system memory uses physical addresses. To read data from the system memory, the GPU translates its virtual addresses into physical addresses. If this translation takes too much time, the system memory may not be able to provide data to the GPU quickly enough. This is particularly true for pixel or display data, which must be supplied to the GPU in a steady and rapid manner.
If the information needed to translate a virtual address into a physical address is not stored on the GPU, the address translation may take too much time. In particular, if this address translation information is not available on the GPU, a first memory access is needed to retrieve it from the system memory. Only then can the display data or other needed data be read from the system memory in a second memory access. The first memory access is thus serialized ahead of the second memory access, since the second memory access cannot be performed without the address provided by the first. This extra first memory access can take up to 1 usec, greatly slowing the rate at which display data or other needed data can be read.
Accordingly, what is needed are circuits, methods, and apparatus that eliminate or reduce these additional memory accesses used to retrieve address translation information from the system memory.
Summary of the invention
Accordingly, embodiments of the present invention provide circuits, methods, and apparatus that eliminate or reduce the system memory accesses needed to retrieve the address translation information required for accessing display data in a system memory. Specifically, the address translation information is stored on the graphics processor. This reduces or eliminates the need for separate system memory accesses to retrieve address translation information. Since no additional memory accesses are needed, the processor can translate addresses and read needed display data or other data from the system memory more quickly.
An exemplary embodiment of the present invention eliminates or reduces system memory accesses for address translation information following power-up by pre-filling a cache referred to as a graphics translation lookaside buffer (graphics TLB) with entries that can be used to translate the virtual addresses used by the GPU into the physical addresses used by the system memory. In some embodiments of the present invention, the graphics TLB is pre-filled with the address information needed for display data, while in other embodiments of the present invention it is also pre-filled with addresses for other types of data. This prevents the additional system memory accesses that would otherwise be needed to retrieve the necessary address translation information.
To ensure that the needed address translation information remains on the graphics processor after power-up, the entries in the graphics TLB that are needed for display access are locked or otherwise restricted. This can be done by restricting access to certain locations in the graphics TLB, by storing flags or other identifying information in the graphics TLB, or by other appropriate methods. This prevents the overwriting of data that would otherwise need to be read from the system memory again.
Another exemplary embodiment of the present invention eliminates or reduces memory accesses for address translation information by storing the base address and address range of a large contiguous block of system memory provided by a system BIOS. At power-up, or when another appropriate event occurs, the system BIOS allocates a large block of memory, which may be referred to as a "carveout," to the GPU. The GPU can use this large memory block for display data or other data. The GPU stores the base address and range on-chip, for example in hardware registers.
When a virtual address used by the GPU is to be translated into a physical address, a range check is performed to determine whether the virtual address falls within the range of the carveout. In some embodiments of the present invention, this is simplified by having the base address of the carveout correspond to virtual address zero; the highest virtual address in the carveout then corresponds to the extent of the physical address range. If the address to be translated is within the range of virtual addresses in the carveout, the virtual address can be translated into a physical address by adding the base address to the virtual address. If the address to be translated is not within this range, a graphics TLB or page table can be used to translate the address.
Various embodiments of the present invention may incorporate one or more of these and the other features described herein. A better understanding of the nature and advantages of the present invention may be gained with reference to the following detailed description and the accompanying drawings.
Brief description of the drawings
Fig. 1 is a block diagram of a computing system that is improved by incorporating an embodiment of the present invention;
Fig. 2 is a block diagram of another computing system that is improved by incorporating an embodiment of the present invention;
Fig. 3 is a flowchart illustrating a method of accessing display data stored in a system memory according to an embodiment of the present invention;
Figs. 4A-C illustrate the transfer of commands and data in a computer system during a method of accessing display data according to an embodiment of the present invention;
Fig. 5 is a flowchart illustrating another method of accessing display data in a system memory according to an embodiment of the present invention;
Fig. 6 illustrates the transfer of commands and data in a computer system during a method of accessing display data according to an embodiment of the present invention;
Fig. 7 is a block diagram of a graphics processing unit consistent with an embodiment of the present invention; and
Fig. 8 is a diagram of a graphics card according to an embodiment of the present invention.
Detailed description
Fig. 1 is a block diagram of a computing system that is improved by incorporating an embodiment of the present invention. This block diagram includes a central processing unit (CPU) or host processor 100, a system platform processor (SPP) 110, a system memory 120, a graphics processing unit (GPU) 130, a media communications processor (MCP) 150, networks 160, and internal and peripheral devices 170. A frame buffer, local, or graphics memory 140 is also included, but shown with dashed lines. The dashed lines indicate that while conventional computer systems include this memory, embodiments of the present invention allow its removal. This figure, like the other included figures, is shown for illustrative purposes only and does not limit either the possible embodiments of the present invention or the claims.
The CPU 100 connects to the SPP 110 over a host bus 105. The SPP 110 communicates with the graphics processing unit 130 over a PCIE bus 135. The SPP 110 reads data from and writes data to the system memory 120 over a memory bus 125. The MCP 150 communicates with the SPP 110 via a high-speed connection such as a HyperTransport bus 155, and connects the networks 160 and the internal and peripheral devices 170 to the remainder of the computer system. The graphics processing unit 130 receives data over the PCIE bus 135 and generates graphics and video images for display over a monitor or other display device (not shown). In other embodiments of the present invention, the graphics processing unit is included in an integrated graphics processor (IGP), which is used in place of the SPP 110. In still other embodiments, a general-purpose GPU can be used as the GPU 130.
The CPU 100 may be a processor such as those manufactured by Intel Corporation or other suppliers, which are well known by those skilled in the art. The SPP 110 and MCP 150 are commonly referred to as a chipset. The system memory 120 is typically a number of dynamic random access memory devices arranged in a number of dual in-line memory modules (DIMMs). The graphics processing unit 130, SPP 110, MCP 150, and IGP (if used) are preferably manufactured by NVIDIA Corporation.
The graphics processing unit 130 may be located on a graphics card, while the CPU 100, system platform processor 110, system memory 120, and media communications processor 150 may be located on a computer system motherboard. A graphics card including the graphics processing unit 130 is typically a printed circuit board with the graphics processing unit attached. The printed circuit board typically includes a connector, for example a PCIE connector, that is also attached to the printed circuit board and fits into a PCIE slot included on the motherboard. In other embodiments of the present invention, the graphics processor is included on the motherboard, or is included in an IGP.
A computer system, such as the illustrated computer system, may include more than one GPU 130. Additionally, each of these graphics processing units may be located on a separate graphics card. Two or more of these graphics cards may be joined together by a jumper or other connection. NVIDIA Corporation has developed one such pioneering technology, SLI™. In other embodiments of the present invention, one or more GPUs may be located on one or more graphics cards, while one or more other GPUs are located on the motherboard.
In previously developed computer systems, the GPU 130 communicated with the system platform processor 110 or another device, such as a Northbridge, via an AGP bus. Unfortunately, the AGP bus could not supply the needed data to the GPU 130 at the required rate. Accordingly, a frame buffer 140 was provided for the GPU's use. This memory allowed the GPU to access needed data without the data having to traverse the AGP bottleneck.
Faster data transfer protocols, such as PCIE and HyperTransport, have now become available. Notably, an improved PCIE interface has been developed by NVIDIA Corporation. As a result, the bandwidth from the GPU 130 to the system memory 120 has increased greatly. Accordingly, embodiments of the present invention provide for and allow the removal of the frame buffer 140. Examples of other methods and circuits that can be used to remove the frame buffer can be found in co-pending and commonly owned U.S. patent application No. 11/253,438, filed October 18, 2005, entitled "Zero Frame Buffer," which is incorporated herein by reference.
The removal of the frame buffer allowed by embodiments of the present invention provides savings that include not only the removal of the DRAMs themselves but additional savings as well. For example, voltage regulators are typically used to control the power supplied to the memories, and capacitors are used to provide power-supply filtering. Removing the DRAMs, regulators, and capacitors provides a cost savings that reduces the bill of materials (BOM) for the graphics card. Further, board layout is simplified, board space is reduced, and graphics card testing is simplified. These factors reduce research and design as well as other engineering and test costs, thereby increasing the gross margins of graphics cards incorporating embodiments of the present invention.
While embodiments of the present invention are well suited to improving the performance of zero-frame-buffer graphics processors, other graphics processors, including those with limited or on-chip memories or limited local memories, can also be improved by incorporating embodiments of the present invention. Also, while this embodiment provides a particular type of computer system that can be improved by incorporating an embodiment of the present invention, other types of electronic or computer systems can be improved as well. For example, video and other game systems, navigation devices, set-top boxes, pinball machines, and other types of systems can be improved by incorporating embodiments of the present invention.
Also, while the types of computer systems and other electronic systems described herein are currently commonplace, other types of computers and electronic systems are currently being developed, and others will be developed in the future. It is expected that many of these systems can also be improved by incorporating embodiments of the present invention. Accordingly, the specific examples listed are illustrative in nature and do not limit either the possible embodiments of the present invention or the claims.
Fig. 2 is a block diagram of another computing system that is improved by incorporating an embodiment of the present invention. This block diagram includes a central processing unit or host processor 200, an SPP 210, a system memory 220, a graphics processing unit 230, an MCP 250, networks 260, and internal and peripheral devices 270. Again, a frame buffer, local, or graphics memory 240 is included, but shown with dashed lines to highlight its removal.
The CPU 200 communicates with the SPP 210 via a host bus 205 and accesses the system memory 220 via a memory bus 225. The GPU 230 communicates with the SPP 210 via a PCIE bus 235 and with a local memory via a memory bus 245. The MCP 250 communicates with the SPP 210 via a high-speed connection such as a HyperTransport bus 255, and connects the networks 260 and the internal and peripheral devices 270 to the remainder of the computer system.
As before, the central processing unit or host processor 200 may be one of the central processing units manufactured by Intel Corporation or other suppliers and is well known by those skilled in the art. The graphics processor 230, integrated graphics processor 210, and media and communications processor 250 are preferably provided by NVIDIA Corporation.
The removal of the frame buffers 140 and 240 in Figs. 1 and 2, and of the other frame buffers in other embodiments of the present invention, is not without consequence. For example, difficulties arise regarding the addresses used to store and read data in the system memory.
When a GPU uses a local memory to store data, the local memory is strictly under the control of the GPU. Typically, no other circuits access the local memory, which allows the GPU to track and allocate addresses in any way it sees fit. A system memory, however, is used by multiple circuits, and the operating system allocates space to those circuits. The space allocated to the GPU by the operating system may form one contiguous memory section. More likely, the space allocated to the GPU is subdivided into a number of blocks or sections, some of which may have different sizes. Each of these blocks or sections can be described by a starting or base address along with a memory size or address range.
It is difficult and inconvenient for a graphics processing unit to use actual system memory addresses, since the addresses provided to the GPU are distributed among a number of separate blocks. Moreover, the addresses provided to the GPU may change each time power is turned on or whenever memory addresses are otherwise reallocated. It is much easier for software running on the GPU to use virtual addresses that are independent of the actual physical addresses in the system memory. Specifically, the GPU views its memory space as one large contiguous block, while memory is allocated to the GPU in a number of smaller, disparate blocks. Accordingly, when data is written to or read from the system memory, a translation is performed between the virtual addresses used by the GPU and the physical addresses used by the system memory. This translation can be performed using tables whose entries include virtual addresses and their corresponding physical-address counterparts. These tables are referred to as page tables, and the entries are referred to as page table entries (PTEs).
A page table may be too large to be placed on the GPU; for reasons of cost, it is undesirable to do so. Accordingly, the page table is stored in the system memory. Unfortunately, this means that whenever data is needed from the system memory, a first or additional memory access is required to retrieve the needed page table entry, followed by a second memory access to retrieve the needed data. Accordingly, in embodiments of the present invention, some of the data in the page table is cached on the GPU in a graphics TLB.
When a page table entry is needed and is available in the graphics TLB on the GPU, a hit is said to occur, and the address translation can proceed. If the page table entry is not stored in the graphics TLB on the GPU, a miss is said to have occurred, and the needed page table entry is retrieved from the page table in the system memory.
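The hit-or-miss lookup just described can be pictured with a short sketch. The following C fragment is illustrative only; the table size, field names, and the helper read_pte_from_sysmem(), which models the serialized system memory access on a miss, are assumptions made for the example rather than details taken from this document.

```c
#include <stdbool.h>
#include <stdint.h>

#define GTLB_ENTRIES 64
#define PAGE_SHIFT   12          /* 4 KB pages */

typedef struct {
    bool     valid;
    bool     locked;             /* entry may not be evicted (discussed below) */
    bool     dirty;              /* entry modified since read from system memory */
    uint64_t virt_page;          /* virtual page number */
    uint64_t phys_page;          /* physical page number */
    uint64_t lru_tick;           /* last-use counter for LRU replacement */
} gtlb_entry;

static gtlb_entry gtlb[GTLB_ENTRIES];
static uint64_t   use_tick;      /* monotonically increasing use counter */

/* Assumed stand-in for the serialized system-memory access on a miss. */
uint64_t read_pte_from_sysmem(uint64_t virt_page);

/* Translate a GPU virtual address into a physical address. */
uint64_t gtlb_translate(uint64_t vaddr)
{
    uint64_t vpage  = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1u << PAGE_SHIFT) - 1);

    for (int i = 0; i < GTLB_ENTRIES; i++) {
        if (gtlb[i].valid && gtlb[i].virt_page == vpage) {
            gtlb[i].lru_tick = ++use_tick;                      /* hit */
            return (gtlb[i].phys_page << PAGE_SHIFT) | offset;
        }
    }

    /* Miss: a first, serialized memory access fetches the PTE before
     * the data itself can be read in a second access. */
    uint64_t ppage = read_pte_from_sysmem(vpage);
    return (ppage << PAGE_SHIFT) | offset;
}
```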
Once a needed page table entry has been retrieved, it is very likely that the same page table entry will be needed again. Accordingly, to reduce the number of memory accesses, this page table entry should be stored in the graphics TLB. If there is no empty location in the cache, the least recently used page table entry may be overwritten, or evicted, in favor of the new page table entry. In various embodiments of the present invention, before eviction, a check is performed to determine whether the cached entry has been modified by the graphics processing unit since it was read from the system memory. If it has been modified, a write-back operation is performed before the new page table entry overwrites it in the graphics TLB; in the write-back operation, the updated page table entry is written back to the system memory. In other embodiments of the present invention, this write-back procedure is not performed.
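Continuing the sketch above, caching a fetched entry with least-recently-used eviction and the optional write-back check might look as follows; write_pte_to_sysmem() is again an assumed helper name, not an interface taken from this document.

```c
/* Continuation of the sketch above: caching a PTE fetched on a miss.
 * write_pte_to_sysmem() models the write-back of a modified entry;
 * embodiments that omit write-back would skip that step. */
void write_pte_to_sysmem(uint64_t virt_page, uint64_t phys_page);

void gtlb_insert(uint64_t vpage, uint64_t ppage)
{
    int victim = -1;

    for (int i = 0; i < GTLB_ENTRIES; i++) {
        if (!gtlb[i].valid) { victim = i; break; }   /* empty slot first */
        if (gtlb[i].locked) continue;                /* never evict locked entries */
        if (victim < 0 || gtlb[i].lru_tick < gtlb[victim].lru_tick)
            victim = i;                              /* least recently used */
    }
    if (victim < 0)
        return;                                      /* every entry is locked */

    /* Write-back variant: flush a modified entry before overwriting it. */
    if (gtlb[victim].valid && gtlb[victim].dirty)
        write_pte_to_sysmem(gtlb[victim].virt_page, gtlb[victim].phys_page);

    gtlb[victim] = (gtlb_entry){ .valid = true, .virt_page = vpage,
                                 .phys_page = ppage, .lru_tick = ++use_tick };
}
```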
In a specific embodiment of the present invention, the page table is indexed based on the smallest interval size the system may allocate; for example, a PTE may represent a minimum of four 4-KB blocks or pages. Accordingly, the relevant index into the page table is generated by dividing the virtual address by 16 KB and then multiplying by the size of an entry. After a graphics TLB miss, the GPU uses this index to find the needed page table entry. In this specific embodiment, a page table entry can map one or more blocks larger than 4 KB. For example, a page table entry maps a minimum of four 4-KB blocks, and can map blocks larger than 4 KB, up to 4, 8, or 16 blocks totaling a maximum of 256 KB. Once such a page table entry is written into the cache, the graphics TLB can locate any virtual address within that 256 KB by referencing a single graphics TLB entry, which is a single PTE. In this case, the page table itself is arranged in 16-byte entries, each of which maps at least 16 KB. Accordingly, the 256-KB page table entry is replicated in each page table location within that 256 KB of virtual address space; in this example, there are therefore 16 page table entries with exactly identical information. A miss within the 256 KB reads any one of those identical entries.
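As a worked illustration of this indexing arithmetic (the 16-byte entry size and the 16-KB minimum mapping follow the specific embodiment above; the function name is an assumption):

```c
#include <stdint.h>

#define PTE_SIZE      16u               /* bytes per page table entry */
#define MIN_MAP_BYTES (16u * 1024u)     /* each PTE maps at least 16 KB */

/* Byte offset of the PTE for a given virtual address, per the specific
 * embodiment: divide by 16 KB, then multiply by the entry size. */
uint64_t pte_index_bytes(uint64_t vaddr)
{
    return (vaddr / MIN_MAP_BYTES) * PTE_SIZE;
}

/* Example: a PTE mapping a 256 KB region is replicated in all
 * 256 KB / 16 KB = 16 page table slots covering that region, so a miss
 * anywhere in the region reads an identical copy:
 *   pte_index_bytes(0x40000) .. pte_index_bytes(0x7FFFF)
 * span 16 consecutive 16-byte entries with the same contents. */
```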
As noted above, if a needed page table entry is not available in the graphics TLB, an additional memory access is required to retrieve the entry. These additional memory accesses are highly undesirable for data that requires steady, continuous access, particularly for graphics functions. For example, a graphics processing unit needs reliable access to display data so that it can provide image data to a monitor at the required rate. If too many memory accesses are needed, the resulting latency may interrupt the flow of pixel data to the monitor, thereby corrupting the graphics image.
In particular, if the address translation information for a display data access needs to be read from the system memory, that access is serialized with the subsequent data access; that is, the address translation information must be read from memory before the GPU can know where the needed display data is stored. The extra latency caused by the additional memory access reduces the rate at which display data can be provided to the monitor, once again corrupting the graphics image. These additional memory accesses also increase traffic on the PCIE bus and waste system memory bandwidth.
The extra memory reads used to retrieve address translation information are especially likely at power-up, or following another event after which the graphics TLB is empty or has been cleared. Specifically, when a computer system powers up, the basic input/output system (BIOS) expects the GPU to have a local frame buffer memory at its disposal. Accordingly, in conventional systems, the system BIOS does not allocate space in the system memory for use by the graphics processor. Instead, the GPU requests an amount of system memory space from the operating system. After the operating system allocates the memory space, the GPU can store page table entries in a page table in the system memory, but the graphics TLB is empty. When display data is needed, each request for a PTE results in a miss, which in turn results in an additional memory access.
Accordingly, embodiments of the present invention pre-fill the graphics TLB with page table entries. That is, the graphics TLB is filled with page table entries before a request needing a page table entry can result in a cache miss. This pre-filling typically includes at least the page table entries needed to retrieve display data, though the graphics TLB may be pre-filled with other page table entries as well. Additionally, some entries may be locked or otherwise restricted to prevent those page table entries from being evicted. In a specific embodiment of the present invention, the page table entries needed for display data are locked or restricted, though in other embodiments other types of data may be locked or restricted. A flowchart illustrating one such exemplary embodiment is shown in the following figure.
Fig. 3 is a flowchart illustrating a method of accessing display data stored in a system memory according to an embodiment of the present invention. This figure, like the other included figures, is shown for illustrative purposes only and does not limit either the possible embodiments of the present invention or the claims. Also, while this and the other examples shown here are particularly well suited to accessing display data, other types of data accesses can be improved by incorporating embodiments of the present invention.
In this method, the GPU, or more precisely a driver or resource manager running on the GPU, ensures that virtual addresses can be translated into physical addresses using address translation information stored on the GPU itself, without the need to retrieve this information from the system memory. This is achieved by initially pre-filling, or preloading, the graphics TLB with translation entries. The addresses associated with display data are then locked or otherwise prevented from being overwritten or evicted.
Specifically, in act 310, the computer or other electronic system is powered up, or undergoes a reboot, power reset, or similar event. In act 320, the resource manager, as part of the driver running on the GPU, requests system memory space from the operating system. In act 330, the operating system running on the CPU allocates space in the system memory.
While in this example the operating system running on the CPU is responsible for allocating frame buffer or graphics memory space in the system memory, in various embodiments of the present invention a driver or other software running on the CPU or another device in the system may be responsible for this task. In other embodiments, the task is shared among one or more of the operating system and such drivers or other software. In act 340, the resource manager receives physical address information for the allocated space in the system memory from the operating system. This information will typically include at least a base address and the size or range of one or more sections of the system memory.
The resource manager may then compress or otherwise arrange this information so as to limit the number of page table entries needed to translate the virtual addresses used by the GPU into the physical addresses used by the system memory. For example, separate but contiguous blocks of system memory space allocated to the GPU can be combined, with a single base address used as the starting address and the virtual address used as an index, as shown in the sketch below. An example of this can be found in co-pending and commonly owned U.S. patent application No. 11/077,662, filed March 10, 2005, entitled "Memory Management for Virtual Address Space with Translation Units of Variable Range Size," which is incorporated herein by reference. Also, while in this example this task is the responsibility of the resource manager, which is part of the driver running on the GPU, in other embodiments this and the other tasks shown in this and the other included examples may be performed or shared by other software, firmware, or hardware.
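A minimal sketch of the kind of coalescing described here follows, assuming a hypothetical block descriptor for the sections returned by the operating system; the actual resource manager data structures are not specified in this document.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t base;   /* physical base address of an allocated block */
    uint64_t size;   /* size of the block in bytes */
} mem_block;

/* Merge physically contiguous blocks in place so that fewer PTEs are
 * needed to cover the allocation. Assumes blocks[] is sorted by base.
 * Returns the new number of blocks. */
size_t coalesce_blocks(mem_block *blocks, size_t n)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (out > 0 &&
            blocks[out - 1].base + blocks[out - 1].size == blocks[i].base) {
            blocks[out - 1].size += blocks[i].size;   /* contiguous: merge */
        } else {
            blocks[out++] = blocks[i];
        }
    }
    return out;
}
```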
In act 350, the resource manager writes the translation entries to a page table in the system memory. The resource manager also preloads, or pre-fills, the graphics TLB with at least some of these translation entries. In act 360, some or all of the graphics TLB entries may be locked or otherwise prevented from being evicted. In some embodiments of the present invention, the addresses for data to be displayed are protected from being overwritten or evicted, in order to guarantee that the addresses of display information can be provided without extra system memory accesses for address translation information.
Various methods consistent with embodiments of the present invention can be used to achieve this locking. For example, where a number of clients can read data from the graphics TLB, one or more of these clients may be restricted such that they cannot write data to restricted cache locations, but instead must write to one of a number of pooled, or unrestricted, cache lines. Further details can be found in co-pending and commonly owned U.S. patent application No. 11/298,256, filed December 8, 2005, entitled "Shared Cache with Client-Specific Replacement Policy," which is incorporated herein by reference. In other embodiments, other restrictions may be applied to the circuits that write to the graphics TLB, or data such as flags may be stored in the graphics TLB along with the entries. For example, the existence of some cache lines may be hidden from the circuits that can write to the graphics TLB. Alternatively, if a flag is set, the data in the associated cache line cannot be overwritten or evicted.
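Continuing the earlier sketch, the preload of act 350 and the lock of act 360 might be modeled as follows, using the flag-based variant of the locking described above; the helper name and the pte_pair type are assumptions, and the client-restricted and hidden-line variants would be implemented differently.

```c
/* Pre-fill the graphics TLB at power-up (act 350) and lock the display
 * entries so they cannot be evicted (act 360). Uses the gtlb[] sketch
 * from above; display_ptes[] stands in for the translations covering
 * the display surface. gtlb_insert() above skips locked entries, so
 * these entries survive all later evictions. */
typedef struct { uint64_t virt_page, phys_page; } pte_pair;

void gtlb_prefill_and_lock(const pte_pair *display_ptes, int count)
{
    for (int i = 0; i < count && i < GTLB_ENTRIES; i++) {
        gtlb[i].valid     = true;
        gtlb[i].locked    = true;   /* flag variant: eviction checks this */
        gtlb[i].dirty     = false;
        gtlb[i].virt_page = display_ptes[i].virt_page;
        gtlb[i].phys_page = display_ptes[i].phys_page;
        gtlb[i].lru_tick  = 0;
    }
}
```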
In act 370, when display data or other data is needed from the system memory, the page table entries in the graphics TLB are used to translate the virtual addresses used by the GPU into physical addresses. Specifically, a virtual address is provided to the graphics TLB, and a corresponding physical address is read. Again, if this information is not stored in the graphics TLB, it must be requested from the system memory before the address translation can occur.
In various embodiments of the present invention, other techniques may be included to limit the effect of graphics TLB misses. In particular, extra steps can be taken to reduce the memory access latency, thereby reducing the effect of a cache miss on the supply of display data. One solution is to utilize the virtual channel VC1 that is part of the PCIE specification. If graphics TLB misses use the virtual channel VC1, they can bypass other requests, allowing the needed entries to be retrieved more quickly. Conventional chipsets, however, do not allow access to the virtual channel VC1. Thus, while NVIDIA Corporation could implement this solution in products in a manner consistent with the present invention, interoperability with other devices makes doing so undesirable at present, though this may change in the future. Another solution involves prioritizing or tagging the requests generated by graphics TLB misses; for example, requests may be tagged with a high-priority flag. This solution has interoperability considerations similar to those of the previous solution.
Figs. 4A-C illustrate the transfer of commands and data in a computer system during a method of accessing display data according to an embodiment of the present invention. In this particular example, the computer system of Fig. 1 is shown, though the transfer of commands and data in other systems, such as the system shown in Fig. 2, is similar.
In Fig. 4A, when the system powers up, resets, reboots, or another such event occurs, the GPU sends a request for system memory space to the operating system. Again, this request may come from a driver running on the GPU; in particular, the resource manager portion of the driver may make the request, though other hardware, firmware, or software may make it as well. The request may be passed from the GPU 430 to the central processing unit 400 via the system platform processor 410.
In Fig. 4B, the operating system allocates space in the system memory for the GPU to use as a frame buffer or graphics memory 422. The data stored in the frame buffer or graphics memory 422 may include display data, that is, the pixel values to be displayed, along with textures, texture descriptors, shader program instructions, and other data and commands.
In this example, the allocated space, namely the frame buffer 422 in the system memory 420, is shown as contiguous. In other embodiments or examples, the allocated spaces may be noncontiguous; that is, they may be entirely disparate, split into a number of sections.
Information typically including the base addresses and ranges of one or more sections of the system memory is passed to the GPU. Again, in a specific embodiment of the present invention, this information is passed to the resource manager portion of the driver running on the GPU 430, though other software, firmware, or hardware may be used. The information may be passed from the CPU 400 to the GPU 430 via the system platform processor 410.
In Fig. 4C, the GPU writes translation entries into a page table in the system memory. The GPU also preloads the graphics TLB with at least some of these translation entries. Again, these entries translate the virtual addresses used by the GPU into the physical addresses used by the frame buffer 422 in the system memory 420.
As before, some of the entries in the graphics TLB may be locked or otherwise restricted such that they cannot be evicted or overwritten. Again, in a specific embodiment of the present invention, the entries that are locked or otherwise restricted are those identifying the translated addresses of the locations in the frame buffer 422 where pixel or display data is stored.
When data is needed from the frame buffer 422, the graphics TLB 432 is used to translate the virtual addresses used by the GPU 430 into physical addresses. The requests are then passed to the system platform processor 410, which reads the needed data and passes it back to the GPU 430.
In the above example, after power-up or another power reset or similar condition, the GPU sends a request for space in the system memory to the operating system. In other embodiments of the present invention, the fact that the GPU will need space in the system memory is known, and no request needs to be made. In that case, after power-up, reset, reboot, or another appropriate event, the system BIOS, the operating system, or other software, firmware, or hardware may allocate the space in the system memory. This is particularly feasible in controlled environments, for example mobile applications, where the GPU is typically not as easily swapped or replaced as it is in desktop applications.
The GPU may know the addresses in the system memory that it will use, or the address information may be passed to the GPU by the system BIOS or the operating system. In either case, the memory space may be a contiguous portion of memory, in which case only a single address, the base address, needs to be known by or provided to the GPU. Alternatively, the memory space may be disparate or noncontiguous, in which case multiple addresses may need to be known by or provided to the GPU. Typically, other information, such as memory block sizes or range information, is also passed to or known by the GPU.
Also, in various embodiments of the present invention, space in the system memory may be allocated by the operating system at system power-up, and the GPU may make requests for additional memory at a later time. In such an example, both the system BIOS and the operating system may allocate space in the system memory for the GPU's use. The following figure shows an example of an embodiment of the present invention in which the system BIOS is programmed to allocate system memory space for the GPU at power-up.
Fig. 5 is a flowchart illustrating another method of accessing display data in a system memory according to an embodiment of the present invention. Again, while embodiments of the present invention are well suited to providing access to display data, various embodiments may provide access to this or other types of data. In this example, the system BIOS knows at power-up that it needs to allocate space in the system memory for the GPU's use. This space may be contiguous or noncontiguous. Also, in this example, the system BIOS passes memory and address information to the resource manager or other portion of a driver on the GPU, though in other embodiments of the present invention the resource manager or other portion of the driver on the GPU may know the address information ahead of time.
Specifically, in act 510, the computer or other electronic system powers up. In act 520, the system BIOS, or other appropriate software, firmware, or hardware such as the operating system, allocates space in the system memory for the GPU's use. If the memory space is contiguous, the system BIOS provides the base address to the resource manager or driver running on the GPU. If the memory space is noncontiguous, the system BIOS provides a number of base addresses, each typically accompanied by memory block size information, such as size or address range information. Typically, the memory space is a carveout, a contiguous memory space, and the base address information is accompanied by address range information.
In act 540, the base address and range are stored on the GPU. In act 550, subsequent virtual addresses can be converted to physical addresses by using the virtual address as an index. For example, in some embodiments of the present invention, a virtual address can be converted to a physical address by adding the virtual address to the base address.
Specifically, when a virtual address is to be translated to a physical address, a range check is performed. When the stored physical base address corresponds to virtual address zero, a virtual address within the range can be translated by adding the virtual address to the physical base address. Similarly, when the stored physical base address corresponds to a virtual address "X," a virtual address within the range can be translated by adding the virtual address to the physical base address and subtracting "X." If the virtual address is not within the range, the address can be translated using a graphics TLB or page table, as described above.
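A compact sketch of this range check follows, assuming the carveout base, range, and the starting virtual address "X" are held in registers on the GPU as described; gtlb_translate(), sketched earlier, stands in for the graphics TLB or page table fallback path.

```c
#include <stdint.h>

/* Carveout parameters stored on the GPU, e.g. in hardware registers. */
static uint64_t carveout_base;   /* physical base address of the carveout */
static uint64_t carveout_range;  /* size of the carveout in bytes */
static uint64_t carveout_vbase;  /* virtual address "X" mapped to the base */

/* Fallback path: graphics TLB / page table lookup (sketched earlier). */
uint64_t gtlb_translate(uint64_t vaddr);

uint64_t translate(uint64_t vaddr)
{
    /* Range check: does the virtual address fall inside the carveout? */
    if (vaddr >= carveout_vbase && vaddr - carveout_vbase < carveout_range)
        return carveout_base + (vaddr - carveout_vbase);  /* add base, subtract X */

    return gtlb_translate(vaddr);  /* outside the carveout: TLB / page table */
}
```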
Fig. 6 illustrates the transfer of commands and data in a computer system during a method of accessing display data according to an embodiment of the present invention. At power-up, the system BIOS allocates a space in the system memory 620, the "carveout" 622, for use by the GPU 630.
The GPU receives and stores the base address (or base addresses) of the allocated space, the carveout 622, in the system memory 620. This data can be stored in the graphics TLB 632, or it can be stored elsewhere on the GPU 630, for example in a hardware register. This address is stored, for example in a hardware register, along with the range of the carveout 622.
When data is to be read from the frame buffer 622 in the system memory 620, the virtual addresses used by the GPU 630 can be converted into the physical addresses used by the system memory by treating the virtual address as an index. Again, in a specific embodiment of the present invention, virtual addresses within the carveout address range are translated into physical addresses by adding the virtual address to the base address. That is, if the base address corresponds to virtual address zero, a virtual address can be converted into a physical address by adding it to the base address, as described above. Likewise, virtual addresses outside the range can be translated using the graphics TLB and page tables, as described above.
Fig. 7 is a block diagram of a graphics processing unit consistent with an embodiment of the present invention. This block diagram of the graphics processing unit 700 includes a PCIE interface 710, a graphics pipeline 720, a graphics TLB 730, and logic circuitry 740. The PCIE interface 710 transmits and receives data over the PCIE bus 750. Again, in other embodiments of the present invention, other types of buses, currently developed or being developed, as well as those that will be developed in the future, may be used. The graphics processing unit is typically formed on an integrated circuit, though in some embodiments more than one integrated circuit may comprise the GPU 700.
The graphics pipeline 720 receives data from the PCIE interface and renders the data for display on a monitor or other device. The graphics TLB 730 stores page table entries that are used to translate the virtual memory addresses used by the graphics pipeline 720 into the physical memory addresses used by the system memory. The logic circuitry 740 controls the graphics TLB 730, checks for locks or other restrictions on data stored in the graphics TLB 730, and reads data from and writes data to the cache.
Fig. 8 is a diagram of a graphics card according to an embodiment of the present invention. The graphics card 800 includes a graphics processing unit 810, a bus connector 820, and a connector 830 to a second graphics card. The bus connector 820 may be a PCIE connector designed to fit a PCIE slot, for example a PCIE slot on a computer system motherboard. The connector 830 to a second card may be configured to fit a jumper or other connection to one or more other graphics cards. Other devices, such as power regulators and capacitors, may be included. It should be noted that no memory devices are included on this graphics card.
The above description of exemplary embodiments of the present invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form described, and many modifications and variations are possible in light of the teaching above. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to best utilize the invention in various embodiments and with the various modifications suited to the particular use contemplated.

Claims (19)

1. A method of retrieving data using a graphics processor, the method comprising:
requesting access to memory locations in a system memory;
receiving address information for at least one block of memory locations in the system memory, the address information comprising information identifying at least one physical memory address; and
storing a page table entry corresponding to the at least one physical memory address in a cache on the graphics processor;
wherein the address information is received, and the page table entry is stored in the cache, without waiting for a cache miss to occur.
2. The method of claim 1, further comprising: storing the page table entry in the system memory.
3. The method of claim 2, further comprising: locking the location in the cache where the page table entry is stored.
4. The method of claim 3, wherein the graphics processor is a graphics processing unit.
5. The method of claim 3, wherein the graphics processor is included on an integrated graphics processor.
6. The method of claim 3, wherein the request for access to memory locations in the system memory is made to an operating system.
7. The method of claim 3, wherein the information identifying at least one physical memory address comprises a base address and a memory block size.
8. A graphics processor comprising:
a data interface for providing requests for access to memory locations in a system memory and for receiving address information for memory locations in the system memory, the address information comprising information identifying at least one physical memory address;
a cache controller for writing page table entries corresponding to the at least one physical memory address; and
a cache for storing the page table entries,
wherein the address information is received, and the page table entries are stored in the cache, without waiting for a cache miss to occur.
9. The graphics processor of claim 8, wherein the data interface further provides requests to store the page table entries in the system memory.
10. The graphics processor of claim 8, wherein the data interface provides the requests for access to memory locations in the system memory after system power-up.
11. The graphics processor of claim 8, wherein the cache controller locks the locations where the page table entries are stored.
12. The graphics processor of claim 8, wherein the cache controller restricts access to locations where virtual addresses and physical addresses are stored.
13. The graphics processor of claim 8, wherein the data interface is a PCIE interface circuit.
14. The graphics processor of claim 8, wherein the graphics processor is a graphics processing unit.
15. The graphics processor of claim 8, wherein the graphics processor is included on an integrated graphics processor.
16. A method of retrieving data using a graphics processor, the method comprising, with the graphics processor:
receiving a base address and a range for a block of memory in a system memory;
storing the base address and range on the graphics processor;
receiving a first address;
determining whether the first address is within the range, and if it is,
translating the first address into a second address by adding the base address to the first address, otherwise,
storing a page table entry in a cache without waiting for a cache miss;
reading the page table entry from the cache on the graphics processor; and
using the page table entry to translate the first address into the second address.
17. The method of claim 16, further comprising: before reading the page table entry from the cache, determining whether the page table entry is stored in the cache, and if it is not, reading the page table entry from the system memory.
18. The method of claim 16, wherein the graphics processor is a graphics processing unit.
19. The method of claim 16, wherein the graphics processor is included on an integrated graphics processor.
CNB2007101376430A 2006-07-31 2007-07-27 Dedicated mechanism for page-mapping in a GPU Active CN100543770C (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US82095206P 2006-07-31 2006-07-31
US60/820,952 2006-07-31
US60/821,127 2006-08-01
US11/689,485 2007-03-21

Publications (2)

Publication Number Publication Date
CN101118646A CN101118646A (en) 2008-02-06
CN100543770C true CN100543770C (en) 2009-09-23

Family

ID=39054744

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101376430A Active CN100543770C (en) 2006-07-31 2007-07-27 The special mechanism that is used for the page or leaf mapping of GPU

Country Status (1)

Country Link
CN (1) CN100543770C (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8392667B2 (en) * 2008-12-12 2013-03-05 Nvidia Corporation Deadlock avoidance by marking CPU traffic as special
US9092358B2 (en) * 2011-03-03 2015-07-28 Qualcomm Incorporated Memory management unit with pre-filling capability
CN106683035B (en) * 2015-11-09 2020-03-13 龙芯中科技术有限公司 GPU acceleration method and device
CN110874332B (en) * 2016-08-26 2022-05-10 中科寒武纪科技股份有限公司 Memory management unit and management method thereof
CN111274166B (en) * 2018-12-04 2022-09-20 展讯通信(上海)有限公司 TLB pre-filling and locking method and device

Also Published As

Publication number Publication date
CN101118646A (en) 2008-02-06

Similar Documents

Publication Publication Date Title
KR101001100B1 (en) Dedicated mechanism for page-mapping in a gpu
CN102612685B (en) Non-blocking data transfer via memory cache manipulation
US5905509A (en) Accelerated Graphics Port two level Gart cache having distributed first level caches
CN101484883B (en) Apparatus and method for memory address re-mapping of graphics data
CN101681297B (en) Arrangements for memory allocation
US5444853A (en) System and method for transferring data between a plurality of virtual FIFO's and a peripheral via a hardware FIFO and selectively updating control information associated with the virtual FIFO's
US20150002526A1 (en) Shared Virtual Memory Between A Host And Discrete Graphics Device In A Computing System
US6003112A (en) Memory controller and method for clearing or copying memory utilizing register files to store address information
US20100161923A1 (en) Method and apparatus for reallocating memory content
US20140164677A1 (en) Using a logical to physical map for direct user space communication with a data storage device
CN101310259A (en) Method and system for symmetric allocation for a shared l2 mapping cache
US20160232640A1 (en) Resource management
CN101606130A (en) Enable the method and apparatus of resource allocation identification in the instruction-level of processor system
US6510497B1 (en) Method and system for page-state sensitive memory control and access in data processing systems
CN111684408B (en) Multi-memory type memory module system and method
US7353338B2 (en) Credit mechanism for multiple banks of shared cache
CN106484628A (en) Mixing memory module based on affairs
CN100543770C (en) The special mechanism that is used for the page or leaf mapping of GPU (Dedicated mechanism for page-mapping in a GPU)
CN107015923B (en) Coherent interconnect for managing snoop operations and data processing apparatus including the same
CN111538461A (en) Data reading and writing method and device based on solid state disk cache and storage medium
CN101369245A (en) System and method for implementing a memory defect map
CN101278270A (en) Apparatus and method for handling DMA requests in a virtual memory environment
CN100445964C (en) Mechanism for post remapping virtual machine storage page
CN117707998A (en) Method, system and storage medium for allocating cache resources
CN103870247A (en) Technique for saving and restoring thread group operating state

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant