CN101901198A - Deadlock avoidance by marking CPU traffic as special - Google Patents


Info

Publication number: CN101901198A
Application number: CN200910249698XA
Authority: CN (China)
Prior art keywords: request, processing unit, unit, read, parallel processing
Legal status: Pending
Other languages: Chinese (zh)
Inventors: Samuel H. Duncan, David B. Glasco, Wei-Je Huang, Atul Kalambur, Patrick R. Marchand, Dennis K. Ma
Current Assignee: Nvidia Corp
Original Assignee: Nvidia Corp
Application filed by Nvidia Corp
Priority: CN201510605017.4A (published as CN105302524A)
Publication of CN101901198A

Classifications

    • G06F13/14 - Handling requests for interconnection or transfer
    • G06F13/1652 - Handling requests for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F13/1663 - Access to shared memory
    • G06F13/36 - Handling requests for access to common bus or bus system
    • G06F13/376 - Access to common bus or bus system with decentralised access control using a contention resolving method, e.g. collision detection, collision avoidance
    • G06F13/4036 - Coupling between buses using bus bridges with arbitration and deadlock prevention
    • G06F13/4221 - Bus transfer protocol, e.g. handshake; synchronisation on a parallel input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus
    • G06F15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F9/3877 - Concurrent instruction execution using a slave processor, e.g. coprocessor


Abstract

Deadlocks are avoided by marking read requests issued by a parallel processor to system memory as ''special.'' Read completions associated with read requests marked as special are routed on virtual channel 1 of the PCIe bus. Data returning on virtual channel 1 cannot become stalled by write requests in virtual channel 0, thus avoiding a potential deadlock.

Description

Deadlock avoidance by marking CPU traffic as special
Technical field
The present invention relates generally to computer hardware, and more particularly to a method and system for avoiding deadlock by marking CPU traffic as special.
Background
A conventional computer system includes a central processing unit (CPU) and may also include a coprocessor known as a parallel processing unit (PPU). The CPU offloads certain processing operations to the PPU to reduce its own processing workload. Among these operations are compression and decompression. When the CPU requires such an operation, it issues requests, including read requests and/or write requests, to the PPU. For example, the CPU may need to write data to a region of system memory that is stored in a compressed format. The CPU sends a write request to the PPU; the PPU may then read and decompress the data associated with the write request, merge the decompressed data with the new data, and write the merged data back into system memory.
Sometimes a write request sent by the CPU causes the PPU to issue one or more "derived" read requests that must complete before the original write request can complete. For example, the PPU may issue a derived read request targeting a system memory unit associated with the CPU. When the read transaction finishes, system memory sends a read completion to the PPU, notifying the PPU that the transaction is complete.
However, a problem may arise when the CPU and the PPU are connected by a Peripheral Component Interconnect Express (PCIe) bus that has one or more pending write requests. Because of the ordering rules of the PCIe bus, a read completion cannot pass a write request, so the completion for any derived read request cannot return to the PPU. The original write request therefore cannot complete. This situation is known in the art as a circular dependency, or "deadlock." A deadlock halts some or all communication between the CPU and the PPU and adversely affects the processing throughput of the computer system. Some examples of deadlock conditions are discussed below.
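To make the circular dependency concrete, the sketch below models a single PCIe virtual channel as a FIFO under the standard rule that a read completion may not pass a posted write. The types and names are illustrative assumptions, not part of the patent.

```c
#include <stdbool.h>
#include <stdio.h>

/* One PCIe virtual channel modeled as a FIFO toward the PPU. */
typedef enum { POSTED_WRITE, READ_COMPLETION } txn_type;

typedef struct {
    txn_type type;
    bool waits_on_completion; /* the write cannot retire until a completion arrives */
} txn;

int main(void) {
    /* A pending write sits ahead of the very completion it is waiting for. */
    txn vc0[2] = {
        { POSTED_WRITE,    true  },
        { READ_COMPLETION, false },
    };
    /* PCIe ordering: the completion may not pass the write ahead of it,
     * and the write cannot retire without the completion: deadlock.    */
    bool deadlock = vc0[0].waits_on_completion &&
                    vc0[1].type == READ_COMPLETION;
    printf("deadlock: %s\n", deadlock ? "yes" : "no");
    return 0;
}
```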
In a first example, deadlock may occur if the PPU needs to read from a page table stored in system memory while write requests are pending on the PCIe bus. When the PPU issues a read request to system memory to fetch an entry from the page table, the completion associated with the read request cannot return to the PPU, so the original write request cannot complete.
Deadlock may also occur when the CPU sends the PPU a write request targeting a cache line in a cache memory unit associated with the PPU. To complete the write request, the PPU first determines whether the cache line is compressed by examining a tag store. The tag store indicates the compression state of recently accessed cache lines in the cache memory unit. When the tag store does not include the compression state of the cache line specified by the write request, the PPU issues a read request to access a backing store in system memory that holds the compression state of every cache line in the cache memory unit. The backing store returns the compression state of the specified cache line and sends a completion. However, when write requests are pending on the PCIe bus, the completion associated with the read request cannot pass those pending writes, and deadlock may occur.
A third kind of deadlock may occur when the CPU attempts to write data into a compressed region of system memory, known in the art as a "compression tile." The CPU sends the PPU a write request that specifies the compression tile and includes the write data. The PPU issues a read request to system memory to read the compression tile. When write requests are pending on the PCIe bus, the completion associated with the read request cannot pass those pending writes, so deadlock may occur.
Beyond these three examples, several other situations can cause deadlock. Accordingly, there remains a need in the art for a method and system for avoiding deadlock.
Summary of the invention
Embodiments of the invention provide a method and system for avoiding deadlock in a computer system that has a first processing unit, a second processing unit, a memory bridge, a system memory, and a bus connecting the second processing unit to the first processing unit, the memory bridge, and the system memory. Deadlock is avoided when the first processing unit sends read or write requests to the second processing unit.
A method for avoiding deadlock, according to an embodiment of the invention, includes: receiving a read or write request at the second processing unit on a first virtual channel of the bus; generating a derived read request while processing the read or write request; sending the derived read request to system memory over a second virtual channel of the bus; receiving the completion of the derived read request on the second virtual channel of the bus; and completing the received read or write request.
A system for avoiding deadlock, according to an embodiment of the invention, includes a bus interface unit within the second processing unit. The bus interface unit is configured to receive read or write requests from the first processing unit on a first virtual channel and to send derived read requests, generated while processing those read or write requests, over a second virtual channel.
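Taken together, the method and system amount to the transaction sequence sketched below. The bus primitives and type names are assumptions for illustration; the patent does not define a software interface.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { VC0 = 0, VC1 = 1 } vc_t;
typedef struct { uint64_t addr; int is_write; int special; } request_t;

/* Stub bus primitives standing in for the bus interface unit. */
static void bus_send(vc_t vc, const request_t *req) {
    printf("%s sent on VC%d\n", req->is_write ? "write" : "read", (int)vc);
}
static void bus_wait_completion(vc_t vc) {
    printf("completion received on VC%d\n", (int)vc);
}

/* The original request arrives on the first virtual channel (VC0); the
 * derived read travels on the second (VC1), so its completion cannot
 * queue behind pending VC0 writes. */
static void handle_request(const request_t *incoming) {
    request_t derived = { 0x1000 /* e.g. a page-table entry */, 0,
                          incoming->special };
    bus_send(VC1, &derived);      /* derived read to system memory */
    bus_wait_completion(VC1);     /* completion returns on VC1     */
    /* ... the original read or write request can now complete ... */
}

int main(void) {
    request_t w = { 0x2000, 1, 1 };  /* CPU write, marked special */
    handle_request(&w);
    return 0;
}
```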
Brief description of the drawings
So that the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, is given with reference to embodiments, some of which are illustrated in the appended drawings. It should be noted, however, that the appended drawings illustrate only typical embodiments of the invention and therefore should not be considered to limit its scope, for the invention may admit other equally effective embodiments.
Fig. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention;
Fig. 2 is a block diagram of a parallel processing subsystem for the computer system of Fig. 1, according to one embodiment of the invention;
Fig. 3A is a block diagram of a general processing cluster (GPC) within one of the parallel processing units (PPUs) of Fig. 2, according to one embodiment of the invention;
Fig. 3B is a block diagram of a partition unit within one of the parallel processing units of Fig. 2, according to one embodiment of the invention;
Fig. 4 is a block diagram of a computer system configured to avoid deadlock, according to one embodiment of the invention; and
Fig. 5 is a flow diagram of method steps for avoiding deadlock, according to one embodiment of the invention.
Detailed description
In the following description, numerous specific details are set forth to provide a more thorough understanding of the invention. However, it will be apparent to one skilled in the art that the invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the invention.
System overview
Fig. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via a bus path through a memory bridge 105. Memory bridge 105 may be integrated into CPU 102 as shown in Fig. 1. Alternatively, memory bridge 105 may be a conventional device, such as a Northbridge chip, connected to CPU 102 via a bus. Memory bridge 105 is connected to an I/O (input/output) bridge 107 via a communication path 106 (e.g., a HyperTransport link). I/O bridge 107, which may be, for example, a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or other communication path 113 (e.g., a PCI Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 (e.g., a conventional CRT- or LCD-based monitor). A system disk 114 is also connected to I/O bridge 107. A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, film recording devices, and the like, may also be connected to I/O bridge 107. The communication paths interconnecting the various components in Fig. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect), PCI Express (PCI-E), AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol, and connections between different devices may use different protocols, as is known in the art.
In one embodiment, parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU). In another embodiment, parallel processing subsystem 112 incorporates circuitry optimized for general-purpose processing while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, parallel processing subsystem 112 may be integrated with one or more other system elements, such as memory bridge 105, CPU 102, and I/O bridge 107, to form a system on chip (SoC).
It will be appreciated that the system shown here is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, one or more of CPU 102, I/O bridge 107, parallel processing subsystem 112, and memory bridge 105 are integrated into one or more chips. The particular components shown here are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.
Fig. 2 illustrates a parallel processing subsystem 112, according to one embodiment of the invention. As shown, parallel processing subsystem 112 includes one or more parallel processing units (PPUs) 202, each of which is coupled to a local parallel processing (PP) memory 204. In general, a parallel processing subsystem includes a number U of PPUs, where U >= 1. (Herein, multiple instances of like objects are denoted with reference numbers identifying the object and parenthetical numbers identifying the instance where needed.) PPUs 202 and parallel processing memories 204 may be implemented using one or more integrated circuit devices, such as programmable processors, application specific integrated circuits (ASICs), or memory devices, or in any other technically feasible fashion.
Referring again to Fig. 1, in some embodiments, some or all of the parallel processing units 202 in parallel processing subsystem 112 are graphics processors with rendering pipelines that can be configured to perform various tasks related to: generating pixel data from graphics data supplied by CPU 102 and/or system memory 104; interacting with local parallel processing memory 204 (which can be used as graphics memory including, e.g., a conventional frame buffer) to store and update pixel data; delivering pixel data to display device 110; and the like. In some embodiments, parallel processing subsystem 112 may include one or more parallel processing units 202 that operate as graphics processors and one or more other parallel processing units 202 that are used for general-purpose computations. The parallel processing units may be identical or different, and each parallel processing unit 202 may have its own dedicated parallel processing memory device(s) or no dedicated parallel processing memory device(s). One or more parallel processing units 202 may output data to display device 110, or each parallel processing unit 202 may output data to one or more display devices 110.
Referring to Fig. 2, in some embodiments there may be no local parallel processing memory 204; in that case, memory references are reflected back to system memory 104 (not shown here), by way of a local cache (not shown), through crossbar unit 210 and I/O unit 205.
In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating the operation of the other system components. In particular, CPU 102 issues commands that control the operation of parallel processing units 202. In some embodiments, CPU 102 writes a stream of commands for each parallel processing unit 202 to a command buffer (not explicitly shown in Figs. 1 and 2), which may be located in system memory 104, parallel processing memory 204, or another storage location accessible to both CPU 102 and parallel processing unit 202. A parallel processing unit 202 reads the command stream from the command buffer and then executes the commands asynchronously with respect to the operation of CPU 102. CPU 102 may also create data buffers that parallel processing units 202 can read in response to commands in the command buffer. Each command and data buffer may be read by each of the parallel processing units 202.
Referring back now to Fig. 2, each parallel processing unit 202 includes an I/O (input/output) unit 205 that communicates with the rest of computer system 100 via communication path 113, which connects to memory bridge 105 (or, in one alternative embodiment, directly to CPU 102). The connection of parallel processing unit 202 to the rest of computer system 100 may also vary. In some embodiments, parallel processing subsystem 112 is implemented as an add-in card that can be inserted into an expansion slot of computer system 100. In other embodiments, a parallel processing unit 202 can be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. In still other embodiments, some or all elements of parallel processing unit 202 may be integrated on a single chip with CPU 102.
In one embodiment, communication path 113 is a PCI-E link, in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to parallel processing memory 204) may be directed to a memory crossbar unit 210. Host interface 206 reads each command buffer and outputs the work specified by the command buffer to a front end 212.
Each PPU 202 advantageously implements a highly parallel processing architecture. As shown in detail, PPU 202(0) includes a processing cluster array 230 that includes a number C of general processing clusters (GPCs) 208, where C >= 1. Each GPC 208 is capable of executing a large number (e.g., hundreds or thousands) of threads concurrently, where each thread is an instance of a program. In various applications, different GPCs 208 may be allocated for processing different types of programs or for performing different types of computations. For example, in a graphics application, a first set of GPCs 208 may be allocated to perform tessellation operations and to produce primitive topologies for patches, and a second set of GPCs 208 may be allocated to perform tessellation shading to evaluate patch parameters for the primitive topologies and to determine vertex positions and other per-vertex attributes. The allocation of GPCs 208 may vary depending on the workload arising for each type of program or computation. Alternatively, GPCs 208 may be allocated to perform processing tasks using a time-slice scheme to switch between different processing tasks.
GPCs 208 receive processing tasks to be executed via a work distribution unit 200, which receives commands defining processing tasks from front end unit 212. Processing tasks include pointers to the data to be processed, e.g., surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). Work distribution unit 200 may be configured to fetch the pointers corresponding to the processing tasks, work distribution unit 200 may receive the pointers from front end 212, or work distribution unit 200 may receive the data directly from front end 212. In some embodiments, indices specify the location of the data in an array. Front end 212 ensures that GPCs 208 are configured to a valid state before the processing specified by the command buffers is initiated.
When parallel processing unit 202 is used for graphics processing, for example, the processing workload for each patch is divided into approximately equal-sized tasks so that tessellation processing can be distributed to multiple GPCs 208. Work distribution unit 200 may be configured to output tasks at a frequency capable of providing tasks to multiple GPCs 208 for processing. In some embodiments of the invention, portions of GPCs 208 are configured to perform different types of processing. For example, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading in screen space to produce a rendered image. The ability to allocate portions of GPCs 208 for performing different types of processing tasks efficiently accommodates any expansion and contraction of the data produced by those different types of processing tasks. Intermediate data produced by GPCs 208 may be buffered, allowing the intermediate data to be transmitted between GPCs 208 with minimal stalling in cases where the rate at which a downstream GPC 208 accepts data lags the rate at which an upstream GPC 208 produces data.
Memory interface 214 may be partitioned into a number D of memory partition units, each coupled to a portion of parallel processing memory 204, where D >= 1. Each portion of parallel processing memory 204 generally includes one or more memory devices (e.g., DRAM 220). Persons skilled in the art will appreciate that DRAM 220 may be replaced by other suitable storage devices and can generally be of conventional design; a detailed description is therefore omitted. In one embodiment, DRAM 220 may be omitted entirely, and memory requests are reflected back to memory bridge 105 through crossbar 210 and I/O unit 205. Render targets, such as frame buffers or texture maps, may be stored across DRAMs 220, allowing partition units 215 to write portions of each render target in parallel to efficiently use the available bandwidth of parallel processing memory 204.
Any one of GPCs 208 may process data to be written to any of the partition units 215 within parallel processing memory 204. Crossbar unit 210 is configured to route the output of each GPC 208 to the input of any partition unit 215 or to another GPC 208 for further processing. GPCs 208 communicate with memory interface 214 through crossbar unit 210 to read from or write to various external memory devices. In one embodiment, crossbar unit 210 has a connection to memory interface 214 to communicate with I/O unit 205, as well as a connection to local parallel processing memory 204, thereby enabling the processing cores within the different GPCs 208 to communicate with system memory 104 or with other memory that is not local to parallel processing unit 202. Crossbar unit 210 may use virtual channels to separate traffic streams between the GPCs 208 and partition units 215.
In addition, GPCs 208 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to linear and nonlinear data transforms, filtering of video and/or audio data, modeling operations (e.g., applying laws of physics to determine position, velocity, and other attributes of objects), image rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or pixel shader programs), and so on. Parallel processing units 202 may transfer data from system memory 104 and/or local parallel processing memories 204 into internal (on-chip) memory, process the data, and write result data back to system memory 104 and/or local parallel processing memories 204, where such data can be accessed by other system components, including CPU 102 or another parallel processing subsystem 112.
A parallel processing unit 202 may be provided with any amount of local parallel processing memory 204, including no local memory, and may use local memory and system memory in any combination. For instance, a parallel processing unit 202 can be a graphics processor in a unified memory architecture (UMA) embodiment. In such embodiments, little or no dedicated graphics (parallel processing) memory would be provided, and parallel processing unit 202 would use system memory exclusively or almost exclusively. In UMA embodiments, a parallel processing unit 202 may be integrated into a bridge chip or processor chip, or provided as a discrete chip with a high-speed link (e.g., PCI-E) connecting the parallel processing unit 202 to system memory via a bridge chip or other communication means.
As noted above, any number of parallel processing units 202 can be included in a parallel processing subsystem 112. For instance, multiple parallel processing units 202 can be provided on a single add-in card, multiple add-in cards can be connected to communication path 113, or one or more parallel processing units 202 can be integrated into a bridge chip. The parallel processing units 202 in a multi-PPU system may be identical to or different from one another. For instance, different parallel processing units 202 might have different numbers of processing cores, different amounts of local parallel processing memory, and so on. Where multiple parallel processing units 202 are present, those parallel processing units may be operated in parallel to process data at a higher throughput than is possible with a single parallel processing unit 202. Systems incorporating one or more parallel processing units 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.
Processing cluster array overview
Fig. 3A is a block diagram of a GPC 208 within one of the parallel processing units 202 of Fig. 2, according to one embodiment of the invention. Each GPC 208 may be configured to execute a large number of threads in parallel, where the term "thread" refers to an instance of a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each GPC 208. Unlike a SIMD execution regime, in which all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Persons skilled in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime.
In graphics applications, a GPC 208 may be configured to implement a primitive engine for performing screen-space graphics processing functions that may include, but are not limited to, primitive setup, rasterization, and Z culling. The primitive engine 304 receives processing tasks from work distribution unit 200, and when a processing task does not require an operation performed by the primitive engine, the processing task is passed through the primitive engine to a pipeline manager 305. Operation of GPC 208 is advantageously controlled via pipeline manager 305, which distributes processing tasks to streaming multiprocessors (SPMs) 310. Pipeline manager 305 may also be configured to control a work distribution crossbar 330 by specifying destinations for processed data output by SPMs 310.
In one embodiment, each GPC 208 includes a number M of SPMs 310, where M >= 1, each SPM 310 configured to process one or more thread groups. Also, each SPM 310 advantageously includes an identical set of functional units (e.g., arithmetic logic units, etc.) that may be pipelined, allowing a new instruction to be issued before a previous instruction has finished, as is known in the art. Any combination of functional units may be provided. In one embodiment, the functional units support a variety of operations, including integer and floating-point arithmetic (e.g., addition and multiplication), comparison operations, Boolean operations (AND, OR, XOR), bit-shifting, and computation of various algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions, etc.); and the same functional-unit hardware can be leveraged to perform different operations.
The series of instructions transmitted to a particular GPC 208 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SPM 310 is referred to herein as a "thread group." As used herein, a thread group refers to a group of threads concurrently executing the same program on different input data, with each thread of the group assigned to a different processing engine within an SPM 310. A thread group may include fewer threads than the number of processing engines within the SPM 310, in which case some processing engines will be idle during cycles when that thread group is being processed. A thread group may also include more threads than the number of processing engines within the SPM 310, in which case processing will take place over multiple clock cycles. Since each SPM 310 can support up to G thread groups concurrently, up to GxM thread groups can be executing in GPC 208 at any given time.
Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within an SPM 310. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA"). The size of a particular CTA is equal to m*k, where k is the number of concurrently executing threads in a thread group and is typically an integer multiple of the number of parallel processing engines within the SPM 310, and m is the number of thread groups simultaneously active within the SPM 310. The size of a CTA is generally determined by the programmer and by the amount of hardware resources, such as memory or registers, available to the CTA.
An exclusive local address space is available to each thread, and a shared per-CTA address space is used to pass data between threads within a CTA. Data stored in the per-thread local address space and the per-CTA address space is stored in an L1 cache 320, and an eviction policy may be used to favor keeping the data in L1 cache 320. Each SPM 310 uses space in a corresponding L1 cache 320 to perform load and store operations. Each SPM 310 also has access to the L2 caches within the partition units 215, which are shared among all GPCs 208 and may be used to transfer data between threads. Finally, SPMs 310 also have access to off-chip "global" memory, which can include, e.g., parallel processing memory 204 and/or system memory 104. An L2 cache may be used to store data that is written to and read from global memory. It is to be understood that any memory external to parallel processing unit 202 may be used as global memory.
In graphics applications, a GPC 208 may be configured such that each SPM 310 is coupled to a texture unit 315 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read via memory interface 214 and is fetched from an L2 cache, parallel processing memory 204, or system memory 104, as needed. Texture unit 315 may be configured to store the texture data in an internal cache. In some embodiments, texture unit 315 is coupled to L1 cache 320, and the texture data is stored in L1 cache 320. Each SPM 310 outputs processed tasks to work distribution crossbar 330 in order to provide the processed task to another GPC 208 for further processing via crossbar unit 210, or to store the processed task in an L2 cache, parallel processing memory 204, or system memory 104. A preROP (pre-raster operations) unit 325 is configured to receive data from SPM 310, direct the data to ROP units within the partition units 215, perform optimizations for color blending, organize pixel color data, and perform address translations.
It will be appreciated that the core architecture described herein is illustrative and that variations and modifications are possible. Any number of processing engines, e.g., primitive engines 304, SPMs 310, texture units 315, or preROPs 325, may be included within a GPC 208. Further, while only one GPC 208 is shown, a parallel processing unit 202 may include any number of GPCs 208, which are advantageously functionally similar to one another so that execution behavior does not depend on which GPC 208 receives a particular processing task. Further, each GPC 208 advantageously operates independently of the other GPCs 208, using separate and distinct processing engines, L1 caches 320, and so on.
Fig. 3B is a block diagram of a partition unit 215 within one of the parallel processing units 202 of Fig. 2, according to one embodiment of the invention. As shown, partition unit 215 includes an L2 cache 350, a frame buffer (FB) 355, and a raster operations unit (ROP) 360. L2 cache 350 is a read/write cache configured to perform load and store operations received from crossbar unit 210 and ROP 360. Read misses and urgent writeback requests are output by L2 cache 350 to frame buffer 355 for processing. Dirty updates are also sent to frame buffer 355 for opportunistic processing. Frame buffer 355 interfaces directly with parallel processing memory 204, outputting read and write requests and receiving data read from parallel processing memory 204.
In graphics applications, ROP 360 is a processing unit that performs raster operations, such as stencil, z test, blending, and the like, and outputs pixel data as processed graphics data for storage in graphics memory. In some embodiments of the invention, ROP 360 is included within each GPC 208 rather than within partition unit 215, and pixel read and write requests are transmitted over crossbar unit 210 instead of pixel fragment data.
The processed graphics data may be displayed on display device 110 or routed for further processing by CPU 102 or by one of the processing entities within parallel processing subsystem 112. Each partition unit 215 includes a ROP 360 in order to distribute processing of the raster operations. In some embodiments, ROP 360 may be configured to compress z or color data that is written to memory and decompress z or color data that is read from memory.
Persons skilled in the art will understand that the architecture described in Figs. 1, 2, 3A, and 3B in no way limits the scope of the present invention and that the techniques taught herein may be implemented on any properly configured processing unit, including, without limitation, one or more CPUs, one or more multi-core CPUs, one or more parallel processing units 202, one or more GPCs 208, one or more graphics or special-purpose processing units, or the like, without departing from the scope of the present invention.
Deadlock avoidance
When communication path 113 is a PCIe bus, write requests pending on the PCIe bus can block completions returning from system memory 104 from reaching parallel processing subsystem 202. Deadlock occurs when parallel processing subsystem 202 needs a completion before a pending write request can be processed. Embodiments of the invention provide a technique for routing completions over a PCIe virtual channel (VC) separate from the virtual channel used to send the write requests. Completions therefore cannot be blocked from reaching parallel processing subsystem 202, and deadlock is avoided.
Fig. 4 is a block diagram of a computer system 400 configured to avoid deadlock, according to one embodiment of the invention. As shown, computer system 400 includes a CPU 102 integrated with a memory bridge 105, a system memory 104, a Peripheral Component Interconnect Express (PCIe) bus 401, and a parallel processing subsystem 202. CPU 102 is coupled to system memory 104 through memory bridge 105. CPU 102 is also coupled to parallel processing subsystem 202 through memory bridge 105 and PCIe bus 401. CPU 102 can access memory units within parallel processing subsystem 202 through memory bridge 105 and PCIe bus 401. Likewise, parallel processing subsystem 202 accesses system memory 104 through PCIe bus 401 and memory bridge 105.
CPU 102 is the master processor of computer system 400 and is configured to issue requests, including read requests and write requests, to system memory 104 through memory bridge 105. CPU 102 also issues requests to parallel processing subsystem 202 through memory bridge 105 and PCIe bus 401.
Parallel processing subsystem 202 is a coprocessor configured to perform various processing operations for CPU 102, including data compression and decompression. Parallel processing subsystem 202 includes a PCIe interface 402 configured to receive requests from CPU 102 over PCIe bus 401 and to route those requests to different components of parallel processing subsystem 202 for processing. PCIe interface 402 also sends requests to CPU 102 or system memory 104 over PCIe bus 401. PCIe interface 402 routes data on PCIe bus 401 over different virtual channels (VCs), which include VC0 and VC1 (not shown).
Parallel processing subsystem 202 further includes a host 404, clients 406A-406N, an I/O unit 205, a crossbar unit 210, an L2 cache 350, and a parallel processing memory 204. I/O unit 205 allows the parallel processing subsystem to perform memory access operations and includes a memory management unit (MMU) arbiter 408, an MMU 410, a translation lookaside buffer (TLB) 412, and one or more iterators 414.
Host 404 is an engine that allows CPU 102 to access I/O unit 205. Host 404 is coupled to MMU arbiter 408 within I/O unit 205. Host 404 receives requests from CPU 102 and transmits these requests to MMU 410 through MMU arbiter 408. Clients 406A-406N are also coupled to MMU arbiter 408. Clients 406A-406N are engines that perform different functions, including memory management, graphics display, instruction fetching, encryption, texture processing, and video decoding. Clients 406A-406N are configured to issue requests to I/O unit 205.
MMU arbiter 408 arbitrates between host 404 and each of clients 406A-406N and allows these engines to access MMU 410. MMU arbiter 408 examines an engine ID associated with each request received from host 404 and clients 406A-406N; the engine ID indicates whether the request came from CPU 102. When the engine ID indicates that the request came from CPU 102, the request is marked as "special" by setting a "special" bit in the request to 1. The request is then routed to MMU 410.
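A minimal sketch of the arbiter-side marking follows, assuming a request header that carries an engine ID and a one-bit flag; the ID value and field names are hypothetical, not taken from the patent.

```c
#include <stdint.h>

enum { ENGINE_ID_HOST = 0 };   /* assumed ID for requests arriving via host 404 */

typedef struct {
    uint32_t engine_id;
    uint32_t special : 1;      /* the "special" bit described above */
} request_t;

/* MMU arbiter 408: mark CPU-originated traffic before forwarding to the MMU. */
static void mmu_arbiter_mark(request_t *req) {
    if (req->engine_id == ENGINE_ID_HOST)
        req->special = 1;      /* request came from CPU 102 */
}
```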
MMU 410 provides virtual-to-physical address translation for host 404 and clients 406A-406N. When host 404 and/or clients 406A-406N send a request to MMU 410 through MMU arbiter 408, MMU 410 translates the virtual address specified in the request into a physical address. The virtual-to-physical address translation may be accelerated using TLB 412. TLB 412 stores recently accessed virtual-to-physical address mappings. If a received virtual address is included in TLB 412, the physical address associated with that virtual address can be obtained quickly from TLB 412. If TLB 412 does not store the needed virtual-to-physical mapping, MMU 410 issues a read request to retrieve the page table that includes the needed virtual-to-physical address mapping.
A read request issued as the direct result of another request is referred to hereinafter as a "derived read request." If the original request that caused the derived read request to be generated is marked as special, MMU 410 marks the derived read request as special. MMU 410 sends the derived read request to PCIe interface 402. When the derived read request is not marked as special, PCIe interface 402 routes the derived read request on VC0; when the derived read request is marked as special, PCIe interface 402 routes the derived read request on VC1. Completions for requests not marked as special return on VC0, and completions for requests marked as special return on VC1. When the data associated with the request is received from system memory 104 along with the completion, processing of the original request continues.
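The two rules in this paragraph, namely that a derived read inherits its parent's marking and that the special bit selects the virtual channel, can be sketched as follows, again with assumed names.

```c
#include <stdint.h>

typedef enum { VC0 = 0, VC1 = 1 } vc_t;

typedef struct {
    uint64_t addr;
    uint32_t special : 1;
} request_t;

/* A derived read (e.g. a page-table fetch on a TLB miss) inherits the
 * marking of the request that caused it. */
static request_t make_derived_read(const request_t *parent, uint64_t pte_addr) {
    request_t derived = { pte_addr, parent->special };
    return derived;
}

/* PCIe interface 402: special traffic goes out on VC1, everything else
 * on VC0; the completion returns on the same channel. */
static vc_t select_vc(const request_t *req) {
    return req->special ? VC1 : VC0;
}
```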
MMU 410 sends the request and the physical address to one of the iterators 414. The iterator 414 translates the physical address into a crossbar raw address and sends the request, with the crossbar raw address, to crossbar unit 210. Crossbar unit 210 then routes the request to L2 cache 350.
L2 cache 350 is a low-latency memory unit that stores data that may be needed by I/O unit 205. L2 cache 350 includes a compression/decompression unit (not shown) that allows parallel processing subsystem 202 to compress and decompress data received from system memory 104 or stored in L2 cache 350. L2 cache 350 includes a tag store (not shown) containing tags that indicate the compression state of recently accessed cache lines of L2 cache 350.
When L2 cache 350 receives a write request targeting a particular cache line within L2 cache 350, L2 cache 350 uses the tag store to determine whether the target cache line is compressed. When the tag store does not include the compression state of the cache line indicated by the request, L2 cache 350 generates a derived read request to access a backing store (not shown) in system memory 104. L2 cache 350 sends the derived read request through crossbar unit 210 to PCIe interface 402. PCIe interface 402 determines whether the derived read request is marked as special and routes the derived read request on PCIe bus 401 accordingly. When system memory 104 returns the completion associated with the derived read request, the completion is sent on VC1 if the derived read request was marked as special, thereby avoiding a deadlock condition when write requests are pending on PCIe bus 401.
If the backing store indicates that the target cache line is compressed, L2 cache 350 decompresses the target cache line, merges the decompressed data with the data included in the write request, and writes the decompressed, merged data back into the cache line in L2 cache 350. L2 cache 350 may also update the tag store to include the compression state of the recently accessed cache line. In one embodiment, the merged data may be compressed again. When data is decompressed, it is stored in its decompressed form. The tag store indicates whether a tile is compressed and therefore needs to be decompressed, or whether it can be written directly without first being decompressed.
L2 cache 350 may also receive a write request that specifies a compressed region, or "compression tile," of system memory 104 to which CPU 102 needs to write data. Typically, compression tiles originate from parallel processing subsystem 202, although in one embodiment CPU 102 generates the compression tiles. L2 cache 350 receives the write request and generates a derived read request to access system memory 104 and read the compression tile from system memory 104. L2 cache 350 sends the derived read request through crossbar unit 210 to PCIe interface 402. PCIe interface 402 determines whether the derived read request is marked as special and routes the derived read request on PCIe bus 401 accordingly. If the derived read request is marked as special, system memory 104 returns the completion associated with the derived read request on VC1, thereby avoiding the deadlock condition that could otherwise occur when write requests are pending on PCIe bus 401. L2 cache 350 receives the compressed data returned by the derived read request, decompresses the compressed data, merges the write data with the decompressed data, compresses the merged data, and writes the compressed, merged data back to system memory 104.
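Both the cache-line case and the compression-tile case follow the same read-modify-write pattern. The sketch below assumes byte-granularity merging, a 4 KB upper bound on the decompressed size, and identity codec stubs, since the patent does not specify a compression format.

```c
#include <stddef.h>
#include <string.h>

/* Identity "codec" stubs; a real unit would implement the actual format. */
static size_t tile_decompress(const void *src, size_t n, void *dst) {
    memcpy(dst, src, n);
    return n;
}
static size_t tile_compress(const void *src, size_t n, void *dst) {
    memcpy(dst, src, n);
    return n;
}

/* Read-modify-write of a compressed region: the region was fetched by a
 * special derived read (completion on VC1); decompress it, merge the
 * CPU's write data, recompress, and write the result back. */
static void merge_write(void *tile, size_t tile_len,
                        size_t offset, const void *data, size_t len) {
    unsigned char plain[4096];              /* assumed max tile size */
    size_t n = tile_decompress(tile, tile_len, plain);
    if (offset + len <= n)
        memcpy(plain + offset, data, len);  /* merge the write data  */
    tile_compress(plain, n, tile);          /* write back compressed */
}
```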
Marking the requests sent by CPU 102 as special, and also marking as special the derived read requests generated in response to those requests, allows deadlock to be avoided, because the completions associated with requests marked as special are sent on VC1 rather than VC0. Using standard (non-PCIe) bus protocol techniques, a request could alternatively be marked as "relaxed ordering" or otherwise marked to indicate that its completion may be returned without regard to ordering rules. Although this technique has been described above with reference to particular circumstances in which deadlock can occur, persons skilled in the art will recognize that marking the requests sent by CPU 102 as special, and also marking as special the derived read requests generated in response to those requests, allows deadlock to be avoided whenever PCIe bus 401 has pending write requests.
Fig. 5 is a flow diagram of method steps for avoiding deadlock, according to one embodiment of the invention. Persons skilled in the art will understand that, although the method 500 is described in conjunction with the systems of Figs. 1-4, any system configured to perform the method steps, in any order, is within the scope of the invention.
As shown, method 500 begins at step 502, where MMU arbiter 408 receives a request from host 404 or one of clients 406A-406N. The request may be a read request or a write request. In addition, the request may target L2 cache 350, parallel processing memory 204, or system memory 104. At step 504, MMU arbiter 408 examines the engine ID associated with the request. The engine ID indicates the source of the request. For example, the request may have been issued by one of clients 406A-406N or, alternatively, by CPU 102 through host 404. At step 506, MMU arbiter 408 determines whether CPU 102 issued the request. If the engine ID indicates that CPU 102 issued the request, method 500 proceeds to step 508. If, at step 506, MMU arbiter 408 determines that CPU 102 did not issue the request, method 500 proceeds to step 518.
At step 508, MMU arbiter 408 marks the request as special. MMU arbiter 408 is configured to set a bit in the request to "1" to indicate that the request was issued by CPU 102. At step 510, the request causes a derived read request to be generated. Requests cause derived read requests to be generated under various circumstances. For example, when MMU 410 needs to read system memory 104 in order to perform virtual-to-physical address translation, MMU 410 generates a derived read request targeting system memory 104. Alternatively, when L2 cache 350 needs to read system memory 104 in order to determine the compression state of a cache line, L2 cache 350 generates a derived read request targeting system memory 104. Various other situations in which an original request causes a derived read request to be generated are possible.
At step 512, the derived read request is marked as special. When MMU 410 generates the derived read request, MMU 410 marks the derived read request as special. When L2 cache 350 generates the derived read request, L2 cache 350 marks the derived read request as special. If another component of parallel processing subsystem 202 generates the derived read request, that component marks the derived read request as special. At step 514, PCIe interface 402 receives and examines the request. The request may be the derived read request or, alternatively, a different request.
At step 516, PCIe interface 402 determines whether the request is marked as special. If the request is not marked as special, method 500 proceeds to step 518, where PCIe interface 402 routes the request, and the completion associated with the request, over PCIe bus 401 on VC0. If the request is marked as special, method 500 proceeds to step 520, where PCIe interface 402 routes the request, and the completion associated with the request, over PCIe bus 401 on VC1. The method then terminates.
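The following is a runnable walkthrough of method 500, with the Fig. 5 step numbers in comments; the engine-ID value and the types are the same illustrative assumptions used in the earlier sketches.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { VC0 = 0, VC1 = 1 } vc_t;
typedef struct { uint32_t engine_id; uint32_t special : 1; } request_t;

enum { ENGINE_ID_HOST = 0 };    /* assumed ID for the CPU path */

int main(void) {
    request_t req = { ENGINE_ID_HOST, 0 };        /* step 502: receive request */
    if (req.engine_id == ENGINE_ID_HOST)          /* steps 504-506: check ID   */
        req.special = 1;                          /* step 508: mark special    */
    request_t derived = { req.engine_id,
                          req.special };          /* steps 510-512: derive     */
    vc_t vc = derived.special ? VC1 : VC0;        /* steps 514-520: route      */
    printf("derived read routed on VC%d\n", (int)vc);
    return 0;
}
```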
In sum, a parallel processing unit (PPU) marks requests received from a central processing unit (CPU) as "special," so that derived read requests generated in response to those requests are also marked as special and are therefore routed on a secondary virtual channel of the Peripheral Component Interconnect Express (PCIe) bus. The PPU transmits requests marked as special over virtual channel (VC) 1 of the PCIe bus. If a request marked as special generates a completion, the completion returns over VC1 of the PCIe bus.
Advantageously, because the completions associated with requests marked as special are sent on a different virtual channel, a completion returning for a request marked as special cannot cause deadlock when write requests are present on VC0.
Thus, embodiments of the invention provide a technique for using a status bit, propagated through the fabric, to identify and mark certain requests (e.g., read and write requests) sent by CPU 102 to parallel processing subsystem 202 that could cause deadlock, and for also using this status bit to mark any derived transactions generated by those requests. In other embodiments, mechanisms defined by a standard bus interface (e.g., "relaxed ordering") can be used to avoid deadlock.
Note that certain transactions sent by parallel processing subsystem 202 to system memory 104 cannot cause deadlock and therefore are not transmitted on the second virtual channel. For example, transactions that serve as synchronization primitives, or that otherwise rely on the ordering rule that completions may not pass write requests, are sent on the first virtual channel. For instance, when releasing a semaphore, parallel processing subsystem 202 guarantees that all write transactions initiated earlier by the CPU have reached the coherence point by the time CPU 102 detects that the write to parallel processing memory 204 has completed.
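This carve-out can be folded into the same hypothetical routing switch; the `is_sync_primitive` flag below is an assumption introduced for illustration, not a field named in the patent:

```c
/* Sketch of the carve-out: transactions serving as synchronization
 * primitives depend on VC0's rule that completions do not pass earlier
 * posted writes, so they stay on VC0 even when they would otherwise
 * qualify as special. */
static void route_with_ordering(const request_t *req, bool is_sync_primitive)
{
    if (is_sync_primitive)
        pcie_send(VC0, req);  /* preserve posted-write ordering */
    else
        pcie_send(req->special ? VC1 : VC0, req);
}
```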
One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define the functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media on which information is permanently stored (e.g., read-only memory devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any type of solid-state non-volatile semiconductor memory); and (ii) writable storage media on which alterable information is stored (e.g., floppy disks within a diskette drive or a hard-disk drive, or any type of solid-state random-access semiconductor memory).
The invention has been described above with reference to specific embodiments. Persons skilled in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (10)

1. A computer system, comprising:
a first processing unit, a second processing unit, a memory bridge, a system memory, and a bus that couples the second processing unit to the first processing unit, the memory bridge, and the system memory via a first virtual channel and a second virtual channel;
wherein the second processing unit includes a bus interface unit configured to: (i) receive a read or write request from the first processing unit via the first virtual channel; and (ii) transmit, on the second virtual channel, a derived read request generated while processing the read or write request.
2. The computer system of claim 1, wherein the second processing unit further includes a memory management unit having a translation lookaside buffer, and the memory management unit generates the derived read request when a miss occurs in the translation lookaside buffer.
3. The computer system of claim 1, further comprising a local memory for the second processing unit, wherein the second processing unit is coupled to the local memory through a cache memory unit, and the cache memory unit generates the derived read request.
4. The computer system of claim 3, wherein the cache memory unit generates the derived read request when the read or write request accesses compression state information that is not stored in the cache memory unit.
5. The computer system of claim 3, wherein the cache memory unit generates the derived read request when the read or write request accesses data in a compressed region of the system memory.
6. The computer system of claim 1, wherein the second processing unit further includes a memory management unit arbiter configured to receive read and write requests from a plurality of clients and, if a read or write request is received from the first processing unit, to mark that read or write request as special.
7. The computer system of claim 6, wherein each of the clients has a client identifier, and the memory management unit arbiter is configured to examine the client identifier associated with each read or write request.
8. The computer system of claim 1, wherein the first processing unit is a central processing unit, the second processing unit is a parallel processing unit, and the bus is a PCIe bus.
9. In a computer system having a first processing unit, a second processing unit, a memory bridge, a system memory, and a bus that couples the second processing unit to the first processing unit, the memory bridge, and the system memory, a method for processing a read or write request in the second processing unit, the method comprising the steps of:
receiving, at the second processing unit, the read or write request via a first virtual channel of the bus;
generating, at the second processing unit, one or more derived read requests while processing the read or write request, and transmitting the derived read requests to the system memory via a second virtual channel of the bus;
receiving, via the second virtual channel of the bus, completions for the derived read requests; and
completing the originally received read or write request.
10. The method of claim 9, wherein the derived read request is generated when, while the read or write request is being processed, a compressed region of the system memory is accessed.
CN200910249698XA 2008-12-12 2009-12-14 Deadlock avoidance by marking CPU traffic as special Pending CN101901198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510605017.4A CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12/334,394 US8392667B2 (en) 2008-12-12 2008-12-12 Deadlock avoidance by marking CPU traffic as special
US12/334,394 2008-12-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201510605017.4A Division CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special

Publications (1)

Publication Number Publication Date
CN101901198A true CN101901198A (en) 2010-12-01

Family

ID=41572725

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201510605017.4A Pending CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special
CN200910249698XA Pending CN101901198A (en) 2008-12-12 2009-12-14 Deadlock avoidance by marking CPU traffic as special

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201510605017.4A Pending CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special

Country Status (6)

Country Link
US (1) US8392667B2 (en)
JP (1) JP5127815B2 (en)
KR (1) KR101086507B1 (en)
CN (2) CN105302524A (en)
DE (1) DE102009047518B4 (en)
GB (1) GB2466106B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539130B2 (en) * 2009-09-24 2013-09-17 Nvidia Corporation Virtual channels for effective packet transfer
US8789170B2 (en) 2010-09-24 2014-07-22 Intel Corporation Method for enforcing resource access control in computer systems
CN102497527B (en) * 2011-12-16 2013-11-27 杭州海康威视数字技术股份有限公司 Multi-processor video processing system and video image synchronous transmission and display method thereof
US9324126B2 (en) * 2012-03-20 2016-04-26 Massively Parallel Technologies, Inc. Automated latency management and cross-communication exchange conversion
US9075952B2 (en) * 2013-01-17 2015-07-07 Intel Corporation Controlling bandwidth allocations in a system on a chip (SoC)
WO2014152800A1 (en) 2013-03-14 2014-09-25 Massively Parallel Technologies, Inc. Project planning and debugging from functional decomposition
US10019375B2 (en) 2016-03-02 2018-07-10 Toshiba Memory Corporation Cache device and semiconductor device including a tag memory storing absence, compression and write state information
US9996471B2 (en) * 2016-06-28 2018-06-12 Arm Limited Cache with compressed data and tag
KR102588143B1 (en) 2018-11-07 2023-10-13 삼성전자주식회사 Storage device including memory controller and method of operating electronic systme including memory
CN111382849B (en) * 2018-12-28 2022-11-22 上海寒武纪信息科技有限公司 Data compression method, processor, data compression device and storage medium
US11296995B2 (en) 2020-08-31 2022-04-05 Micron Technology, Inc. Reduced sized encoding of packet length field
US11360920B2 (en) * 2020-08-31 2022-06-14 Micron Technology, Inc. Mapping high-speed, point-to-point interface channels to packet virtual channels
US11418455B2 (en) 2020-08-31 2022-08-16 Micron Technology, Inc. Transparent packet splitting and recombining
US11412075B2 (en) 2020-08-31 2022-08-09 Micron Technology, Inc. Multiple protocol header processing
US11539623B2 (en) 2020-08-31 2022-12-27 Micron Technology, Inc. Single field for encoding multiple elements

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5696927A (en) * 1995-12-21 1997-12-09 Advanced Micro Devices, Inc. Memory paging system and method including compressed page mapping hierarchy
US6104417A (en) 1996-09-13 2000-08-15 Silicon Graphics, Inc. Unified memory computer architecture with dynamic graphics memory allocation
US6026451A (en) * 1997-12-22 2000-02-15 Intel Corporation System for controlling a dispatch of requested data packets by generating size signals for buffer space availability and preventing a dispatch prior to a data request granted signal asserted
US6349372B1 (en) 1999-05-19 2002-02-19 International Business Machines Corporation Virtual uncompressed cache for compressed main memory
US6950438B1 (en) * 1999-09-17 2005-09-27 Advanced Micro Devices, Inc. System and method for implementing a separate virtual channel for posted requests in a multiprocessor computer system
JP4906226B2 (en) 2000-08-17 2012-03-28 アドバンスト・マイクロ・ディバイシズ・インコーポレイテッド System and method for implementing a separate virtual channel for posted requests in a multiprocessor computer system
US6574708B2 (en) * 2001-05-18 2003-06-03 Broadcom Corporation Source controlled cache allocation
US6807599B2 (en) * 2001-10-15 2004-10-19 Advanced Micro Devices, Inc. Computer system I/O node for connection serially in a chain to a host
US7165131B2 (en) * 2004-04-27 2007-01-16 Intel Corporation Separating transactions into different virtual channels
US20050237329A1 (en) 2004-04-27 2005-10-27 Nvidia Corporation GPU rendering to system memory
US7748001B2 (en) * 2004-09-23 2010-06-29 Intel Corporation Multi-thread processing system for detecting and handling live-lock conditions by arbitrating livelock priority of logical processors based on a predertermined amount of time
US7499452B2 (en) * 2004-12-28 2009-03-03 International Business Machines Corporation Self-healing link sequence counts within a circular buffer
CN100543770C (en) * 2006-07-31 2009-09-23 辉达公司 The special mechanism that is used for the page or leaf mapping of GPU
US20080028181A1 (en) 2006-07-31 2008-01-31 Nvidia Corporation Dedicated mechanism for page mapping in a gpu

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103532876A (en) * 2013-10-23 2014-01-22 中国科学院声学研究所 Processing method and system of data stream
WO2018019009A1 (en) * 2016-07-25 2018-02-01 中兴通讯股份有限公司 Data processing method and system, peripheral component interconnect express device and host
CN109582589A (en) * 2017-09-28 2019-04-05 瑞萨电子株式会社 Semiconductor equipment and memory access method
CN109582589B (en) * 2017-09-28 2023-12-15 瑞萨电子株式会社 Semiconductor device and memory access method
CN109343984A (en) * 2018-10-19 2019-02-15 珠海金山网络游戏科技有限公司 Data processing method, calculates equipment and storage medium at system
CN109343984B (en) * 2018-10-19 2020-05-19 珠海金山网络游戏科技有限公司 Data processing method, system, computing device and storage medium

Also Published As

Publication number Publication date
JP2010140480A (en) 2010-06-24
GB2466106B (en) 2011-03-30
GB2466106A (en) 2010-06-16
US8392667B2 (en) 2013-03-05
CN105302524A (en) 2016-02-03
DE102009047518B4 (en) 2014-07-03
KR20100068225A (en) 2010-06-22
US20100153658A1 (en) 2010-06-17
KR101086507B1 (en) 2011-11-23
JP5127815B2 (en) 2013-01-23
DE102009047518A1 (en) 2010-07-08
GB0920727D0 (en) 2010-01-13

Similar Documents

Publication Publication Date Title
CN101901198A (en) Deadlock avoidance by marking CPU traffic as special
CN101739357B (en) Multi-class data cache policies
CN101714247B (en) Single pass tessellation
CN101751285B (en) Centralized device virtualization layer for heterogeneous processing units
US7516301B1 (en) Multiprocessor computing systems with heterogeneous processors
CN102696023B (en) Unified addressing and instructions for accessing parallel memory spaces
US9024946B2 (en) Tessellation shader inter-thread coordination
US10169072B2 (en) Hardware for parallel command list generation
CN101751344A (en) A compression status bit cache and backing store
US8542247B1 (en) Cull before vertex attribute fetch and vertex lighting
US9589310B2 (en) Methods to facilitate primitive batching
US8698802B2 (en) Hermite gregory patch for watertight tessellation
GB2492653A (en) Simultaneous submission to a multi-producer queue by multiple threads
US8624910B2 (en) Register indexed sampler for texture opcodes
CN103810743A (en) Setting downstream render state in an upstream shader
US9436969B2 (en) Time slice processing of tessellation and geometry shaders
GB2491490A (en) Emitting coherent output from multiple execution threads using the printf command
US8310482B1 (en) Distributed calculation of plane equations
US8570916B1 (en) Just in time distributed transaction crediting
US8704835B1 (en) Distributed clip, cull, viewport transform and perspective correction
US8976185B2 (en) Method for handling state transitions in a network of virtual processing nodes
US9147224B2 (en) Method for handling state transitions in a network of virtual processing nodes
US9542192B1 (en) Tokenized streams for concurrent execution between asymmetric multiprocessors
US8319783B1 (en) Index-based zero-bandwidth clears
US8330766B1 (en) Zero-bandwidth clears

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20101201