CN101901198A - Deadlock avoidance by marking CPU traffic as special - Google Patents


Info

Publication number: CN101901198A
Application number: CN200910249698XA
Authority: CN (China)
Prior art keywords: request, processing unit, unit, read, parallel processing
Legal status: Pending
Other languages: Chinese (zh)
Inventors: Samuel H. Duncan, David B. Glasco, Wei-Je Huang, Atul Kalambur, Patrick R. Marchand, Dennis K. Ma
Current Assignee: Nvidia Corp
Original Assignee: Nvidia Corp
Application filed by Nvidia Corp
Priority: CN201510605017.4A (published as CN105302524A)
Publication of CN101901198A

Classifications

    • G06F13/14 - Handling requests for interconnection or transfer
    • G06F13/1652 - Handling requests for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F13/1663 - Access to shared memory
    • G06F13/36 - Handling requests for access to common bus or bus system
    • G06F13/376 - Access to common bus or bus system with decentralised access control using a contention resolving method, e.g. collision detection, collision avoidance
    • G06F13/4036 - Coupling between buses using bus bridges with arbitration and deadlock prevention
    • G06F13/4221 - Bus transfer protocol, e.g. handshake; synchronisation on a parallel input/output bus, e.g. ISA bus, EISA bus, PCI bus, SCSI bus
    • G06F15/173 - Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F9/3877 - Concurrent instruction execution using a slave processor, e.g. coprocessor


Abstract

Deadlocks are avoided by marking read requests issued by a parallel processor to system memory as ''special.'' Read completions associated with read requests marked as special are routed on virtual channel 1 of the PCIe bus. Data returning on virtual channel 1 cannot become stalled by write requests in virtual channel 0, thus avoiding a potential deadlock.

Description

Deadlock avoidance by marking CPU traffic as special
Technical field
The present invention relates generally to computer hardware, and more particularly to a method and system for avoiding deadlock by marking CPU traffic as special.
Background
A conventional computer system includes a central processing unit (CPU) and may also include a coprocessor known as a parallel processing unit (PPU). The CPU offloads certain processing operations to the PPU to reduce its own processing workload. Among these operations are compression and decompression. When the CPU requires such an operation, it issues requests, including read requests and/or write requests, to the PPU. For example, the CPU may need to write data to a region of system memory that is stored in a compressed format. The CPU sends a write request to the PPU; the PPU may then read and decompress the data associated with the write request, merge the decompressed data with the new data, and write the merged data back into system memory.
Sometimes a write request sent by the CPU causes the PPU to issue one or more "derived" read requests that must complete before the original write request can complete. For example, the PPU may issue a derived read request targeting a system memory unit associated with the CPU. When the read transaction finishes, system memory sends a read completion to the PPU, notifying the PPU that the transaction is complete.
However, a problem may arise when the CPU and the PPU are connected by a Peripheral Component Interconnect Express (PCIe) bus that has one or more pending write requests. Because of the ordering rules of the PCIe bus, a read completion cannot pass a write request, so the completion for any derived read request cannot return to the PPU. The original write request therefore cannot complete. This situation is known in the art as a circular dependency, or "deadlock." A deadlock halts some or all communication between the CPU and the PPU and adversely affects the processing throughput of the computer system. Some examples of deadlock conditions are discussed below.
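To make the circular dependency concrete, the sketch below models a single PCIe virtual channel as a FIFO under the standard rule that a read completion may not pass a posted write. The types and names are illustrative assumptions, not part of the patent.

```c
#include <stdbool.h>
#include <stdio.h>

/* One PCIe virtual channel modeled as a FIFO toward the PPU. */
typedef enum { POSTED_WRITE, READ_COMPLETION } txn_type;

typedef struct {
    txn_type type;
    bool waits_on_completion; /* the write cannot retire until a completion arrives */
} txn;

int main(void) {
    /* A pending write sits ahead of the very completion it is waiting for. */
    txn vc0[2] = {
        { POSTED_WRITE,    true  },
        { READ_COMPLETION, false },
    };
    /* PCIe ordering: the completion may not pass the write ahead of it,
     * and the write cannot retire without the completion: deadlock.    */
    bool deadlock = vc0[0].waits_on_completion &&
                    vc0[1].type == READ_COMPLETION;
    printf("deadlock: %s\n", deadlock ? "yes" : "no");
    return 0;
}
```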
In a first example, deadlock may occur if the PPU needs to read from a page table stored in system memory while write requests are pending on the PCIe bus. When the PPU issues a read request to system memory to fetch an entry from the page table, the completion associated with the read request cannot return to the PPU, so the original write request cannot complete.
Deadlock may also occur when the CPU sends the PPU a write request targeting a cache line in a cache memory unit associated with the PPU. To complete the write request, the PPU first determines whether the cache line is compressed by examining a tag store. The tag store indicates the compression state of recently accessed cache lines in the cache memory unit. When the tag store does not include the compression state of the cache line specified by the write request, the PPU issues a read request to access a backing store in system memory that holds the compression state of every cache line in the cache memory unit. The backing store returns the compression state of the specified cache line and sends a completion. However, when write requests are pending on the PCIe bus, the completion associated with the read request cannot pass those pending writes, and deadlock may occur.
A third kind of deadlock may occur when the CPU attempts to write data into a compressed region of system memory, known in the art as a "compression tile." The CPU sends the PPU a write request that specifies the compression tile and includes the write data. The PPU issues a read request to system memory to read the compression tile. When write requests are pending on the PCIe bus, the completion associated with the read request cannot pass those pending writes, so deadlock may occur.
Beyond these three examples, several other situations can cause deadlock. Accordingly, there remains a need in the art for a method and system for avoiding deadlock.
Summary of the invention
Embodiments of the invention provide a method and system for avoiding deadlock in a computer system that has a first processing unit, a second processing unit, a memory bridge, a system memory, and a bus connecting the second processing unit to the first processing unit, the memory bridge, and the system memory. Deadlock is avoided when the first processing unit sends read or write requests to the second processing unit.
A method for avoiding deadlock, according to an embodiment of the invention, includes: receiving a read or write request at the second processing unit on a first virtual channel of the bus; generating a derived read request while processing the read or write request; sending the derived read request to system memory over a second virtual channel of the bus; receiving the completion of the derived read request on the second virtual channel of the bus; and completing the received read or write request.
A system for avoiding deadlock, according to an embodiment of the invention, includes a bus interface unit within the second processing unit. The bus interface unit is configured to receive read or write requests from the first processing unit on a first virtual channel and to send derived read requests, generated while processing those read or write requests, over a second virtual channel.
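Taken together, the method and system amount to the transaction sequence sketched below. The bus primitives and type names are assumptions for illustration; the patent does not define a software interface.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { VC0 = 0, VC1 = 1 } vc_t;
typedef struct { uint64_t addr; int is_write; int special; } request_t;

/* Stub bus primitives standing in for the bus interface unit. */
static void bus_send(vc_t vc, const request_t *req) {
    printf("%s sent on VC%d\n", req->is_write ? "write" : "read", (int)vc);
}
static void bus_wait_completion(vc_t vc) {
    printf("completion received on VC%d\n", (int)vc);
}

/* The original request arrives on the first virtual channel (VC0); the
 * derived read travels on the second (VC1), so its completion cannot
 * queue behind pending VC0 writes. */
static void handle_request(const request_t *incoming) {
    request_t derived = { 0x1000 /* e.g. a page-table entry */, 0,
                          incoming->special };
    bus_send(VC1, &derived);      /* derived read to system memory */
    bus_wait_completion(VC1);     /* completion returns on VC1     */
    /* ... the original read or write request can now complete ... */
}

int main(void) {
    request_t w = { 0x2000, 1, 1 };  /* CPU write, marked special */
    handle_request(&w);
    return 0;
}
```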
Brief description of the drawings
So that the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, is given with reference to embodiments, some of which are illustrated in the appended drawings. It should be noted, however, that the appended drawings illustrate only typical embodiments of the invention and therefore should not be considered to limit its scope, for the invention may admit other equally effective embodiments.
Fig. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention;
Fig. 2 is a block diagram of a parallel processing subsystem for the computer system of Fig. 1, according to one embodiment of the invention;
Fig. 3A is a block diagram of a general processing cluster (GPC) within one of the parallel processing units (PPUs) of Fig. 2, according to one embodiment of the invention;
Fig. 3B is a block diagram of a partition unit within one of the parallel processing units of Fig. 2, according to one embodiment of the invention;
Fig. 4 is a block diagram of a computer system configured to avoid deadlock, according to one embodiment of the invention; and
Fig. 5 is a flow diagram of method steps for avoiding deadlock, according to one embodiment of the invention.
Detailed description
In the following description, numerous specific details are set forth to provide a more thorough understanding of the invention. However, it will be apparent to one skilled in the art that the invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the invention.
System overview
Fig. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via a bus path through a memory bridge 105. Memory bridge 105 may be integrated into CPU 102 as shown in Fig. 1. Alternatively, memory bridge 105 may be a conventional device, such as a Northbridge chip, connected to CPU 102 via a bus. Memory bridge 105 is connected to an I/O (input/output) bridge 107 via a communication path 106 (e.g., a HyperTransport link). I/O bridge 107, which may be, for example, a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or other communication path 113 (e.g., a PCI Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 (e.g., a conventional CRT- or LCD-based monitor). A system disk 114 is also connected to I/O bridge 107. A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including USB or other port connections, CD drives, DVD drives, film recording devices, and the like, may also be connected to I/O bridge 107. The communication paths interconnecting the various components in Fig. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect), PCI Express (PCI-E), AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol, and connections between different devices may use different protocols, as is known in the art.
In one embodiment, parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU). In another embodiment, parallel processing subsystem 112 incorporates circuitry optimized for general-purpose processing while preserving the underlying computational architecture, described in greater detail herein. In yet another embodiment, parallel processing subsystem 112 may be integrated with one or more other system elements, such as memory bridge 105, CPU 102, and I/O bridge 107, to form a system on chip (SoC).
It will be appreciated that the system shown here is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, may be modified as desired. For instance, in some embodiments, system memory 104 is connected to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, one or more of CPU 102, I/O bridge 107, parallel processing subsystem 112, and memory bridge 105 are integrated into one or more chips. The particular components shown here are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.
Fig. 2 illustrates a parallel processing subsystem 112, according to one embodiment of the invention. As shown, parallel processing subsystem 112 includes one or more parallel processing units (PPUs) 202, each of which is coupled to a local parallel processing (PP) memory 204. In general, a parallel processing subsystem includes a number U of PPUs, where U >= 1. (Herein, multiple instances of like objects are denoted with reference numbers identifying the object and parenthetical numbers identifying the instance where needed.) PPUs 202 and parallel processing memories 204 may be implemented using one or more integrated circuit devices, such as programmable processors, application specific integrated circuits (ASICs), or memory devices, or in any other technically feasible fashion.
Referring again to Fig. 1, in some embodiments, some or all of the parallel processing units 202 in parallel processing subsystem 112 are graphics processors with rendering pipelines that can be configured to perform various tasks related to: generating pixel data from graphics data supplied by CPU 102 and/or system memory 104; interacting with local parallel processing memory 204 (which can be used as graphics memory including, e.g., a conventional frame buffer) to store and update pixel data; delivering pixel data to display device 110; and the like. In some embodiments, parallel processing subsystem 112 may include one or more parallel processing units 202 that operate as graphics processors and one or more other parallel processing units 202 that are used for general-purpose computations. The parallel processing units may be identical or different, and each parallel processing unit 202 may have its own dedicated parallel processing memory device(s) or no dedicated parallel processing memory device(s). One or more parallel processing units 202 may output data to display device 110, or each parallel processing unit 202 may output data to one or more display devices 110.
Referring to Fig. 2, in some embodiments there may be no local parallel processing memory 204; in that case, memory references are reflected back to system memory 104 (not shown here), by way of a local cache (not shown), through crossbar unit 210 and I/O unit 205.
In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating the operation of the other system components. In particular, CPU 102 issues commands that control the operation of parallel processing units 202. In some embodiments, CPU 102 writes a stream of commands for each parallel processing unit 202 to a command buffer (not explicitly shown in Figs. 1 and 2), which may be located in system memory 104, parallel processing memory 204, or another storage location accessible to both CPU 102 and parallel processing unit 202. A parallel processing unit 202 reads the command stream from the command buffer and then executes the commands asynchronously with respect to the operation of CPU 102. CPU 102 may also create data buffers that parallel processing units 202 can read in response to commands in the command buffer. Each command and data buffer may be read by each of the parallel processing units 202.
Referring back now to Fig. 2, each parallel processing unit 202 includes an I/O (input/output) unit 205 that communicates with the rest of computer system 100 via communication path 113, which connects to memory bridge 105 (or, in one alternative embodiment, directly to CPU 102). The connection of parallel processing unit 202 to the rest of computer system 100 may also vary. In some embodiments, parallel processing subsystem 112 is implemented as an add-in card that can be inserted into an expansion slot of computer system 100. In other embodiments, a parallel processing unit 202 can be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. In still other embodiments, some or all elements of parallel processing unit 202 may be integrated on a single chip with CPU 102.
In one embodiment, communication path 113 is a PCI-E link, in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to parallel processing memory 204) may be directed to a memory crossbar unit 210. Host interface 206 reads each command buffer and outputs the work specified by the command buffer to a front end 212.
Each PPU 202 advantageously implements a highly parallel processing architecture. As shown in detail, PPU 202(0) includes a processing cluster array 230 that includes a number C of general processing clusters (GPCs) 208, where C >= 1. Each GPC 208 is capable of executing a large number (e.g., hundreds or thousands) of threads concurrently, where each thread is an instance of a program. In various applications, different GPCs 208 may be allocated for processing different types of programs or for performing different types of computations. For example, in a graphics application, a first set of GPCs 208 may be allocated to perform tessellation operations and to produce primitive topologies for patches, and a second set of GPCs 208 may be allocated to perform tessellation shading to evaluate patch parameters for the primitive topologies and to determine vertex positions and other per-vertex attributes. The allocation of GPCs 208 may vary depending on the workload arising for each type of program or computation. Alternatively, GPCs 208 may be allocated to perform processing tasks using a time-slice scheme to switch between different processing tasks.
GPCs 208 receive processing tasks to be executed via a work distribution unit 200, which receives commands defining processing tasks from front end unit 212. Processing tasks include pointers to the data to be processed, e.g., surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). Work distribution unit 200 may be configured to fetch the pointers corresponding to the processing tasks, work distribution unit 200 may receive the pointers from front end 212, or work distribution unit 200 may receive the data directly from front end 212. In some embodiments, indices specify the location of the data in an array. Front end 212 ensures that GPCs 208 are configured to a valid state before the processing specified by the command buffers is initiated.
When parallel processing unit 202 is used for graphics processing, for example, the processing workload for each patch is divided into approximately equal-sized tasks so that tessellation processing can be distributed to multiple GPCs 208. Work distribution unit 200 may be configured to output tasks at a frequency capable of providing tasks to multiple GPCs 208 for processing. In some embodiments of the invention, portions of GPCs 208 are configured to perform different types of processing. For example, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading in screen space to produce a rendered image. The ability to allocate portions of GPCs 208 for performing different types of processing tasks efficiently accommodates any expansion and contraction of the data produced by those different types of processing tasks. Intermediate data produced by GPCs 208 may be buffered, allowing the intermediate data to be transmitted between GPCs 208 with minimal stalling in cases where the rate at which a downstream GPC 208 accepts data lags the rate at which an upstream GPC 208 produces data.
Memory interface 214 may be partitioned into a number D of memory partition units, each coupled to a portion of parallel processing memory 204, where D >= 1. Each portion of parallel processing memory 204 generally includes one or more memory devices (e.g., DRAM 220). Persons skilled in the art will appreciate that DRAM 220 may be replaced by other suitable storage devices and can generally be of conventional design; a detailed description is therefore omitted. In one embodiment, DRAM 220 may be omitted entirely, and memory requests are reflected back to memory bridge 105 through crossbar 210 and I/O unit 205. Render targets, such as frame buffers or texture maps, may be stored across DRAMs 220, allowing partition units 215 to write portions of each render target in parallel to efficiently use the available bandwidth of parallel processing memory 204.
Any one of GPCs 208 may process data to be written to any of the partition units 215 within parallel processing memory 204. Crossbar unit 210 is configured to route the output of each GPC 208 to the input of any partition unit 215 or to another GPC 208 for further processing. GPCs 208 communicate with memory interface 214 through crossbar unit 210 to read from or write to various external memory devices. In one embodiment, crossbar unit 210 has a connection to memory interface 214 to communicate with I/O unit 205, as well as a connection to local parallel processing memory 204, thereby enabling the processing cores within the different GPCs 208 to communicate with system memory 104 or with other memory that is not local to parallel processing unit 202. Crossbar unit 210 may use virtual channels to separate traffic streams between the GPCs 208 and partition units 215.
In addition, GPCs 208 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to linear and nonlinear data transforms, filtering of video and/or audio data, modeling operations (e.g., applying laws of physics to determine position, velocity, and other attributes of objects), image rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or pixel shader programs), and so on. Parallel processing units 202 may transfer data from system memory 104 and/or local parallel processing memories 204 into internal (on-chip) memory, process the data, and write result data back to system memory 104 and/or local parallel processing memories 204, where such data can be accessed by other system components, including CPU 102 or another parallel processing subsystem 112.
A parallel processing unit 202 may be provided with any amount of local parallel processing memory 204, including no local memory, and may use local memory and system memory in any combination. For instance, a parallel processing unit 202 can be a graphics processor in a unified memory architecture (UMA) embodiment. In such embodiments, little or no dedicated graphics (parallel processing) memory would be provided, and parallel processing unit 202 would use system memory exclusively or almost exclusively. In UMA embodiments, a parallel processing unit 202 may be integrated into a bridge chip or processor chip, or provided as a discrete chip with a high-speed link (e.g., PCI-E) connecting the parallel processing unit 202 to system memory via a bridge chip or other communication means.
As noted above, any number of parallel processing units 202 can be included in a parallel processing subsystem 112. For instance, multiple parallel processing units 202 can be provided on a single add-in card, multiple add-in cards can be connected to communication path 113, or one or more parallel processing units 202 can be integrated into a bridge chip. The parallel processing units 202 in a multi-PPU system may be identical to or different from one another. For instance, different parallel processing units 202 might have different numbers of processing cores, different amounts of local parallel processing memory, and so on. Where multiple parallel processing units 202 are present, those parallel processing units may be operated in parallel to process data at a higher throughput than is possible with a single parallel processing unit 202. Systems incorporating one or more parallel processing units 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.
Processing cluster array overview
Fig. 3A is a block diagram of a GPC 208 within one of the parallel processing units 202 of Fig. 2, according to one embodiment of the invention. Each GPC 208 may be configured to execute a large number of threads in parallel, where the term "thread" refers to an instance of a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each GPC 208. Unlike a SIMD execution regime, in which all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Persons skilled in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime.
In graphics applications, a GPC 208 may be configured to implement a primitive engine for performing screen-space graphics processing functions that may include, but are not limited to, primitive setup, rasterization, and Z culling. The primitive engine 304 receives processing tasks from work distribution unit 200, and when a processing task does not require an operation performed by the primitive engine, the processing task is passed through the primitive engine to a pipeline manager 305. Operation of GPC 208 is advantageously controlled via pipeline manager 305, which distributes processing tasks to streaming multiprocessors (SPMs) 310. Pipeline manager 305 may also be configured to control a work distribution crossbar 330 by specifying destinations for processed data output by SPMs 310.
In one embodiment, each GPC 208 includes a number M of SPMs 310, where M >= 1, each SPM 310 configured to process one or more thread groups. Also, each SPM 310 advantageously includes an identical set of functional units (e.g., arithmetic logic units, etc.) that may be pipelined, allowing a new instruction to be issued before a previous instruction has finished, as is known in the art. Any combination of functional units may be provided. In one embodiment, the functional units support a variety of operations, including integer and floating-point arithmetic (e.g., addition and multiplication), comparison operations, Boolean operations (AND, OR, XOR), bit-shifting, and computation of various algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions, etc.); and the same functional-unit hardware can be leveraged to perform different operations.
The series of instructions transmitted to a particular GPC 208 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SPM 310 is referred to herein as a "thread group." As used herein, a thread group refers to a group of threads concurrently executing the same program on different input data, with each thread of the group assigned to a different processing engine within an SPM 310. A thread group may include fewer threads than the number of processing engines within the SPM 310, in which case some processing engines will be idle during cycles when that thread group is being processed. A thread group may also include more threads than the number of processing engines within the SPM 310, in which case processing will take place over multiple clock cycles. Since each SPM 310 can support up to G thread groups concurrently, up to GxM thread groups can be executing in GPC 208 at any given time.
Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within an SPM 310. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA"). The size of a particular CTA is equal to m*k, where k is the number of concurrently executing threads in a thread group and is typically an integer multiple of the number of parallel processing engines within the SPM 310, and m is the number of thread groups simultaneously active within the SPM 310. The size of a CTA is generally determined by the programmer and by the amount of hardware resources, such as memory or registers, available to the CTA.
An exclusive local address space is available to each thread, and a shared per-CTA address space is used to pass data between threads within a CTA. Data stored in the per-thread local address space and the per-CTA address space is stored in an L1 cache 320, and an eviction policy may be used to favor keeping the data in L1 cache 320. Each SPM 310 uses space in a corresponding L1 cache 320 to perform load and store operations. Each SPM 310 also has access to the L2 caches within the partition units 215, which are shared among all GPCs 208 and may be used to transfer data between threads. Finally, SPMs 310 also have access to off-chip "global" memory, which can include, e.g., parallel processing memory 204 and/or system memory 104. An L2 cache may be used to store data that is written to and read from global memory. It is to be understood that any memory external to parallel processing unit 202 may be used as global memory.
In graphics applications, a GPC 208 may be configured such that each SPM 310 is coupled to a texture unit 315 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read via memory interface 214 and is fetched from an L2 cache, parallel processing memory 204, or system memory 104, as needed. Texture unit 315 may be configured to store the texture data in an internal cache. In some embodiments, texture unit 315 is coupled to L1 cache 320, and the texture data is stored in L1 cache 320. Each SPM 310 outputs processed tasks to work distribution crossbar 330 in order to provide the processed task to another GPC 208 for further processing via crossbar unit 210, or to store the processed task in an L2 cache, parallel processing memory 204, or system memory 104. A preROP (pre-raster operations) unit 325 is configured to receive data from SPM 310, direct the data to ROP units within the partition units 215, perform optimizations for color blending, organize pixel color data, and perform address translations.
It will be appreciated that the core architecture described herein is illustrative and that variations and modifications are possible. Any number of processing engines, e.g., primitive engines 304, SPMs 310, texture units 315, or preROPs 325, may be included within a GPC 208. Further, while only one GPC 208 is shown, a parallel processing unit 202 may include any number of GPCs 208, which are advantageously functionally similar to one another so that execution behavior does not depend on which GPC 208 receives a particular processing task. Further, each GPC 208 advantageously operates independently of the other GPCs 208, using separate and distinct processing engines, L1 caches 320, and so on.
Fig. 3B is a block diagram of a partition unit 215 within one of the parallel processing units 202 of Fig. 2, according to one embodiment of the invention. As shown, partition unit 215 includes an L2 cache 350, a frame buffer (FB) 355, and a raster operations unit (ROP) 360. L2 cache 350 is a read/write cache configured to perform load and store operations received from crossbar unit 210 and ROP 360. Read misses and urgent writeback requests are output by L2 cache 350 to frame buffer 355 for processing. Dirty updates are also sent to frame buffer 355 for opportunistic processing. Frame buffer 355 interfaces directly with parallel processing memory 204, outputting read and write requests and receiving data read from parallel processing memory 204.
In graphics applications, ROP 360 is a processing unit that performs raster operations, such as stencil, z test, blending, and the like, and outputs pixel data as processed graphics data for storage in graphics memory. In some embodiments of the invention, ROP 360 is included within each GPC 208 rather than within partition unit 215, and pixel read and write requests are transmitted over crossbar unit 210 instead of pixel fragment data.
The processed graphics data may be displayed on display device 110 or routed for further processing by CPU 102 or by one of the processing entities within parallel processing subsystem 112. Each partition unit 215 includes a ROP 360 in order to distribute processing of the raster operations. In some embodiments, ROP 360 may be configured to compress z or color data that is written to memory and decompress z or color data that is read from memory.
Persons skilled in the art will understand that the architecture described in Figs. 1, 2, 3A, and 3B in no way limits the scope of the present invention and that the techniques taught herein may be implemented on any properly configured processing unit, including, without limitation, one or more CPUs, one or more multi-core CPUs, one or more parallel processing units 202, one or more GPCs 208, one or more graphics or special-purpose processing units, or the like, without departing from the scope of the present invention.
Deadlock avoidance
When communication path 113 is a PCIe bus, write requests pending on the PCIe bus can block completions returning from system memory 104 from reaching parallel processing subsystem 202. Deadlock occurs when parallel processing subsystem 202 needs a completion before a pending write request can be processed. Embodiments of the invention provide a technique for routing completions over a PCIe virtual channel (VC) separate from the virtual channel used to send the write requests. Completions therefore cannot be blocked from reaching parallel processing subsystem 202, and deadlock is avoided.
Fig. 4 is a block diagram of a computer system 400 configured to avoid deadlock, according to one embodiment of the invention. As shown, computer system 400 includes a CPU 102 integrated with a memory bridge 105, a system memory 104, a Peripheral Component Interconnect Express (PCIe) bus 401, and a parallel processing subsystem 202. CPU 102 is coupled to system memory 104 through memory bridge 105. CPU 102 is also coupled to parallel processing subsystem 202 through memory bridge 105 and PCIe bus 401. CPU 102 can access memory units within parallel processing subsystem 202 through memory bridge 105 and PCIe bus 401. Likewise, parallel processing subsystem 202 accesses system memory 104 through PCIe bus 401 and memory bridge 105.
CPU 102 is the master processor of computer system 400 and is configured to issue requests, including read requests and write requests, to system memory 104 through memory bridge 105. CPU 102 also issues requests to parallel processing subsystem 202 through memory bridge 105 and PCIe bus 401.
Parallel processing subsystem 202 is a coprocessor configured to perform various processing operations for CPU 102, including data compression and decompression. Parallel processing subsystem 202 includes a PCIe interface 402 configured to receive requests from CPU 102 over PCIe bus 401 and to route those requests to different components of parallel processing subsystem 202 for processing. PCIe interface 402 also sends requests to CPU 102 or system memory 104 over PCIe bus 401. PCIe interface 402 routes data on PCIe bus 401 over different virtual channels (VCs), which include VC0 and VC1 (not shown).
Parallel processing subsystem 202 further includes a host 404, clients 406A-406N, an I/O unit 205, a crossbar unit 210, an L2 cache 350, and a parallel processing memory 204. I/O unit 205 allows the parallel processing subsystem to perform memory access operations and includes a memory management unit (MMU) arbiter 408, an MMU 410, a translation lookaside buffer (TLB) 412, and one or more iterators 414.
Host 404 is an engine that allows CPU 102 to access I/O unit 205. Host 404 is coupled to MMU arbiter 408 within I/O unit 205. Host 404 receives requests from CPU 102 and transmits these requests to MMU 410 through MMU arbiter 408. Clients 406A-406N are also coupled to MMU arbiter 408. Clients 406A-406N are engines that perform different functions, including memory management, graphics display, instruction fetching, encryption, texture processing, and video decoding. Clients 406A-406N are configured to issue requests to I/O unit 205.
MMU arbiter 408 arbitrates between host 404 and each of clients 406A-406N and allows these engines to access MMU 410. MMU arbiter 408 examines an engine ID associated with each request received from host 404 and clients 406A-406N; the engine ID indicates whether the request came from CPU 102. When the engine ID indicates that the request came from CPU 102, the request is marked as "special" by setting a "special" bit in the request to 1. The request is then routed to MMU 410.
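A minimal sketch of the arbiter-side marking follows, assuming a request header that carries an engine ID and a one-bit flag; the ID value and field names are hypothetical, not taken from the patent.

```c
#include <stdint.h>

enum { ENGINE_ID_HOST = 0 };   /* assumed ID for requests arriving via host 404 */

typedef struct {
    uint32_t engine_id;
    uint32_t special : 1;      /* the "special" bit described above */
} request_t;

/* MMU arbiter 408: mark CPU-originated traffic before forwarding to the MMU. */
static void mmu_arbiter_mark(request_t *req) {
    if (req->engine_id == ENGINE_ID_HOST)
        req->special = 1;      /* request came from CPU 102 */
}
```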
MMU 410 provides virtual-to-physical address translation for host 404 and clients 406A-406N. When host 404 and/or clients 406A-406N send a request to MMU 410 through MMU arbiter 408, MMU 410 translates the virtual address specified in the request into a physical address. The virtual-to-physical address translation may be accelerated using TLB 412. TLB 412 stores recently accessed virtual-to-physical address mappings. If a received virtual address is included in TLB 412, the physical address associated with that virtual address can be obtained quickly from TLB 412. If TLB 412 does not store the needed virtual-to-physical mapping, MMU 410 issues a read request to retrieve the page table that includes the needed virtual-to-physical address mapping.
A read request issued as the direct result of another request is referred to hereinafter as a "derived read request." If the original request that caused the derived read request to be generated is marked as special, MMU 410 marks the derived read request as special. MMU 410 sends the derived read request to PCIe interface 402. When the derived read request is not marked as special, PCIe interface 402 routes the derived read request on VC0; when the derived read request is marked as special, PCIe interface 402 routes the derived read request on VC1. Completions for requests not marked as special return on VC0, and completions for requests marked as special return on VC1. When the data associated with the request is received from system memory 104 along with the completion, processing of the original request continues.
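The two rules in this paragraph, namely that a derived read inherits its parent's marking and that the special bit selects the virtual channel, can be sketched as follows, again with assumed names.

```c
#include <stdint.h>

typedef enum { VC0 = 0, VC1 = 1 } vc_t;

typedef struct {
    uint64_t addr;
    uint32_t special : 1;
} request_t;

/* A derived read (e.g. a page-table fetch on a TLB miss) inherits the
 * marking of the request that caused it. */
static request_t make_derived_read(const request_t *parent, uint64_t pte_addr) {
    request_t derived = { pte_addr, parent->special };
    return derived;
}

/* PCIe interface 402: special traffic goes out on VC1, everything else
 * on VC0; the completion returns on the same channel. */
static vc_t select_vc(const request_t *req) {
    return req->special ? VC1 : VC0;
}
```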
MMU 410 sends the request and the physical address to one of the iterators 414. The iterator 414 translates the physical address into a crossbar raw address and sends the request, with the crossbar raw address, to crossbar unit 210. Crossbar unit 210 then routes the request to L2 cache 350.
L2 cache 350 is a low-latency memory unit that stores data that may be needed by I/O unit 205. L2 cache 350 includes a compression/decompression unit (not shown) that allows parallel processing subsystem 202 to compress and decompress data received from system memory 104 or stored in L2 cache 350. L2 cache 350 includes a tag store (not shown) containing tags that indicate the compression state of recently accessed cache lines of L2 cache 350.
When L2 cache 350 receives a write request targeting a particular cache line within L2 cache 350, L2 cache 350 uses the tag store to determine whether the target cache line is compressed. When the tag store does not include the compression state of the cache line indicated by the request, L2 cache 350 generates a derived read request to access a backing store (not shown) in system memory 104. L2 cache 350 sends the derived read request through crossbar unit 210 to PCIe interface 402. PCIe interface 402 determines whether the derived read request is marked as special and routes the derived read request on PCIe bus 401 accordingly. When system memory 104 returns the completion associated with the derived read request, the completion is sent on VC1 if the derived read request was marked as special, thereby avoiding a deadlock condition when write requests are pending on PCIe bus 401.
If the backing store indicates that the target cache line is compressed, L2 cache 350 decompresses the target cache line, merges the decompressed data with the data included in the write request, and writes the decompressed, merged data back into the cache line in L2 cache 350. L2 cache 350 may also update the tag store to include the compression state of the recently accessed cache line. In one embodiment, the merged data may be compressed again. When data is decompressed, it is stored in its decompressed form. The tag store indicates whether a tile is compressed and therefore needs to be decompressed, or whether it can be written directly without first being decompressed.
L2 cache 350 may also receive a write request that specifies a compressed region, or "compression tile," of system memory 104 to which CPU 102 needs to write data. Typically, compression tiles originate from parallel processing subsystem 202, although in one embodiment CPU 102 generates the compression tiles. L2 cache 350 receives the write request and generates a derived read request to access system memory 104 and read the compression tile from system memory 104. L2 cache 350 sends the derived read request through crossbar unit 210 to PCIe interface 402. PCIe interface 402 determines whether the derived read request is marked as special and routes the derived read request on PCIe bus 401 accordingly. If the derived read request is marked as special, system memory 104 returns the completion associated with the derived read request on VC1, thereby avoiding the deadlock condition that could otherwise occur when write requests are pending on PCIe bus 401. L2 cache 350 receives the compressed data returned by the derived read request, decompresses the compressed data, merges the write data with the decompressed data, compresses the merged data, and writes the compressed, merged data back to system memory 104.
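Both the cache-line case and the compression-tile case follow the same read-modify-write pattern. The sketch below assumes byte-granularity merging, a 4 KB upper bound on the decompressed size, and identity codec stubs, since the patent does not specify a compression format.

```c
#include <stddef.h>
#include <string.h>

/* Identity "codec" stubs; a real unit would implement the actual format. */
static size_t tile_decompress(const void *src, size_t n, void *dst) {
    memcpy(dst, src, n);
    return n;
}
static size_t tile_compress(const void *src, size_t n, void *dst) {
    memcpy(dst, src, n);
    return n;
}

/* Read-modify-write of a compressed region: the region was fetched by a
 * special derived read (completion on VC1); decompress it, merge the
 * CPU's write data, recompress, and write the result back. */
static void merge_write(void *tile, size_t tile_len,
                        size_t offset, const void *data, size_t len) {
    unsigned char plain[4096];              /* assumed max tile size */
    size_t n = tile_decompress(tile, tile_len, plain);
    if (offset + len <= n)
        memcpy(plain + offset, data, len);  /* merge the write data  */
    tile_compress(plain, n, tile);          /* write back compressed */
}
```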
Marking the requests sent by CPU 102 as special, and also marking as special the derived read requests generated in response to those requests, allows deadlock to be avoided, because the completions associated with requests marked as special are sent on VC1 rather than VC0. Using standard (non-PCIe) bus protocol techniques, a request could alternatively be marked as "relaxed ordering" or otherwise marked to indicate that its completion may be returned without regard to ordering rules. Although this technique has been described above with reference to particular circumstances in which deadlock can occur, persons skilled in the art will recognize that marking the requests sent by CPU 102 as special, and also marking as special the derived read requests generated in response to those requests, allows deadlock to be avoided whenever PCIe bus 401 has pending write requests.
Fig. 5 is a flow diagram of method steps for avoiding deadlock, according to one embodiment of the invention. Persons skilled in the art will understand that, although the method 500 is described in conjunction with the systems of Figs. 1-4, any system configured to perform the method steps, in any order, is within the scope of the invention.
As shown, method 500 begins at step 502, where MMU arbiter 408 receives a request from host 404 or one of clients 406A-406N. The request may be a read request or a write request. In addition, the request may target L2 cache 350, parallel processing memory 204, or system memory 104. At step 504, MMU arbiter 408 examines the engine ID associated with the request. The engine ID indicates the source of the request. For example, the request may have been issued by one of clients 406A-406N or, alternatively, by CPU 102 through host 404. At step 506, MMU arbiter 408 determines whether CPU 102 issued the request. If the engine ID indicates that CPU 102 issued the request, method 500 proceeds to step 508. If, at step 506, MMU arbiter 408 determines that CPU 102 did not issue the request, method 500 proceeds to step 518.
At step 508, MMU arbiter 408 marks the request as special. MMU arbiter 408 is configured to set a bit in the request to "1" to indicate that the request was issued by CPU 102. At step 510, the request causes a derived read request to be generated. Requests cause derived read requests to be generated under various circumstances. For example, when MMU 410 needs to read system memory 104 in order to perform virtual-to-physical address translation, MMU 410 generates a derived read request targeting system memory 104. Alternatively, when L2 cache 350 needs to read system memory 104 in order to determine the compression state of a cache line, L2 cache 350 generates a derived read request targeting system memory 104. Various other situations in which an original request causes a derived read request to be generated are possible.
At step 512, the derived read request is marked as special. When MMU 410 generates the derived read request, MMU 410 marks the derived read request as special. When L2 cache 350 generates the derived read request, L2 cache 350 marks the derived read request as special. If another component of parallel processing subsystem 202 generates the derived read request, that component marks the derived read request as special. At step 514, PCIe interface 402 receives and examines the request. The request may be the derived read request or, alternatively, a different request.
At step 516, PCIe interface 402 determines whether the request is marked as special. If the request is not marked as special, method 500 proceeds to step 518, where PCIe interface 402 routes the request, and the completion associated with the request, over PCIe bus 401 on VC0. If the request is marked as special, method 500 proceeds to step 520, where PCIe interface 402 routes the request, and the completion associated with the request, over PCIe bus 401 on VC1. The method then terminates.
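The following is a runnable walkthrough of method 500, with the Fig. 5 step numbers in comments; the engine-ID value and the types are the same illustrative assumptions used in the earlier sketches.

```c
#include <stdint.h>
#include <stdio.h>

typedef enum { VC0 = 0, VC1 = 1 } vc_t;
typedef struct { uint32_t engine_id; uint32_t special : 1; } request_t;

enum { ENGINE_ID_HOST = 0 };    /* assumed ID for the CPU path */

int main(void) {
    request_t req = { ENGINE_ID_HOST, 0 };        /* step 502: receive request */
    if (req.engine_id == ENGINE_ID_HOST)          /* steps 504-506: check ID   */
        req.special = 1;                          /* step 508: mark special    */
    request_t derived = { req.engine_id,
                          req.special };          /* steps 510-512: derive     */
    vc_t vc = derived.special ? VC1 : VC0;        /* steps 514-520: route      */
    printf("derived read routed on VC%d\n", (int)vc);
    return 0;
}
```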
In sum, a parallel processing unit (PPU) marks requests received from a central processing unit (CPU) as "special," so that derived read requests generated in response to those requests are also marked as special and are therefore routed on a secondary virtual channel of the Peripheral Component Interconnect Express (PCIe) bus. The PPU transmits requests marked as special over virtual channel (VC) 1 of the PCIe bus. If a request marked as special generates a completion, the completion returns over VC1 of the PCIe bus.
Advantageously, because the completions associated with requests marked as special are sent on a different virtual channel, a completion returning for a request marked as special cannot cause deadlock when write requests are present on VC0.
Thus, embodiments of the invention provide a technique for using a status bit, propagated through the fabric, to identify and mark certain requests (e.g., read and write requests) sent by CPU 102 to parallel processing subsystem 202 that could cause deadlock, and for also using this status bit to mark any derived transactions generated by those requests. In other embodiments, mechanisms defined by a standard bus interface (e.g., "relaxed ordering") can be used to avoid deadlock.
Note that certain transactions sent by parallel processing subsystem 202 to system memory 104 cannot cause deadlock and therefore are not transmitted on the second virtual channel. For example, transactions that serve as synchronization primitives, or that otherwise rely on the ordering rule that completions may not pass write requests, are sent on the first virtual channel. For instance, when releasing a semaphore, parallel processing subsystem 202 guarantees that all write transactions initiated earlier by the CPU have reached the coherence point by the time CPU 102 detects that the write to parallel processing memory 204 has completed.
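This carve-out can be folded into the same hypothetical routing switch; the `is_sync_primitive` flag below is an assumption introduced for illustration, not a field named in the patent:

```c
/* Sketch of the carve-out: transactions serving as synchronization
 * primitives depend on VC0's rule that completions do not pass earlier
 * posted writes, so they stay on VC0 even when they would otherwise
 * qualify as special. */
static void route_with_ordering(const request_t *req, bool is_sync_primitive)
{
    if (is_sync_primitive)
        pcie_send(VC0, req);  /* preserve posted-write ordering */
    else
        pcie_send(req->special ? VC1 : VC0, req);
}
```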
One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define the functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media on which information is permanently stored (e.g., read-only memory devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any type of solid-state non-volatile semiconductor memory); and (ii) writable storage media on which alterable information is stored (e.g., floppy disks within a diskette drive or a hard-disk drive, or any type of solid-state random-access semiconductor memory).
The invention has been described above with reference to specific embodiments. Persons skilled in the art, however, will understand that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (10)

1. A computer system, comprising:
a first processing unit, a second processing unit, a memory bridge, a system memory, and a bus that couples the second processing unit to the first processing unit, the memory bridge, and the system memory via a first virtual channel and a second virtual channel;
wherein the second processing unit includes a bus interface unit configured to: (i) receive a read or write request from the first processing unit via the first virtual channel; and (ii) transmit, on the second virtual channel, a derived read request generated while processing the read or write request.
2. The computer system of claim 1, wherein the second processing unit further includes a memory management unit having a translation lookaside buffer, and the memory management unit generates the derived read request when a miss occurs in the translation lookaside buffer.
3. The computer system of claim 1, further comprising a local memory for the second processing unit, wherein the second processing unit is coupled to the local memory through a cache memory unit, and the cache memory unit generates the derived read request.
4. The computer system of claim 3, wherein the cache memory unit generates the derived read request when the read or write request accesses compression state information that is not stored in the cache memory unit.
5. The computer system of claim 3, wherein the cache memory unit generates the derived read request when the read or write request accesses data in a compressed region of the system memory.
6. The computer system of claim 1, wherein the second processing unit further includes a memory management unit arbiter configured to receive read and write requests from a plurality of clients and, if a read or write request is received from the first processing unit, to mark that read or write request as special.
7. The computer system of claim 6, wherein each of the clients has a client identifier, and the memory management unit arbiter is configured to examine the client identifier associated with each read or write request.
8. The computer system of claim 1, wherein the first processing unit is a central processing unit, the second processing unit is a parallel processing unit, and the bus is a PCIe bus.
9. In a computer system having a first processing unit, a second processing unit, a memory bridge, a system memory, and a bus that couples the second processing unit to the first processing unit, the memory bridge, and the system memory, a method for processing a read or write request in the second processing unit, the method comprising the steps of:
receiving, at the second processing unit, the read or write request via a first virtual channel of the bus;
generating, at the second processing unit, one or more derived read requests while processing the read or write request, and transmitting the derived read requests to the system memory via a second virtual channel of the bus;
receiving, via the second virtual channel of the bus, completions for the derived read requests; and
completing the originally received read or write request.
10. The method of claim 9, wherein the derived read request is generated when, while the read or write request is being processed, a compressed region of the system memory is accessed.
CN200910249698XA 2008-12-12 2009-12-14 Deadlock avoidance by marking CPU traffic as special Pending CN101901198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510605017.4A CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US12/334,394 US8392667B2 (en) 2008-12-12 2008-12-12 Deadlock avoidance by marking CPU traffic as special
US12/334,394 2008-12-12

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201510605017.4A Division CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special

Publications (1)

Publication Number Publication Date
CN101901198A true CN101901198A (en) 2010-12-01

Family

ID=41572725

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201510605017.4A Pending CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special
CN200910249698XA Pending CN101901198A (en) 2008-12-12 2009-12-14 Deadlock avoidance by marking CPU traffic as special

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201510605017.4A Pending CN105302524A (en) 2008-12-12 2009-12-14 Deadlock Avoidance By Marking CPU Traffic As Special

Country Status (6)

Country Link
US (1) US8392667B2 (en)
JP (1) JP5127815B2 (en)
KR (1) KR101086507B1 (en)
CN (2) CN105302524A (en)
DE (1) DE102009047518B4 (en)
GB (1) GB2466106B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539130B2 (en) * 2009-09-24 2013-09-17 Nvidia Corporation Virtual channels for effective packet transfer
US8789170B2 (en) 2010-09-24 2014-07-22 Intel Corporation Method for enforcing resource access control in computer systems
CN102497527B (en) * 2011-12-16 2013-11-27 杭州海康威视数字技术股份有限公司 Multi-processor video processing system and video image synchronous transmission and display method thereof
US9324126B2 (en) * 2012-03-20 2016-04-26 Massively Parallel Technologies, Inc. Automated latency management and cross-communication exchange conversion
US9075952B2 (en) * 2013-01-17 2015-07-07 Intel Corporation Controlling bandwidth allocations in a system on a chip (SoC)
WO2014152800A1 (en) 2013-03-14 2014-09-25 Massively Parallel Technologies, Inc. Project planning and debugging from functional decomposition
US10019375B2 (en) 2016-03-02 2018-07-10 Toshiba Memory Corporation Cache device and semiconductor device including a tag memory storing absence, compression and write state information
US9996471B2 (en) * 2016-06-28 2018-06-12 Arm Limited Cache with compressed data and tag
KR102588143B1 (en) 2018-11-07 2023-10-13 삼성전자주식회사 Storage device including memory controller and method of operating electronic systme including memory
CN111382849B (en) * 2018-12-28 2022-11-22 上海寒武纪信息科技有限公司 Data compression method, processor, data compression device and storage medium
US11296995B2 (en) 2020-08-31 2022-04-05 Micron Technology, Inc. Reduced sized encoding of packet length field
US11360920B2 (en) * 2020-08-31 2022-06-14 Micron Technology, Inc. Mapping high-speed, point-to-point interface channels to packet virtual channels
US11418455B2 (en) 2020-08-31 2022-08-16 Micron Technology, Inc. Transparent packet splitting and recombining
US11412075B2 (en) 2020-08-31 2022-08-09 Micron Technology, Inc. Multiple protocol header processing
US11539623B2 (en) 2020-08-31 2022-12-27 Micron Technology, Inc. Single field for encoding multiple elements

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5696927A (en) * 1995-12-21 1997-12-09 Advanced Micro Devices, Inc. Memory paging system and method including compressed page mapping hierarchy
US6104417A (en) 1996-09-13 2000-08-15 Silicon Graphics, Inc. Unified memory computer architecture with dynamic graphics memory allocation
US6026451A (en) * 1997-12-22 2000-02-15 Intel Corporation System for controlling a dispatch of requested data packets by generating size signals for buffer space availability and preventing a dispatch prior to a data request granted signal asserted
US6349372B1 (en) 1999-05-19 2002-02-19 International Business Machines Corporation Virtual uncompressed cache for compressed main memory
US6950438B1 (en) * 1999-09-17 2005-09-27 Advanced Micro Devices, Inc. System and method for implementing a separate virtual channel for posted requests in a multiprocessor computer system
JP4906226B2 (en) 2000-08-17 2012-03-28 アドバンスト・マイクロ・ディバイシズ・インコーポレイテッド System and method for implementing a separate virtual channel for posted requests in a multiprocessor computer system
US6574708B2 (en) * 2001-05-18 2003-06-03 Broadcom Corporation Source controlled cache allocation
US6807599B2 (en) * 2001-10-15 2004-10-19 Advanced Micro Devices, Inc. Computer system I/O node for connection serially in a chain to a host
US7165131B2 (en) * 2004-04-27 2007-01-16 Intel Corporation Separating transactions into different virtual channels
US20050237329A1 (en) 2004-04-27 2005-10-27 Nvidia Corporation GPU rendering to system memory
US7748001B2 (en) * 2004-09-23 2010-06-29 Intel Corporation Multi-thread processing system for detecting and handling live-lock conditions by arbitrating livelock priority of logical processors based on a predertermined amount of time
US7499452B2 (en) * 2004-12-28 2009-03-03 International Business Machines Corporation Self-healing link sequence counts within a circular buffer
CN100543770C (en) * 2006-07-31 2009-09-23 辉达公司 The special mechanism that is used for the page or leaf mapping of GPU
US20080028181A1 (en) 2006-07-31 2008-01-31 Nvidia Corporation Dedicated mechanism for page mapping in a gpu

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103532876A (en) * 2013-10-23 2014-01-22 中国科学院声学研究所 Processing method and system of data stream
WO2018019009A1 (en) * 2016-07-25 2018-02-01 中兴通讯股份有限公司 Data processing method and system, peripheral component interconnect express device and host
CN109582589A (en) * 2017-09-28 2019-04-05 瑞萨电子株式会社 Semiconductor equipment and memory access method
CN109582589B (en) * 2017-09-28 2023-12-15 瑞萨电子株式会社 Semiconductor device and memory access method
CN109343984A (en) * 2018-10-19 2019-02-15 珠海金山网络游戏科技有限公司 Data processing method, calculates equipment and storage medium at system
CN109343984B (en) * 2018-10-19 2020-05-19 珠海金山网络游戏科技有限公司 Data processing method, system, computing device and storage medium

Also Published As

Publication number Publication date
JP2010140480A (en) 2010-06-24
GB2466106B (en) 2011-03-30
GB2466106A (en) 2010-06-16
US8392667B2 (en) 2013-03-05
CN105302524A (en) 2016-02-03
DE102009047518B4 (en) 2014-07-03
KR20100068225A (en) 2010-06-22
US20100153658A1 (en) 2010-06-17
KR101086507B1 (en) 2011-11-23
JP5127815B2 (en) 2013-01-23
DE102009047518A1 (en) 2010-07-08
GB0920727D0 (en) 2010-01-13

Similar Documents

Publication Publication Date Title
CN101901198A (en) Deadlock avoidance by marking CPU traffic as special
CN101739357B (en) Multi-class data cache policies
CN101714247B (en) Single pass tessellation
CN101751285B (en) Centralized device virtualization layer for heterogeneous processing units
US7516301B1 (en) Multiprocessor computing systems with heterogeneous processors
CN102696023B (en) Unified addressing and instructions for accessing parallel memory spaces
US9024946B2 (en) Tessellation shader inter-thread coordination
US10169072B2 (en) Hardware for parallel command list generation
CN101751344A (en) A compression status bit cache and backing store
US8542247B1 (en) Cull before vertex attribute fetch and vertex lighting
US9589310B2 (en) Methods to facilitate primitive batching
US8698802B2 (en) Hermite gregory patch for watertight tessellation
GB2492653A (en) Simultaneous submission to a multi-producer queue by multiple threads
US8624910B2 (en) Register indexed sampler for texture opcodes
CN103810743A (en) Setting downstream render state in an upstream shader
US9436969B2 (en) Time slice processing of tessellation and geometry shaders
GB2491490A (en) Emitting coherent output from multiple execution threads using the printf command
US8310482B1 (en) Distributed calculation of plane equations
US8570916B1 (en) Just in time distributed transaction crediting
US8704835B1 (en) Distributed clip, cull, viewport transform and perspective correction
US8976185B2 (en) Method for handling state transitions in a network of virtual processing nodes
US9147224B2 (en) Method for handling state transitions in a network of virtual processing nodes
US9542192B1 (en) Tokenized streams for concurrent execution between asymmetric multiprocessors
US8319783B1 (en) Index-based zero-bandwidth clears
US8330766B1 (en) Zero-bandwidth clears

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20101201