DE102013100169A1

DE102013100169A1 - Computer-implemented method for selecting a processor, included in a plurality of processors, to receive work related to a compute task

Info

Publication number: DE102013100169A1
Application number: DE201310100169
Authority: DE
Inventors: Karim M. Abdalla; Lacky V. Shah; Jerome F. Duluk jun.; Timothy John Purcell; Tanmoy Mandal; Gentaro Hirota
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2012-01-18
Filing date: 2013-01-09
Publication date: 2013-07-18
Also published as: CN103218259A; TW201351276A

Abstract

The method involves analyzing state data of each processor in a plurality of processors to identify one or more processors to which a compute task has already been assigned and which are eligible to receive work related to a compute task. An availability value is received from each of the one or more processors, where the availability value indicates the capacity of that processor to receive new work. A processor is then selected to receive work based on the availability values received from the one or more processors.

Description

BACKGROUND OF THE INVENTION

FIELD OF THE INVENTION

The present invention relates generally to compute tasks and, more particularly, to scheduling and executing compute tasks.

DESCRIPTION OF THE RELATED ART

Conventional scheduling of compute tasks for execution in multiprocessor systems relies on an application program or a driver. During execution of the compute tasks, the interaction between the driver and the multiple processors that is needed for the driver to schedule the compute tasks may delay execution of those tasks.

Accordingly, what is needed in the art is a system and method for dynamically scheduling compute tasks for execution based on the processing resources and the priorities of the available compute tasks. Importantly, the scheduling mechanism should neither depend on nor require software or driver interaction.

SUMMARY OF THE INVENTION

One embodiment of the present invention sets forth a method for selecting a first processor, included in a plurality of processors, to receive work related to a compute task. The method involves analyzing state data of each processor in the plurality of processors to identify one or more processors to which a compute task has already been assigned and which are eligible to receive work related to a compute task; receiving, from each of the one or more processors identified as eligible, an availability value that indicates the capacity of the processor to receive new work; selecting a first processor to receive work related to the one compute task based on the availability values received from the one or more processors; and issuing, to the first processor via a cooperative thread array (CTA), the work related to the one compute task.
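The selection steps of this embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed hardware mechanism: the `Processor` class, its `free_slots` field, the "largest value wins" rule, and the tie-breaking policy are all hypothetical and do not appear in the patent text.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Processor:
    """Hypothetical stand-in for one processor and its state data."""
    ident: int
    assigned_task: Optional[int]  # task already assigned, or None
    eligible: bool                # may this processor accept more work?
    free_slots: int               # assumed measure of spare capacity

    def availability(self) -> int:
        # Assumption: a larger value means more room for new work.
        return self.free_slots


def select_processor(procs: Sequence[Processor], task_id: int) -> Optional[Processor]:
    """Pick the eligible processor reporting the highest availability value."""
    # Step 1: analyze state data to find processors that already hold the
    # task and are eligible to receive more work for it.
    candidates = [p for p in procs
                  if p.assigned_task == task_id and p.eligible]
    if not candidates:
        return None
    # Steps 2 and 3: collect availability values and choose the largest.
    # Ties resolve to the earliest candidate; this policy is illustrative only.
    return max(candidates, key=lambda p: p.availability())


def issue_cta(proc: Processor, task_id: int) -> dict:
    """Step 4: issue one unit of work (modeled as a CTA record)."""
    proc.free_slots -= 1
    return {"task": task_id, "processor": proc.ident}
```

In this sketch, a processor that is ineligible or holds a different task is never considered, no matter how much spare capacity it reports.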

Another embodiment of the present invention sets forth a method for assigning a compute task to a first processor included in a plurality of processors. The method involves analyzing each compute task in a plurality of compute tasks to identify one or more compute tasks that are eligible for assignment to the first processor, where each compute task is listed in a first table and is associated with a priority value and an allocation order that indicates the time at which the compute task was added to the first table. The technique further involves selecting a first compute task from the identified one or more compute tasks based on the priority value and/or the allocation order, and assigning the first compute task to the first processor for execution.
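A minimal Python sketch of this task selection follows. It assumes that a lower number denotes a higher priority and that the allocation order is a simple insertion counter; the `TaskEntry` and `TaskTable` names are hypothetical, and the text does not specify how priority and allocation order are combined.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskEntry:
    """Hypothetical row of the 'first table' of compute tasks."""
    task_id: int
    priority: int      # assumption: lower number = higher priority
    alloc_order: int   # monotonically increasing stamp set on insertion
    eligible: bool     # eligible for assignment to the target processor


class TaskTable:
    def __init__(self) -> None:
        self._rows: List[TaskEntry] = []
        self._next_stamp = 0

    def add(self, task_id: int, priority: int, eligible: bool = True) -> None:
        """Record a task together with the order in which it was added."""
        self._rows.append(TaskEntry(task_id, priority, self._next_stamp, eligible))
        self._next_stamp += 1

    def select(self) -> Optional[TaskEntry]:
        """Highest priority wins; the oldest entry breaks ties."""
        candidates = [r for r in self._rows if r.eligible]
        if not candidates:
            return None
        return min(candidates, key=lambda r: (r.priority, r.alloc_order))
```

Sorting on the pair `(priority, alloc_order)` is one possible reading of "priority value and/or allocation order"; an implementation could equally use either criterion alone.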

Further embodiments provide a non-transitory computer-readable medium and a computer system for carrying out the respective methods set forth above.

BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above-recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the present invention;

FIG. 2 is a block diagram of a parallel processing subsystem for the computer system of FIG. 1, according to one embodiment of the present invention;

FIG. 3A is a block diagram of the task/work unit of FIG. 2, according to one embodiment of the present invention;

FIG. 3B is a block diagram of a general processing cluster within one of the parallel processing units of FIG. 2, according to one embodiment of the present invention;

FIG. 3C is a block diagram of a portion of the streaming multiprocessor of FIG. 3B, according to one embodiment of the present invention.

FIGS. 4A and 4B illustrate a method for assigning tasks to the streaming multiprocessors (SMs) of FIGS. 3A to 3C, according to one embodiment of the invention.

FIG. 5 illustrates a method for selecting an SM to receive work related to a task, according to one embodiment of the invention.

DETAILED DESCRIPTION

In the following description, numerous specific details are set forth to provide a more thorough understanding of the present invention. However, it will be apparent to one of skill in the art that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.

System Overview

FIG. 1 is a block diagram illustrating a computer system 100 configured to implement one or more aspects of the present invention. Computer system 100 includes a central processing unit (CPU) 102 and a system memory 104 communicating via an interconnection path that may include a memory bridge 105. Memory bridge 105, which may be, e.g., a Northbridge chip, is connected via a bus or other communication path 106 (e.g., a HyperTransport link) to an I/O (input/output) bridge 107. I/O bridge 107, which may be, e.g., a Southbridge chip, receives user input from one or more user input devices 108 (e.g., keyboard, mouse) and forwards the input to CPU 102 via communication path 106 and memory bridge 105. A parallel processing subsystem 112 is coupled to memory bridge 105 via a bus or second communication path 113 (e.g., a Peripheral Component Interconnect (PCI) Express, Accelerated Graphics Port, or HyperTransport link); in one embodiment, parallel processing subsystem 112 is a graphics subsystem that delivers pixels to a display device 110 (e.g., a conventional cathode ray tube or liquid crystal display based monitor). A system disk 114 is also connected to I/O bridge 107. A switch 116 provides connections between I/O bridge 107 and other components such as a network adapter 118 and various add-in cards 120 and 121. Other components (not explicitly shown), including Universal Serial Bus (USB) or other port connections, compact disc (CD) drives, digital video disc (DVD) drives, film recording devices, and the like, may also be connected to I/O bridge 107. The various communication paths shown in FIG. 1, including the specifically named communication paths 106 and 113, may be implemented using any suitable protocols, such as PCI Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s), and connections between different devices may use different protocols, as is known in the art.

In one embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitutes a graphics processing unit (GPU). In another embodiment, the parallel processing subsystem 112 incorporates circuitry optimized for general-purpose processing, while preserving the underlying computational architecture, which is described in greater detail herein. In yet another embodiment, the parallel processing subsystem 112 may be integrated with one or more other system elements in a single subsystem, such as joining the memory bridge 105, CPU 102, and I/O bridge 107 to form a system on chip (SoC).

It will be appreciated that the system shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of CPUs 102, and the number of parallel processing subsystems 112, may be modified as desired. For instance, in some embodiments, system memory 104 is coupled to CPU 102 directly rather than through a bridge, and other devices communicate with system memory 104 via memory bridge 105 and CPU 102. In other alternative topologies, parallel processing subsystem 112 is connected to I/O bridge 107 or directly to CPU 102, rather than to memory bridge 105. In still other embodiments, I/O bridge 107 and memory bridge 105 may be integrated into a single chip instead of existing as one or more discrete devices. Large embodiments may include two or more CPUs 102 and two or more parallel processing subsystems 112. The particular components shown herein are optional; for instance, any number of add-in cards or peripheral devices might be supported. In some embodiments, switch 116 is eliminated, and network adapter 118 and add-in cards 120, 121 connect directly to I/O bridge 107.

FIG. 2 illustrates a parallel processing subsystem 112, according to one embodiment of the present disclosure. As shown, parallel processing subsystem 112 includes one or more parallel processing units (PPUs) 202, each of which is coupled to a local parallel processing (PP) memory 204. In general, a parallel processing subsystem includes a number U of PPUs, where U ≥ 1. (Herein, multiple instances of like objects are denoted with reference numbers identifying the object and, where needed, parenthetical numbers identifying the instance.) PPUs 202 and parallel processing memories 204 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or memory devices, or in any other technically feasible fashion.

Referring again to FIG. 1 as well as FIG. 2, in some embodiments, some or all of the PPUs 202 in parallel processing subsystem 112 are graphics processors with rendering pipelines that can be configured to perform various operations related to generating pixel data from graphics data supplied by CPU 102 and/or system memory 104 via memory bridge 105 and the second communication path 113, interacting with local parallel processing memory 204 (which can be used as graphics memory including, e.g., a conventional frame buffer) to store and update pixel data, delivering pixel data to display device 110, and the like. In some embodiments, parallel processing subsystem 112 may include one or more PPUs 202 that operate as graphics processors and one or more other PPUs 202 that are used for general-purpose computations. The PPUs may be identical or different, and each PPU may or may not have its own dedicated parallel processing memory device(s). One or more PPUs 202 in parallel processing subsystem 112 may output data to display device 110, or each PPU 202 in parallel processing subsystem 112 may output data to one or more display devices 110.

In operation, CPU 102 is the master processor of computer system 100, controlling and coordinating the operations of other system components. In particular, CPU 102 issues commands that control the operation of PPUs 202. In some embodiments, CPU 102 writes a stream of commands for each PPU 202 to a data structure (not explicitly shown in either FIG. 1 or FIG. 2) that may be located in system memory 104, parallel processing memory 204, or another storage location accessible to both CPU 102 and PPU 202. A pointer to each data structure is written to a pushbuffer to initiate processing of the stream of commands in the data structure. The PPU 202 reads command streams from one or more pushbuffers and then executes the commands asynchronously relative to the operation of CPU 102. Execution priorities may be specified for each pushbuffer by an application program via the device driver 103 to control scheduling of the different pushbuffers.
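The pushbuffer handoff described above can be modeled in Python as follows. This is a rough sketch only: the `memory` dictionary, the `PushBuffer` class, and the rule that the most urgent pushbuffer is drained first are assumptions made for illustration, and the model runs sequentially rather than asynchronously.

```python
from collections import deque
from typing import Deque, Dict, List

# Hypothetical shared memory: address -> command stream written by the CPU.
memory: Dict[int, List[str]] = {}


class PushBuffer:
    def __init__(self, priority: int) -> None:
        self.priority = priority            # assumption: larger = more urgent
        self.pointers: Deque[int] = deque()

    def submit(self, address: int, commands: List[str]) -> None:
        """CPU side: store the stream, then publish a pointer to it."""
        memory[address] = list(commands)
        self.pointers.append(address)


def ppu_drain(buffers: List[PushBuffer]) -> List[str]:
    """PPU side: follow pointers, most urgent pushbuffer first."""
    executed: List[str] = []
    for buf in sorted(buffers, key=lambda b: -b.priority):
        while buf.pointers:
            executed.extend(memory[buf.pointers.popleft()])
    return executed
```

The point of the indirection is that the CPU only publishes a pointer; the command stream itself stays in memory that both sides can reach, so the consumer can process it on its own schedule.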

Referring back now to FIG. 1 as well as FIG. 2, each PPU 202 includes an I/O (input/output) unit 205 that communicates with the rest of computer system 100 via communication path 113, which connects to memory bridge 105 (or, in an alternative embodiment, directly to CPU 102). The connection of PPU 202 to the rest of computer system 100 may also be varied. In some embodiments, parallel processing subsystem 112 is implemented as an add-in card that can be inserted into an expansion slot of computer system 100. In other embodiments, a PPU 202 can be integrated on a single chip with a bus bridge, such as memory bridge 105 or I/O bridge 107. In still other embodiments, some or all elements of PPU 202 may be integrated on a single chip with CPU 102.

In one embodiment, communication path 113 is a PCI Express link in which dedicated lanes are allocated to each PPU 202, as is known in the art. Other communication paths may also be used. An I/O unit 205 generates packets (or other signals) for transmission on communication path 113 and also receives all incoming packets (or other signals) from communication path 113, directing the incoming packets to the appropriate components of PPU 202. For example, commands related to processing tasks may be directed to a host interface 206, while commands related to memory operations (e.g., reading from or writing to parallel processing memory 204) may be directed to a memory crossbar unit 210. Host interface 206 reads each pushbuffer and outputs the command stream stored in the pushbuffer to a front end 212.

Each PPU 202 advantageously implements a highly parallel processing architecture. As shown in detail, PPU 202(0) includes a processing cluster array 230 that includes a number C of general processing clusters (GPCs) 208, where C ≥ 1. Each GPC 208 is capable of executing a large number (e.g., hundreds or thousands) of threads concurrently, where each thread is an instance of a program. In various applications, different GPCs 208 may be allocated for processing different types of programs or for performing different types of computations. The allocation of GPCs 208 may vary depending on the workload arising for each type of program or computation.

GPCs 208 receive processing tasks to be executed from a work distribution unit within a task/work unit 207. The work distribution unit receives pointers to processing tasks that are encoded as task metadata (TMD) and stored in memory. The pointers to TMDs are included in the command stream that is stored as a pushbuffer and received by the front end unit 212 from the host interface 206. Processing tasks that may be encoded as TMDs can include indices within an array of data to be processed, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). The task/work unit 207 receives tasks from the front end 212 and ensures that GPCs 208 are configured to a valid state before the processing specified by each of the TMDs is initiated. A priority may be specified for each TMD and is used to schedule execution of the processing task. Processing tasks can also be received from the processing cluster array 230. Optionally, the TMD can include a parameter that controls whether the TMD is added to the head or the tail of a list of processing tasks (or a list of pointers to the processing tasks), thereby providing another level of control over priority.
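
The combination of a per-TMD priority and a head/tail parameter can be illustrated with a minimal Python sketch. It is not the patented hardware logic: the names TMD, TaskLists, add_to_head and next_task are hypothetical, and the convention that a lower number means a more urgent priority is an assumption made only for this illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TMD:
    """Hypothetical stand-in for task metadata; only the fields used here."""
    name: str
    priority: int              # assumption: lower number = more urgent
    add_to_head: bool = False  # models the optional head/tail parameter


class TaskLists:
    """One list of pending tasks per priority level (illustrative only)."""

    def __init__(self):
        self.lists = {}  # priority -> deque of TMDs

    def add(self, tmd):
        queue = self.lists.setdefault(tmd.priority, deque())
        if tmd.add_to_head:
            queue.appendleft(tmd)  # runs before tasks already waiting
        else:
            queue.append(tmd)      # runs after tasks already waiting

    def next_task(self):
        # Take the first task from the most urgent non-empty list.
        for priority in sorted(self.lists):
            if self.lists[priority]:
                return self.lists[priority].popleft()
        return None


lists = TaskLists()
lists.add(TMD("a", priority=1))
lists.add(TMD("b", priority=1))
lists.add(TMD("c", priority=1, add_to_head=True))
lists.add(TMD("d", priority=0))
order = [lists.next_task().name for _ in range(4)]
print(order)
```

Within one priority level, the head/tail flag changes only the order among tasks of equal priority, which is the second level of control the paragraph above describes.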

Memory interface 214 includes a number D of partition units 215 that are each directly coupled to a portion of parallel processing memory 204, where D ≥ 1. As shown, the number of partition units 215 generally equals the number of dynamic random access memories (DRAMs) 220. In other embodiments, the number of partition units 215 need not equal the number of memory devices. Persons of ordinary skill in the art will appreciate that DRAM 220 may be replaced with any other suitable storage devices and can be of generally conventional design. A detailed description is therefore omitted. Render targets, such as frame buffers or texture maps, may be stored across DRAMs 220, allowing partition units 215 to write portions of each render target in parallel to efficiently use the available bandwidth of parallel processing memory 204.
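
To make the idea of spreading a render target across D partition units concrete, the following Python sketch distributes tiles round-robin. The text does not specify how portions are assigned to partition units, so the modulo mapping and the function names partition_for_tile and split_render_target are assumptions chosen only for illustration.

```python
def partition_for_tile(tile_index, num_partitions):
    """Assumed mapping: tiles are dealt to partition units round-robin."""
    return tile_index % num_partitions


def split_render_target(num_tiles, num_partitions):
    """Return, for each partition unit, the list of tiles it would write."""
    work = [[] for _ in range(num_partitions)]
    for tile in range(num_tiles):
        work[partition_for_tile(tile, num_partitions)].append(tile)
    return work


# With D = 4 partition units, 8 tiles split evenly, so all four
# partition units can write their share in parallel.
print(split_render_target(8, 4))
```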

Any one of GPCs 208 may process data to be written to any of the DRAMs 220 within parallel processing memory 204. Crossbar unit 210 is configured to route the output of each GPC 208 to the input of any partition unit 215 or to any GPC 208 for further processing. GPCs 208 communicate with memory interface 214 through crossbar unit 210 to read from or write to various external memory devices. In one embodiment, crossbar unit 210 has a connection to memory interface 214 to communicate with I/O unit 205, as well as a connection to local parallel processing memory 204, thereby enabling the processing cores within the different GPCs 208 to communicate with system memory 104 or other memory that is not local to PPU 202. In the embodiment shown in FIG. 2B, crossbar unit 210 is directly connected with I/O unit 205. Crossbar unit 210 may use virtual channels to separate traffic streams between the GPCs 208 and partition units 215.

Again, GPCs 208 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to linear and nonlinear data transforms, filtering of video and/or audio data, modeling operations (e.g., applying laws of physics to determine position, velocity, and other attributes of objects), image rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or pixel shader programs), and so on. PPUs 202 may transfer data from system memory 104 and/or local parallel processing memories 204 into internal (on-chip) memory, process the data, and write result data back to system memory 104 and/or local parallel processing memories 204, where such data can be accessed by other system components, including CPU 102 or another parallel processing subsystem 112.

A PPU 202 may be provided with any amount of local parallel processing memory 204, including no local memory, and may use local memory and system memory in any combination. For instance, a PPU 202 can be a graphics processor in a unified memory architecture (UMA) embodiment. In such embodiments, little or no dedicated graphics (parallel processing) memory would be provided, and PPU 202 would use system memory exclusively or almost exclusively. In UMA embodiments, a PPU 202 may be integrated into a bridge chip or processor chip or provided as a discrete chip with a high-speed link (e.g., PCI Express) connecting the PPU 202 to system memory via a bridge chip or other communication means.

As noted above, any number of PPUs 202 can be included in a parallel processing subsystem 112. For instance, multiple PPUs 202 can be provided on a single add-in card, or multiple add-in cards can be connected to communication path 113, or one or more of PPUs 202 can be integrated into a bridge chip. PPUs 202 in a multi-PPU system may be identical to or different from one another. For instance, different PPUs 202 might have different numbers of processing cores, different amounts of local parallel processing memory, and so on. Where multiple PPUs 202 are present, those PPUs may be operated in parallel to process data at a higher throughput than is possible with a single PPU 202. Systems incorporating one or more PPUs 202 may be implemented in a variety of configurations and form factors, including desktop, laptop, or handheld personal computers, servers, workstations, game consoles, embedded systems, and the like.

Multiple Concurrent Task Scheduling

Multiple processing tasks may be executed concurrently on the GPCs 208, and a processing task may generate one or more "child" processing tasks during execution. The task/work unit 207 receives the tasks and dynamically schedules the processing tasks and child processing tasks for execution by the GPCs 208.

FIG. 3A is a block diagram of the task/work unit 207 of FIG. 2, according to one embodiment of the present invention. The task/work unit 207 includes a task management unit 300, the work distribution unit 340, and state 304 (the contents of which are described in detail below in conjunction with FIGS. 4A and 4B). The task management unit 300 organizes tasks to be scheduled based on execution priority levels. For each priority level, the task management unit 300 stores a list of pointers to the TMDs 322 corresponding to the tasks in the scheduler table 321, where the list may be implemented as a linked list. The TMDs 322 may be stored in the PP memory 204 or system memory 104. The rate at which the task management unit 300 accepts tasks and stores the tasks in the scheduler table 321 is decoupled from the rate at which the task management unit 300 schedules tasks for execution. Therefore, the task management unit 300 may collect several tasks before scheduling the tasks. Each TMD 322 includes state 324 that is relevant to the manner in which the TMD 322 is handled within the PPU 202, as described in further detail herein.

The work distribution unit 340 includes a task table 345 with slots that may each be occupied by the TMD 322 for a task that is being executed. The task management unit 300 may schedule tasks for execution when there is a free slot in the task table 345. When there is no free slot, a higher priority task that does not occupy a slot may evict a lower priority task that does occupy a slot. When a task is evicted, the task is stopped, and if execution of the task is not complete, a pointer to the task is added to a list of task pointers to be executed so that execution of the task can resume at a later time. In some embodiments, the place at which to resume the task is stored in the task's TMD 322. When a child processing task is generated during execution of a task, a pointer to the child task is added to the list of task pointers to be scheduled. A child task may be generated by a TMD 322 executing in the processing cluster array 230. The work distribution unit 340 also includes streaming multiprocessor (SM) state 342, which stores state data for each SM 310 included in the PPU 202, as described in further detail herein.
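
The slot and eviction behavior described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the hardware implementation: the class name TaskTable, the dictionary keys name, priority and done, the choice to evict the least urgent occupant, and the convention that a lower number means a higher priority are all hypothetical.

```python
class TaskTable:
    """Illustrative model of a fixed number of task slots with eviction."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.pending = []  # evicted, unfinished tasks to resume later

    def schedule(self, task):
        """Try to place a task; return True if it now occupies a slot."""
        # First choice: any free slot.
        for i, occupant in enumerate(self.slots):
            if occupant is None:
                self.slots[i] = task
                return True
        # No free slot: consider the least urgent occupant as the victim
        # (assumption: a larger number means a lower priority).
        victim_index = max(range(len(self.slots)),
                           key=lambda i: self.slots[i]["priority"])
        victim = self.slots[victim_index]
        if victim["priority"] > task["priority"]:
            if not victim["done"]:
                self.pending.append(victim)  # keep it so it can resume
            self.slots[victim_index] = task
            return True
        return False  # nothing less urgent to evict


table = TaskTable(num_slots=2)
a = {"name": "a", "priority": 2, "done": False}
b = {"name": "b", "priority": 3, "done": False}
c = {"name": "c", "priority": 1, "done": False}
table.schedule(a)
table.schedule(b)
table.schedule(c)  # table is full, so c evicts b, the least urgent task
print([t["name"] for t in table.slots], [t["name"] for t in table.pending])
```

Keeping the evicted task on a pending list mirrors the statement that a pointer to an unfinished task is retained so that execution can resume at a later time.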

Unlike a task that is received by the task/work unit 207 from the front end 212, child tasks are received from the processing cluster array 230. Child tasks are not inserted into pushbuffers or transmitted to the front end. The CPU 102 is not notified when a child task is generated or when data for the child task is stored in memory. Another difference between the tasks that are provided through pushbuffers and child tasks is that the tasks provided through the pushbuffers are defined by the application program, whereas the child tasks are dynamically generated during execution of the tasks.

Task Processing Overview

FIG. 3B is a block diagram of a GPC 208 within one of the PPUs 202 of FIG. 2, according to one embodiment of the present invention. Each GPC 208 may be configured to execute a large number of threads in parallel, where the term "thread" refers to an instance of a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each of the GPCs 208. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Persons of ordinary skill in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime.
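
One common way to picture divergent execution paths under a single instruction unit is a per-thread active mask: each side of a branch is issued once for the whole group, and only the threads whose mask bit is set commit a result. The Python sketch below models that idea. The masking scheme is an assumption made for illustration (the text does not say how divergence is handled), and simt_execute and the two arithmetic branches are hypothetical.

```python
def simt_execute(inputs):
    """Run a two-way branch over a group of threads with an active mask.

    Returns the per-thread results and how many branch paths had to be
    issued by the shared instruction unit.
    """
    n = len(inputs)
    results = [None] * n
    taken = [x % 2 == 0 for x in inputs]  # per-thread branch condition
    paths_issued = 0

    if any(taken):  # "then" path: issued once, masked per thread
        paths_issued += 1
        for i in range(n):
            if taken[i]:
                results[i] = inputs[i] // 2

    if not all(taken):  # "else" path: issued once, masked per thread
        paths_issued += 1
        for i in range(n):
            if not taken[i]:
                results[i] = 3 * inputs[i] + 1

    return results, paths_issued


print(simt_execute([2, 4, 6, 8]))  # all threads agree: one path issued
print(simt_execute([1, 2, 3, 4]))  # threads diverge: both paths issued
```

When every thread takes the same path, the group behaves like SIMD, which is consistent with the remark that SIMD is a functional subset of SIMT.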

Operation of GPC 208 is advantageously controlled via a pipeline manager 305 that distributes processing tasks to streaming multiprocessors (SMs) 310. Pipeline manager 305 may also be configured to control a work distribution crossbar 330 by specifying destinations for processed data output by SMs 310.

In one embodiment, each GPC 208 includes a number M of SMs 310, where M ≥ 1, each SM 310 configured to process one or more thread groups. Also, each SM 310 advantageously includes an identical set of functional execution units (e.g., execution units and load-store units, shown as exec units 302 and LSUs 303 in FIG. 3C) that may be pipelined, allowing a new instruction to be issued before a previous instruction has finished, as is known in the art. Any combination of functional execution units may be provided. In one embodiment, the functional units support a variety of operations including integer and floating-point arithmetic (e.g., addition and multiplication), comparison operations, Boolean operations (AND, OR, XOR), bit-shifting, and computation of various algebraic functions (e.g., planar interpolation, trigonometric, exponential, and logarithmic functions); and the same functional unit hardware can be leveraged to perform different operations.

The series of instructions transmitted to a particular GPC 208 constitutes a thread, as previously defined herein, and the collection of a certain number of concurrently executing threads across the parallel processing engines (not shown) within an SM 310 is referred to herein as a "warp" or "thread group." As used herein, a "thread group" refers to a group of threads concurrently executing the same program on different input data, with one thread of the group being assigned to a different processing engine within an SM 310. A thread group may include fewer threads than the number of processing engines within the SM 310, in which case some processing engines will be idle during cycles when that thread group is being processed. A thread group may also include more threads than the number of processing engines within the SM 310, in which case processing will take place over consecutive clock cycles. Since each SM 310 can support up to G thread groups concurrently, it follows that up to G·M thread groups can be executing in GPC 208 at any given time.
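
The arithmetic in the paragraph above can be checked with a few lines of Python. The function names and all numeric values are hypothetical examples; the text gives no concrete values for G, M, or the number of processing engines.

```python
def passes_needed(threads_in_group, engines):
    """Consecutive passes needed to process one thread group (ceiling)."""
    return -(-threads_in_group // engines)


def idle_engines_in_last_pass(threads_in_group, engines):
    """Engines left without a thread in the final pass."""
    return (-threads_in_group) % engines


def max_thread_groups_per_gpc(g, m):
    """Upper bound G*M on thread groups executing in one GPC."""
    return g * m


# Hypothetical numbers: 8 processing engines per SM.
print(passes_needed(32, 8))              # group larger than the engine count
print(idle_engines_in_last_pass(6, 8))   # group smaller than the engine count
print(max_thread_groups_per_gpc(48, 4))  # G = 48 groups per SM, M = 4 SMs
```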

Additionally, a plurality of related thread groups may be active (in different phases of execution) at the same time within an SM 310. This collection of thread groups is referred to herein as a "cooperative thread array" ("CTA") or "thread array." The size of a particular CTA is m·k, where k is the number of concurrently executing threads in a thread group and is typically an integer multiple of the number of parallel processing engines within the SM 310, and m is the number of thread groups simultaneously active within the SM 310. The size of a CTA is generally determined by the programmer and the amount of hardware resources, such as memory or registers, available to the CTA.
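
As a small worked example of the m·k relation, the Python sketch below computes a CTA size and a resource-limited upper bound on m. The register-based limit is only one assumed way in which hardware resources could bound the CTA size; the function names and numbers are hypothetical.

```python
def cta_size(m, k):
    """CTA size in threads: m thread groups of k threads each."""
    return m * k


def largest_m_that_fits(k, regs_per_thread, regs_available):
    """Assumed resource model: every thread needs a fixed register count."""
    return regs_available // (k * regs_per_thread)


# Hypothetical numbers: k = 32 threads per group, 16 registers per thread,
# 4096 registers available to the CTA.
m = largest_m_that_fits(32, 16, 4096)
print(m, cta_size(m, 32))
```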

Each SM 310 contains a level one (L1) cache (shown in FIG. 3C) or uses space in a corresponding L1 cache outside of the SM 310 that is used to perform load and store operations. Each SM 310 also has access to level two (L2) caches that are shared among all GPCs 208 and may be used to transfer data between threads. Finally, SMs 310 also have access to off-chip "global" memory, which can include, e.g., parallel processing memory 204 or system memory 104. It is to be understood that any memory external to PPU 202 may be used as global memory. Additionally, a level one-point-five (L1.5) cache 335 may be included within the GPC 208, configured to receive and hold data fetched from memory via memory interface 214 as requested by SM 310, including instructions, uniform data, and constant data, and to provide the requested data to SM 310. Embodiments having multiple SMs 310 in GPC 208 beneficially share common instructions and data cached in L1.5 cache 335.

Each GPC 208 may include a memory management unit (MMU) 328 that is configured to map virtual addresses into physical addresses. In other embodiments, MMU(s) 328 may reside within the memory interface 214. The MMU 328 includes a set of page table entries (PTEs) used to map a virtual address to a physical address of a tile and, optionally, a cache line index. The MMU 328 may include address translation lookaside buffers (TLBs) or caches that may reside within multiprocessor SM 310, the L1 cache, or GPC 208. The physical address is processed to distribute surface data access locality to allow efficient request interleaving among partition units 215. The cache line index may be used to determine whether a request for a cache line is a hit or a miss.
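
The translation path through page table entries and a TLB can be sketched generically in Python. This is a textbook-style model, not the MMU 328 itself: the 4096-byte page size, the flat dictionary page table, the unbounded TLB, and the names MMU, translate, tlb_hits and tlb_misses are all assumptions made for illustration, and tiles and cache line indices are not modeled.

```python
PAGE_SIZE = 4096  # hypothetical page size in bytes


class MMU:
    """Illustrative virtual-to-physical translation with a simple TLB."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page -> physical page
        self.tlb = {}                       # cache of recent translations
        self.tlb_hits = 0
        self.tlb_misses = 0

    def translate(self, virtual_address):
        virtual_page, offset = divmod(virtual_address, PAGE_SIZE)
        if virtual_page in self.tlb:
            self.tlb_hits += 1
            physical_page = self.tlb[virtual_page]
        else:
            self.tlb_misses += 1
            # A missing entry raises KeyError, standing in for a fault.
            physical_page = self.page_table[virtual_page]
            self.tlb[virtual_page] = physical_page
        return physical_page * PAGE_SIZE + offset


mmu = MMU({0: 5, 1: 2})
print(mmu.translate(10))    # first access to page 0: TLB miss
print(mmu.translate(20))    # second access to page 0: TLB hit
print(mmu.tlb_hits, mmu.tlb_misses)
```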

In graphics and computing applications, a GPC 208 may be configured such that each SM 310 is coupled to a texture unit 315 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or, in some embodiments, from the L1 cache within SM 310, and is fetched from an L2 cache that is shared among all GPCs 208, parallel processing memory 204, or system memory 104, as needed. Each SM 310 outputs processed tasks to work distribution crossbar 330 in order to provide the processed task to another GPC 208 for further processing or to store the processed task in an L2 cache, parallel processing memory 204, or system memory 104 via crossbar unit 210. A preROP (pre-raster operations) 325 is configured to receive data from SM 310, direct data to ROP units within partition units 215, perform optimizations for color blending, organize pixel color data, and perform address translations.

It will be appreciated that the core architecture described herein is illustrative and that variations and modifications are possible. Any number of processing units, e.g., SMs 310, texture units 315, or preROPs 325, may be included within a GPC 208. Further, as shown in FIG. 2, a PPU 202 may include any number of GPCs 208, which are advantageously functionally similar to one another so that execution behavior does not depend on which GPC 208 receives a particular processing task. Further, each GPC 208 advantageously operates independently of other GPCs 208, using separate and distinct processing units, L1 caches, and so on, to execute tasks for one or more application programs.

Persons of ordinary skill in the art will understand that the architecture described in FIGS. 1, 2, 3A, and 3B in no way limits the scope of the present invention and that the techniques taught herein may be implemented on any properly configured processing unit, including, without limitation, one or more CPUs, one or more multi-core CPUs, one or more PPUs 202, one or more GPCs 208, one or more graphics or special-purpose processing units, or the like, without departing from the scope of the present invention.

In embodiments of the present invention, it is desirable to use the PPU 202 or other processor(s) of a computer system to execute general-purpose computations using thread arrays. Each thread in the thread array is assigned a unique thread identifier ("thread ID") that is accessible to the thread during its execution. The thread ID, which can be defined as a one-dimensional or multi-dimensional numerical value, controls various aspects of the thread's processing behavior. For example, a thread ID may be used to determine which portion of the input data set a thread is to process and/or to determine which portion of an output data set a thread is to produce or write.
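As a rough illustration of how a thread ID can select a portion of the input data set, the following sketch flattens a three-dimensional thread ID and hands each thread one contiguous chunk. The function names, the x-fastest ordering, and the contiguous partitioning are assumptions made for this example only.

```python
def linear_thread_id(x, y, z, dim_x, dim_y):
    """Flatten a 3-D thread ID; x varies fastest (an assumed convention)."""
    return x + y * dim_x + z * dim_x * dim_y


def input_slice(thread_id, num_threads, data):
    """Return the contiguous chunk of data owned by one thread."""
    chunk = -(-len(data) // num_threads)  # ceiling division
    return data[thread_id * chunk:(thread_id + 1) * chunk]


data = list(range(10))
tid = linear_thread_id(1, 1, 0, dim_x=2, dim_y=2)  # 4 threads in a 2x2x1 array
part = input_slice(tid, 4, data)                   # the last thread gets the short tail
```

Other partitionings (strided, tiled) are equally possible; the only point is that the mapping is a pure function of the thread ID.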

A sequence of per-thread instructions may include at least one instruction that defines a cooperative behavior between the representative thread and one or more other threads of the thread array. For example, the sequence of per-thread instructions might include an instruction to suspend execution of operations for the representative thread at a particular point in the sequence until one or more of the other threads reach that particular point, an instruction for the representative thread to store data in a shared memory to which one or more of the other threads have access, an instruction for the representative thread to atomically read and update data stored in a shared memory to which one or more of the other threads have access based on their thread IDs, or the like. The CTA program can also include an instruction to compute an address in the shared memory from which data is to be read, the address being a function of the thread ID. By defining suitable functions and providing synchronization techniques, data can be written to a given location in the shared memory by one thread of a CTA and read from that location by a different thread of the same CTA in a predictable manner. Consequently, any desired pattern of data sharing among threads can be supported, and any thread in a CTA can share data with any other thread in the same CTA. The extent, if any, of data sharing among threads of a CTA is determined by the CTA program; thus, it is to be understood that in a particular application that uses CTAs, the threads of a CTA might or might not actually share data with each other, depending on the CTA program, and the terms "CTA" and "thread array" are used synonymously herein.
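The cooperative pattern described above (write to shared memory, synchronize, then read a location computed from the thread ID) can be sketched on a CPU with ordinary threads. This is only an analogy: `threading.Barrier` stands in for the suspend-until-all-arrive instruction, a Python list stands in for the shared memory, and the name `cta_program` and the neighbor-read pattern are invented for illustration.

```python
import threading

NUM_THREADS = 4
shared = [None] * NUM_THREADS   # stands in for per-CTA shared memory
results = [None] * NUM_THREADS
barrier = threading.Barrier(NUM_THREADS)


def cta_program(tid):
    shared[tid] = tid * tid             # write address is a function of the thread ID
    barrier.wait()                      # suspend until every thread reaches this point
    neighbor = (tid + 1) % NUM_THREADS  # read address is also a function of the thread ID
    results[tid] = shared[neighbor]


threads = [threading.Thread(target=cta_program, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every write happens before the barrier and every read after it, each thread reads its neighbor's value predictably regardless of scheduling order.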

FIG. 3C is a block diagram of the SM 310 of FIG. 3B, according to one embodiment of the present invention. The SM 310 includes an instruction L1 cache 370 that is configured to receive instructions and constants from memory via L1.5 cache 335. A warp scheduler and instruction unit 312 receives instructions and constants from the instruction L1 cache 370 and controls a local register file 304 and the SM 310 functional units according to the instructions and constants. The SM 310 functional units include N exec (execution or processing) units 302 and P load-store units (LSUs) 303.

SM 310 provides on-chip (internal) data storage with different levels of accessibility. Special registers (not shown) are readable but not writable by LSU 303 and are used to store parameters defining the "position" of each thread. In one embodiment, the special registers include one register per thread (or per exec unit 302 within SM 310) that stores a thread ID; each thread ID register is accessible only by a respective one of the exec units 302. The special registers may also include additional registers, readable by all threads that execute the same processing task represented by a TMD 322 (or by all LSUs 303), that store a CTA identifier, the CTA dimensions, the dimensions of a grid to which the CTA belongs (or the queue position if the TMD 322 encodes a queue task instead of a grid task), and an identifier of the TMD 322 to which the CTA is assigned.

If the TMD 322 is a grid TMD, execution of the TMD 322 causes a fixed number of CTAs to be launched and executed to process the fixed amount of data stored in the queue 525. The number of CTAs is specified as the product of the grid width, height, and depth. The fixed amount of data may be stored in the TMD 322, or the TMD 322 may store a pointer to the data that will be processed by the CTAs. The TMD 322 also stores a starting address of the program that is executed by the CTAs.
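The relationship between the grid dimensions and the number of CTAs can be shown with a short sketch. The function name `grid_cta_positions` and the x-fastest enumeration order are assumptions for illustration; the text specifies only that the CTA count is the product of width, height, and depth.

```python
from itertools import product


def grid_cta_positions(width, height, depth):
    """List the (x, y, z) position of every CTA launched for a grid."""
    return [(x, y, z)
            for z, y, x in product(range(depth), range(height), range(width))]


positions = grid_cta_positions(width=4, height=2, depth=3)
num_ctas = len(positions)  # equals width * height * depth
```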

If the TMD 322 is a queue TMD, then a queue feature of the TMD 322 is used, meaning that the amount of data to be processed is not necessarily fixed. Queue entries store data for processing by the CTAs assigned to the TMD 322. The queue entries may also represent a child task that is generated by another TMD 322 during execution of a thread, thereby providing nested parallelism. Typically, execution of the thread, or of the CTA that includes the thread, is suspended until execution of the child task completes. In some embodiments, threads or CTAs that are suspended save their program state, write data to a queue TMD that represents a continuation of the thread or CTA, and then exit to allow other threads or CTAs to run. The queue may be stored in the TMD 322 or separately from the TMD 322, in which case the TMD 322 stores a queue pointer to the queue. Advantageously, data generated by the child task may be written to the queue while the TMD 322 representing the child task is executing. The queue may be implemented as a circular queue so that the total amount of data is not limited to the size of the queue.
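The remark that a circular queue lets the total amount of data exceed the queue size can be made concrete with a small ring buffer. The class below is a generic sketch, not the queue layout used by the TMD 322; the names and the four-slot capacity are assumptions.

```python
class CircularQueue:
    """Fixed-capacity ring buffer; slots are reused once entries are consumed."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0  # total entries consumed so far
        self.tail = 0  # total entries written so far

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # full: the producer must wait
        self.slots[self.tail % len(self.slots)] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # empty
        item = self.slots[self.head % len(self.slots)]
        self.head += 1
        return item


queue = CircularQueue(capacity=4)
consumed = []
for item in range(10):                # 10 items pass through a 4-slot queue
    while not queue.push(item):
        consumed.append(queue.pop())  # the consumer frees a slot
while queue.head != queue.tail:       # drain what is left
    consumed.append(queue.pop())
```

Ten entries flow through four slots in order, because each slot becomes writable again as soon as its previous entry has been consumed.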

CTAs that belong to a grid have implicit grid width, height, and depth parameters indicating the position of the respective CTA within the grid. Special registers are written during initialization in response to commands received via front end 212 from device driver 103 and do not change during execution of a processing task. The front end 212 schedules each processing task for execution. Each CTA is associated with a specific TMD 322 for concurrent execution of one or more tasks. Additionally, a single GPC 208 may execute multiple tasks concurrently.

A parameter memory (not shown), which is bound to the task, stores runtime parameters (constants) that can be read but not written by any thread of that task (or by any LSU 303). In one embodiment, the device driver 103 provides parameters to the parameter memory before directing the SM 310 to begin execution of a task that uses these parameters. Any thread within any CTA (or any exec unit 302 within SM 310) can access global memory through a memory interface 214. Portions of global memory may be stored in the L1 cache 320.

The local register file 304 is used by each thread as scratch space; each register is allocated for the exclusive use of one thread, and data in any register of the local register file 304 is accessible only to the thread to which the register is allocated. The local register file 304 can be implemented as a register file that is physically or logically divided into P lanes, each having some number of entries (where each entry might store, e.g., a 32-bit word). One lane is assigned to each of the N exec units 302 and P load-store units LSU 303, and corresponding entries in different lanes can be populated with data for different threads executing the same program to facilitate SIMD execution. Different portions of the lanes can be allocated to different ones of the G concurrent thread groups, so that a given entry in the local register file 304 is accessible only to a particular thread. In one embodiment, certain entries within the local register file 304 are reserved for storing thread identifiers, implementing one of the special registers. Additionally, an L1 cache 375 stores uniform or constant values for each lane of the N exec units 302 and P load-store units LSU 303.

The shared memory 306 is accessible to all threads (within a single CTA); in other words, any location in the shared memory 306 is accessible to any thread within the same CTA (or to any processing engine within SM 310). The shared memory 306 can be implemented as a shared register file or shared on-chip cache memory with an interconnect that allows any processing engine to read from or write to any location in the shared memory. In other embodiments, the shared state space might map onto a per-CTA region of off-chip memory and be cached in L1 cache 320. The parameter memory can be implemented as a designated section within the same shared register file or shared cache memory that implements the shared memory 306, or as a separate shared register file or on-chip cache memory to which the LSUs 303 have read-only access. In one embodiment, the area that implements the parameter memory is also used to store the CTA ID and task ID, as well as CTA and grid dimensions or queue position, thereby implementing portions of the special registers. Each LSU 303 in SM 310 is coupled to a unified address mapping unit 352 that converts an address provided for load and store instructions, specified in a unified memory space, into an address in each distinct memory space. Consequently, an instruction may be used to access any of the local, shared, or global memory spaces by specifying an address in the unified memory space.
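The conversion performed by the unified address mapping unit 352 can be pictured as a window lookup. The sketch below is purely illustrative: the window boundaries, the fallback to global memory, and the function name `map_unified_address` are assumptions, since the text does not specify how the unified space is laid out.

```python
# Hypothetical window layout; the text does not specify address ranges.
WINDOWS = [
    ("local", 0x0000_0000, 0x0001_0000),
    ("shared", 0x0001_0000, 0x0002_0000),
]


def map_unified_address(addr):
    """Return (space, address within that space) for a unified address."""
    for space, base, limit in WINDOWS:
        if base <= addr < limit:
            return space, addr - base
    return "global", addr  # assumed: everything outside the windows is global


space, offset = map_unified_address(0x0001_0040)
```

With such a mapping, a single load or store instruction form suffices for all three memory spaces, since the target space is derived from the address itself.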

The L1 cache 320 in each SM 310 can be used to cache private per-thread local data and also per-application global data. In some embodiments, the per-CTA shared data may be cached in the L1 cache 320. The LSUs 303 are coupled to the shared memory 306 and the L1 cache 320 via a memory and cache interconnect 380.

Scheduling and Execution of Compute Tasks

FIGS. 4A and 4B illustrate a method 400 for assigning tasks to the SMs 310 of FIGS. 3A to 3C, according to one embodiment of the invention. Although the method steps are described in conjunction with the systems of FIGS. 1 to 3C, persons skilled in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the invention.

As shown, the method 400 begins at step 402, where the WDU 340 determines whether one or more TMDs 322 are included in the task table 345 of FIG. 3A. At step 404, the WDU 340 sets a first SM included in a plurality of SMs (e.g., the SMs 310 included within PPU 202) as a current SM. At step 406, the WDU 340 sets a first TMD 322 included in the task table 345 as a current TMD.

At step 408, the WDU 340 determines whether the task table 345 slot in which the current TMD resides has received a deallocation request. If, at step 408, the WDU 340 determines that the task table slot in which the current TMD resides has received a deallocation request, then the current TMD should not be assigned to any SM 310. Accordingly, the method 400 proceeds to step 428, where the WDU 340 sets a next TMD 322 included in the task table 345 as the current TMD. The method 400 then returns to step 408, described above.

Conversely, if, at step 408, the WDU 340 determines that the task table slot in which the current TMD resides has not received a deallocation request, then the method 400 proceeds to step 410.

At step 410, the WDU 340 determines whether the current TMD includes work that has not yet been issued in a CTA. If, at step 410, the WDU 340 determines that the current TMD does not include any work that has not yet been issued in a CTA, then the method 400 proceeds to step 428, described above. Otherwise, the method 400 proceeds to step 412.
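The checks of steps 406 through 410, together with the advance to the next TMD at step 428, can be sketched as a simple loop over the task table. The `Tmd` fields and the function name are hypothetical, and returning `None` when the table is exhausted is an assumption, since this passage does not describe that case.

```python
from dataclasses import dataclass


@dataclass
class Tmd:
    name: str
    dealloc_requested: bool  # examined at step 408
    unissued_work: int       # examined at step 410


def first_eligible_tmd(task_table):
    """Return the first TMD that would reach step 412, or None (assumed)."""
    for tmd in task_table:          # step 406, then step 428 after each skip
        if tmd.dealloc_requested:   # step 408: slot received a deallocation request
            continue
        if tmd.unissued_work == 0:  # step 410: nothing left to issue in a CTA
            continue
        return tmd                  # method would proceed to step 412
    return None


table = [Tmd("a", True, 5), Tmd("b", False, 0), Tmd("c", False, 3)]
chosen = first_eligible_tmd(table)
```

Here TMD "a" is skipped at step 408 and TMD "b" at step 410, so only TMD "c" continues to step 412.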

In one embodiment, each TMD 322 includes quasi-static state that is set, e.g., by the task management unit 300 and the work distribution unit 340, when the TMD 322 is scheduled for execution. Each TMD 322 also includes dynamic state that is updated as the TMD 322 is executed, e.g., when CTA launches and completions occur for the TMD 322.

The TMD 322 includes many components of state that are relevant to the manner in which the TMD 322 is handled within the PPU 202. In one embodiment, the TMD 322 includes state for tracking the number of work items included in the TMD 322 that have not been completed. In some cases, the TMD 322 may also include state that specifies a minimum number of work items required to be included in each CTA issued to an SM 310 (hereinafter "coalescing rules"), together with state that specifies a threshold amount of time that is allowed to elapse while waiting to accumulate the minimum required number of work items before a CTA is finally launched for execution (hereinafter "coalescing timeout"). If a TMD specifies N work items per CTA, then N items are read by each CTA. For example, a plurality of TMDs may write work items to the queue TMD, where each CTA of the queue TMD processes N work items. This "coalesces" the N separate work items into one CTA. However, the plurality of TMDs might not produce a number of work items that is divisible by N, which leaves a partial set of work items outstanding. To address this, in one embodiment, the TMD includes a timeout value that allows a CTA to be launched with M work items, where M < N. The value of M is provided as an input to the CTA, and the instructions associated with the CTA are written to process either M or N work items, depending on the value of M.
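The interaction between the coalescing rules and the coalescing timeout can be summarized as a small decision function. The function name, the parameter names, and the convention of returning 0 to mean "keep waiting" are assumptions for this sketch; time is expressed in arbitrary units.

```python
def cta_work_item_count(pending, n_per_cta, waited, timeout):
    """Return how many work items the next CTA takes, or 0 to keep waiting."""
    if pending >= n_per_cta:
        return n_per_cta  # a full CTA of N work items can launch
    if pending > 0 and waited >= timeout:
        return pending    # timeout expired: launch with M < N work items
    return 0              # keep accumulating work items


full = cta_work_item_count(pending=10, n_per_cta=4, waited=0, timeout=100)
waiting = cta_work_item_count(pending=3, n_per_cta=4, waited=50, timeout=100)
partial = cta_work_item_count(pending=3, n_per_cta=4, waited=100, timeout=100)
```

The returned count corresponds to the value passed to the CTA as input, so the CTA program can branch on whether it received a full set of N items or a partial set of M items.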

The TMD 322 also includes state that specifies an execution priority level of the TMD 322, e.g., a priority level in the range of 1 to 10, where the lowest number represents the highest execution priority level. The TMD 322 also includes state that indicates whether the slot in the task table 345 in which the TMD 322 resides, after being scheduled by the task management unit 300, is a valid slot, i.e., a slot for which deallocation of the TMD 322 has not been requested. The TMD 322 may also include state for SM affinity rules, which specify the SMs 310 in the PPU 202 to which the TMD 322 may be assigned, as described in detail below in conjunction with FIGS. 4A-4B. Each TMD 322 may also include state that indicates whether the TMD 322 can execute only when the task/work unit 207 is operating in a "throttle" mode, which involves a single CTA having access to all of the shared memory accessible by the SMs 310 included in the PPU 202. In one embodiment, a state of the throttle mode is stored in state 304 and is updated by the WDU 340 when the WDU 340 switches between a throttle mode and a non-throttle mode. Each TMD 322 may also include state that specifies that the TMD 322 is a sequential task and can therefore have at most one CTA "in flight" (i.e., being executed by an SM 310) at any given time.
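As an illustration only, the state components described above can be gathered into a single record. The field names and the helper `may_issue_cta` below are hypothetical; the disclosure does not prescribe a layout, and the sketch merely combines the slot-validity, affinity, throttle, and sequential-task conditions into one check.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass
class TmdState:
    # Hypothetical field names for the TMD state described in the text.
    priority: int                                 # 1..10, lowest number = highest priority
    outstanding_work_items: int = 0               # work items not yet completed
    slot_valid: bool = True                       # False once deallocation is requested
    sm_affinity: Optional[FrozenSet[int]] = None  # None means any SM is allowed
    requires_throttle: bool = False               # may only run in throttle mode
    sequential: bool = False                      # at most one CTA in flight
    ctas_in_flight: int = 0


def may_issue_cta(tmd, sm_id, throttle_active):
    """Check the per-TMD conditions for issuing a CTA to the SM `sm_id`."""
    if not tmd.slot_valid:
        return False
    if tmd.sm_affinity is not None and sm_id not in tmd.sm_affinity:
        return False
    if tmd.requires_throttle and not throttle_active:
        return False
    if tmd.sequential and tmd.ctas_in_flight >= 1:
        return False
    return True
```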

At step 412, the WDU 340 determines whether any TMD 322 in the task table 345 indicates a throttle mode property. If, at step 412, the WDU 340 determines that any TMD indicates a throttle mode property, then the method 400 proceeds to step 414 to determine whether a throttle mode is activated within the task/work unit 207. If, at step 414, the WDU 340 determines that a throttle mode is not activated within the task/work unit 207, then the method 400 proceeds to step 450. As shown, at step 450 the WDU 340 waits until all outstanding TMDs 322 have executed, i.e., TMDs 322 that do not indicate that a throttle mode is to be activated. The method 400 then proceeds to step 452, where the WDU 340 sends throttle state to each of the SMs 310. In one embodiment, the throttle state for each SM 310 includes both a value that indicates the size of a portion of shared memory that the SM 310 is able to access and a base address at which that portion of shared memory begins. Thus, the value that indicates the size of the portion of shared memory increases for each SM 310 when fewer SMs 310 are enabled. Conversely, the value that indicates the size of the portion of shared memory decreases for each SM 310 when more SMs 310 are enabled.
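The relationship between the number of enabled SMs and the per-SM throttle state can be sketched as follows. The equal split of shared memory and the contiguous layout are assumptions made for illustration; the disclosure states only that the size value grows as fewer SMs are enabled and that each SM receives a base address. The function name `throttle_state` is hypothetical.

```python
def throttle_state(total_shared_bytes, enabled_sm_ids):
    """Map each enabled SM to a (base_address, size) pair.

    Assumes an equal, contiguous split of shared memory among the enabled
    SMs, which is an illustrative choice rather than a stated requirement.
    """
    if not enabled_sm_ids:
        return {}
    size = total_shared_bytes // len(enabled_sm_ids)
    return {
        sm_id: (index * size, size)
        for index, sm_id in enumerate(sorted(enabled_sm_ids))
    }


# With fewer SMs enabled, each SM's portion is larger.
four = throttle_state(1024, [0, 1, 2, 3])
two = throttle_state(1024, [0, 2])
```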

At step 454, the WDU 340 activates the throttle mode, whereupon the method 400 proceeds back to step 402. The WDU 340 continues to operate in the throttle mode until step 412 is false, i.e., until the WDU 340 determines that no TMD 322 included in the task table 345 indicates a throttle mode property. Accordingly, the WDU 340 deactivates the throttle mode at step 413, whereupon the method 400 resumes at step 416.

At step 416, the WDU 340 determines whether the current TMD is a sequential task. If, at step 416, the WDU 340 determines that the current TMD is a sequential task, then the method 400 proceeds to step 418, where the WDU 340 determines whether the current TMD has a CTA in flight, i.e., a CTA that is currently being executed by an SM 310. If, at step 418, the WDU 340 determines that the current TMD has a CTA in flight, then the method 400 proceeds to step 428, described above. Otherwise, the method 400 proceeds to step 420, described below.

Referring back now to step 416, if the WDU 340 determines that the current TMD is not a sequential task, then the method 400 proceeds to step 419. At step 419, the WDU 340 determines whether a launch quota of the current TMD 322, if any, has been satisfied. In one embodiment, each TMD 322 includes both a launch quota enabled bit and a launch quota value. If the launch quota enabled bit is set to "true," then the WDU 340 determines whether a number of CTAs equal to the launch quota value has been launched. Accordingly, if, at step 419, the WDU 340 determines that the launch quota of the TMD 322, if any, has been satisfied, then the method 400 proceeds to step 460.

At step 460, the WDU 340 parses the task table 345 and selects a TMD 322 that has the same priority level as the current TMD 322, whereupon the WDU 340 sets the selected TMD 322 as the current TMD 322. The method 400 then proceeds to step 402.

Referring back now to step 419, if the WDU 340 determines that the launch quota of the TMD 322 has not been satisfied, or that no launch quota is specified for the TMD 322, then the method 400 proceeds to step 420.

At step 420, the WDU 340 determines whether affinity rules of the current TMD or throttle mode parameters prohibit the current TMD from being assigned to the current SM. If, at step 420, the WDU 340 determines that affinity rules of the current TMD or throttle mode parameters prohibit the current TMD from being assigned to the current SM, then the method 400 proceeds to step 428, as described above. Otherwise, at step 424, the WDU 340 adds the current TMD to a task list that corresponds to the current SM.

At step 426, the WDU 340 determines whether additional TMDs 322 are included in the task table 345. If, at step 426, the WDU 340 determines that additional TMDs 322 are included in the task table 345, then the method 400 proceeds to step 428, described above. In this way, each TMD 322 included in the task table 345 is compared against the current SM to determine which TMD 322 is most eligible to be assigned to the current SM, as described below at step 434.

If, however, at step 426 the WDU 340 determines that no additional TMDs 322 are included in the task table 345, then all of the TMDs 322 have been compared against the current SM, and the method 400 accordingly proceeds to step 430. At step 430, the WDU 340 performs a primary sort of the task list based on the execution priority value associated with each TMD 322 included in the task list. At step 432, the WDU 340 performs a secondary sort of the task list based on a timestamp value associated with each TMD 322 included in the task list, where the timestamp value represents the time at which the TMD 322 was inserted into the task table 345. In one embodiment, the timestamp values are maintained in state 304 or may be included as a column within the task table 345.
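The primary and secondary sorts of steps 430 and 432 can be sketched with a single composite sort key. The `Task` record and its field names are hypothetical; the sketch assumes, consistent with the priority encoding described earlier, that a lower priority number denotes a higher execution priority and that a smaller timestamp denotes an older TMD.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical stand-in for one entry in the per-SM task list.
    name: str
    priority: int    # lower number = higher execution priority
    timestamp: int   # time of insertion into the task table


def sort_task_list(task_list):
    """Order by priority first, then by age (oldest first) within a priority."""
    return sorted(task_list, key=lambda t: (t.priority, t.timestamp))


tasks = [Task("a", 2, 5), Task("b", 1, 9), Task("c", 1, 3)]
ordered = sort_task_list(tasks)
print([t.name for t in ordered])  # prints ['c', 'b', 'a']
```

Under these assumptions, the first element of the sorted list is the TMD with the highest priority and the oldest timestamp.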

In some embodiments, instead of timestamps, the WDU 340 maintains a sorted list of the slots included in the task table 345, where entries are inserted into or deleted from the list each time a new task is allocated or deallocated, respectively. The sorted list of slots thus remains organized and is re-sorted only when a task is allocated or deleted, so that the oldest TMD 322 with the highest priority value can be readily identified and assigned to the current SM, as described below at step 434.
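A sorted slot list of this kind can be sketched with Python's standard `bisect` module. The class `SortedSlots` and its methods are hypothetical names, and the arrival counter is an assumption used here to preserve age order within one priority level without storing timestamps.

```python
import bisect


class SortedSlots:
    """Keeps task-table slots ordered by (priority, arrival order)."""

    def __init__(self):
        self._entries = []  # sorted tuples: (priority, arrival, slot)
        self._arrival = 0   # monotonically increasing insertion counter

    def allocate(self, slot, priority):
        # Insert while keeping the list sorted; no full re-sort is needed.
        bisect.insort(self._entries, (priority, self._arrival, slot))
        self._arrival += 1

    def deallocate(self, slot):
        self._entries = [e for e in self._entries if e[2] != slot]

    def head(self):
        # Slot of the oldest task at the highest priority, or None if empty.
        return self._entries[0][2] if self._entries else None


slots = SortedSlots()
slots.allocate(slot=7, priority=3)
slots.allocate(slot=4, priority=1)
slots.allocate(slot=9, priority=1)
```

Because the list is updated only on allocation and deallocation, reading the head entry requires no sorting at assignment time.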

At step 434, the WDU 340 assigns to the current SM the TMD 322 with the highest priority value and the oldest timestamp. In one embodiment, the current SM has state associated with it that is set by the WDU 340 and stored in SM state 342 when the TMD 322 is assigned to the current SM at step 434. Thereafter, the WDU 340 modifies the state as CTAs that correspond to the TMD 322 assigned to the current SM are executed on the current SM, as described in detail below in conjunction with FIG. 5. In one embodiment, the state includes several properties, including "TASK-ASSIGN," which indicates whether or not an eligible TMD is assigned to the current SM. The state may also include a "Status_sync" property, which indicates whether the WDU 340 is waiting to issue a TMD 322 state update to the current SM, or whether the WDU 340 is waiting for the current SM to acknowledge a state update, as described in further detail below at step 438. The state may also include a "CTA_launch" property, which indicates that the current SM is ready to receive and execute a CTA from the TMD 322 assigned at step 434 (subject to the current SM having the capacity to accept and execute the CTA). Other state may be used to derive a CTA availability value, described below in conjunction with FIG. 5, for the current SM, which represents the number of additional CTAs that the WDU 340 could immediately launch on the current SM (i.e., before the WDU 340 receives any further CTA completion messages from the current SM).

At step 436, the WDU 340 determines whether a TMD 322 other than the current TMD was previously assigned to the current SM. If, at step 436, the WDU 340 determines that a TMD 322 other than the current TMD was previously assigned to the current SM, then the method 400 proceeds to step 438, where the WDU 340 sends state data associated with the current TMD to the current SM. Otherwise, the method 400 proceeds to step 440.

At step 440, the WDU 340 determines whether additional SMs 310 are included in the plurality of SMs 310. If, at step 440, the WDU 340 determines that additional SMs 310 are included in the plurality of SMs 310, then the method 400 proceeds to step 442, where the WDU 340 sets a next SM 310 included in the plurality of SMs 310 as the current SM. If, however, at step 440 the WDU 340 determines that no additional SMs are included in the plurality of SMs, then the method 400 proceeds back to step 402, and the method 400 is repeated according to the techniques described herein.

Thus, at the end of the method 400, zero or more of the SMs 310 have been assigned a TMD 322, depending, e.g., on the state data of the TMDs 322, if any, included in the task table 345. In conjunction with continuously assigning different TMDs 322 to different SMs 310, the work distribution unit 340 is also configured to continuously select an SM to which a CTA should be issued from the TMD 322 assigned to that SM, as described below in conjunction with FIG. 5.

FIG. 5 illustrates a method 500 for selecting an SM 310 to receive work related to a task, according to one embodiment of the invention. Although the method steps are described in conjunction with the systems of FIGS. 1 through 3C, persons skilled in the art will understand that any system configured to perform the method steps, in any order, is within the scope of the invention.

As shown, the method 500 begins at step 502, where the WDU 340 receives, from each SM 310 included in the PPU 202, an indication of whether the SM 310 is eligible to receive a CTA from a TMD 322, if any, that is associated with the SM 310. In one embodiment, the indication is transmitted in the form of a "ready" status that is derived from the state associated with the SM 310 and stored in SM state 342 of FIG. 3A. In one example, the SM 310 is determined to be "ready" if a TMD 322 has been assigned to the SM 310 (e.g., according to the method steps 400 described above in conjunction with FIGS. 4A-4B) and the state associated with the TMD 322 has been sent to the SM 310 and acknowledged by the SM 310 (e.g., according to step 438 of the method 400). The SM 310 may also be determined to be enabled or disabled based on whether the WDU 340 is operating in the throttle mode described above in conjunction with FIGS. 4A-4B. The TMD 322 assigned to the SM 310 requires that the throttle mode described herein is active and that the task/work unit 207 is in fact operating in the throttle mode. The SM 310 may further be determined to be "ready" based on whether the TMD 322 assigned to the SM 310 satisfies any coalescing rules. For example, the TMD 322 assigned to the SM 310 may indicate that a minimum of eight outstanding work items must be included in, e.g., a work item queue associated with the TMD 322 before a CTA is issued to the SM 310. In addition, a coalescing timeout, as described above in conjunction with FIGS. 4A-4B, may be implemented to address situations in which the number of outstanding work items included in the TMD 322 is greater than zero but never exceeds the threshold minimum number of outstanding work items per CTA. When the coalescing timeout occurs, the SM 310 becomes eligible to receive a CTA from the TMD 322, provided that the additional eligibility requirements described in conjunction with step 502 are satisfied by the TMD 322 and/or the SM 310.

At step 506, the WDU 340 determines whether a load-balance mode or a round-robin mode is active. In one embodiment, the active mode is managed by a single bit value that is stored in state 304 of the task/work unit 207.

At step 508, the WDU 340 receives a CTA availability value from each of the eligible SMs 310. In one embodiment, the CTA availability value is a numerical value that indicates the total capacity that the SM 310 has to accept and execute additional CTAs. This number is calculated by each SM 310 and is based, e.g., on the number of CTAs currently being executed by the SM 310, the per-CTA resource requirements of the task most recently assigned to the SM 310, the total amount of free resources available to the SM 310, and the like.
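One plausible way to derive such a value is to take, over every resource a CTA needs, the number of CTAs that the free amount of that resource could accommodate, and then use the smallest of those counts. This formula, the function name `cta_availability`, and the resource names in the example are assumptions for illustration; the disclosure lists the inputs but does not give a formula.

```python
def cta_availability(free_resources, per_cta_requirements):
    """Number of additional CTAs that fit into the free resources.

    Both arguments are dicts keyed by a resource name. The result is
    limited by the scarcest resource. Returns 0 when no requirements
    are given, as a conservative default.
    """
    limits = [
        free_resources.get(name, 0) // need
        for name, need in per_cta_requirements.items()
        if need > 0
    ]
    return min(limits) if limits else 0


free = {"registers": 1000, "shared_memory_kib": 48, "cta_slots": 4}
need = {"registers": 256, "shared_memory_kib": 16, "cta_slots": 1}
value = cta_availability(free, need)  # limited to 3 by registers and shared memory
```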

At step 510, the WDU 340 sorts the eligible SMs 310 based on the CTA availability values. At step 512, the WDU 340 determines whether two or more SMs 310 share the same highest CTA availability value. If, at step 512, the WDU 340 determines that two or more SMs 310 share the same highest CTA availability value, then the method 500 proceeds to step 514, where the WDU 340 selects one of the two or more SMs 310 based on a fixed SM priority list. In one embodiment, the fixed SM priority list is included in state 304 of the task/work unit 207.

Referring back now to step 512, if the WDU 340 determines that two or more SMs 310 do not share the same highest CTA availability value, then the method 500 proceeds to step 516, where the WDU 340 selects the SM 310 with the highest CTA availability value.
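The load-balance selection of steps 510 through 516 can be sketched as follows. The function name `select_sm`, the dictionary representation of the availability values, and the fallback to the lowest SM identifier when a tied SM is absent from the priority list are assumptions for illustration.

```python
def select_sm(availability, fixed_priority):
    """Pick the eligible SM with the highest CTA availability value.

    `availability` maps SM id -> CTA availability value for eligible SMs.
    `fixed_priority` lists SM ids from most to least preferred and is
    consulted only when several SMs share the highest value.
    """
    if not availability:
        return None
    best = max(availability.values())
    tied = [sm for sm, value in availability.items() if value == best]
    if len(tied) == 1:
        return tied[0]              # unique maximum (step 516)
    for sm in fixed_priority:       # tie: use the fixed priority list (step 514)
        if sm in tied:
            return sm
    return min(tied)                # assumed fallback, not from the source


chosen = select_sm({0: 2, 1: 5, 2: 5}, fixed_priority=[2, 1, 0])
```

In this example SMs 1 and 2 tie at the highest value, and SM 2 is chosen because it appears first in the assumed priority list.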

At step 518, the WDU 340 issues, to the selected SM 310, a CTA of the TMD 322 that is assigned to the selected SM 310. The method 500 then proceeds back to step 502, where the method steps 500 are repeated such that the WDU 340 continuously issues CTAs to one or more SMs 310 as long as there is at least one TMD 322 that is assigned to the one or more SMs 310 and that includes work that has not yet been executed by any SM 310.

Referring back to step 506, if the WDU 340 determines that the active mode of the task/work unit 207 indicates a round-robin mode, then the method 500 proceeds to step 520. At step 520, the WDU 340 selects a numerically next SM 310 from the eligible SMs 310 determined at step 502. In one embodiment, the WDU 340 maintains in state 304 an identification value of the last SM to which a CTA was issued. In this way, the WDU 340 can implement a round-robin technique by continuously issuing a CTA to the SM with the numerically next SM identification value and by updating the identification value in state 304 accordingly.
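The round-robin selection of step 520 can be sketched as follows. The function name `next_sm_round_robin` is hypothetical, and wrapping around to the lowest eligible identifier after the highest one is an assumption about how "numerically next" behaves at the end of the range.

```python
def next_sm_round_robin(eligible_sm_ids, last_sm_id):
    """Return the eligible SM id that numerically follows `last_sm_id`.

    Wraps around to the lowest eligible id when no higher id exists.
    `last_sm_id` may be None if no CTA has been issued yet.
    """
    if not eligible_sm_ids:
        return None
    ordered = sorted(eligible_sm_ids)
    if last_sm_id is not None:
        for sm_id in ordered:
            if sm_id > last_sm_id:
                return sm_id
    return ordered[0]


# The caller stores the returned id as the new "last SM" value.
last = None
sequence = []
for _ in range(4):
    last = next_sm_round_robin([0, 2, 3], last)
    sequence.append(last)
```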

One embodiment of the invention may be implemented as a program product for use with a computer system. The program or programs of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g., read-only memory devices within a computer, such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips, or any other type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or a hard-disk drive, or any other type of solid-state random-access semiconductor memory) on which alterable information is stored.

The invention has been described above with reference to specific embodiments. Persons skilled in the art will understand, however, that various modifications and changes may be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims

A computer-implemented method for selecting a first processor included in a plurality of processors to receive work related to a compute task, the method comprising: analyzing state data of each processor in the plurality of processors to identify one or more processors to which a compute task has already been assigned and which are eligible to receive work related to the compute task; receiving, from each of the one or more processors identified as eligible, an availability value indicating the capacity of the processor to receive new work; selecting a first processor to receive work related to the compute task based on the availability values received from the one or more processors; and issuing, to the first processor via a cooperative thread array (CTA), the work related to the compute task.

The computer-implemented method of claim 1, wherein a processor is identified as eligible if the state data associated with the compute task has been received and acknowledged by the processor.

The computer-implemented method of claim 1, wherein a processor is identified as eligible if the compute task is associated with a number of outstanding work items that is greater than or equal to a threshold number of work items per thread indicated by the compute task.

The computer-implemented method of claim 1, wherein a processor is identified as eligible when a timeout period has elapsed and a number of outstanding work items associated with the compute task does not exceed a threshold number of work items per thread indicated by the compute task.

The computer-implemented method of claim 1, wherein a processor is identified as eligible if the compute task indicates that a throttle mode should be activated and the plurality of processors is operating in the throttle mode, wherein, in the throttle mode, the first processor is included in a limited subset of the plurality of processors, and wherein each processor within the limited subset is permitted to access a first portion of memory that is larger than a second portion of memory that is normally available to each processor in the plurality of processors when processing compute tasks in a non-throttle mode.

A computer-implemented method for assigning a compute task to a first processor included in a plurality of processors, the method comprising: analyzing each compute task in a plurality of compute tasks to identify one or more compute tasks eligible for assignment to the first processor, each compute task being listed in a first table and being associated with a priority value and an allocation order that indicates a time at which the compute task was added to the first table; selecting a first compute task from the identified one or more compute tasks based on the priority value and/or the allocation order; and assigning the first compute task to the first processor for execution.

The computer-implemented method of claim 6, wherein a compute task is identified as eligible if a deallocation request associated with the compute task has not been issued.

The computer-implemented method of claim 6, wherein a compute task is identified as eligible when the compute task comprises work that has not yet been issued to any of the processors in the plurality of processors via a cooperative thread array (CTA).

The computer-implemented method of claim 6, wherein a compute task is identified as eligible when the compute task needs to be processed in a throttle mode, wherein, in the throttle mode, the first processor is included in a limited subset of the plurality of processors, and wherein each processor in the limited subset is permitted to access a first portion of memory that is larger than a second portion of memory that is normally available to each processor in the plurality of processors when processing compute tasks in a non-throttle mode.

The computer-implemented method of claim 6, wherein a compute task is identified as eligible when the compute task requires that only one CTA be executed at a given time and no CTAs associated with the compute task are currently being executed by any of the processors in the plurality of processors.

The computer-implemented method of claim 6, wherein a compute task is identified as eligible if affinity rules associated with the compute task do not prevent any of the CTAs associated with the compute task from being executed by the first processor.

The computer-implemented method of claim 6, wherein a compute task is identified as eligible when a number of executed CTAs associated with the compute task has not reached a threshold.
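As an informal illustration of the task-selection step recited in the method for assigning a compute task, the following sketch picks one entry from a table using a priority value and an allocation order. It is only one possible reading: the names `TaskEntry` and `select_task` are hypothetical, the `eligible` flag stands in for whichever eligibility tests apply, a larger priority value is assumed to mean a more urgent task, and the allocation order is assumed to break ties in favor of the task added to the table earliest.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TaskEntry:
    """Hypothetical table entry for one compute task."""
    name: str
    priority: int          # assumption: larger value = more urgent
    allocation_order: int  # smaller value = added to the table earlier
    eligible: bool = True  # result of the eligibility analysis


def select_task(table: Iterable[TaskEntry]) -> Optional[TaskEntry]:
    """Pick the eligible task with the highest priority.

    Ties are broken by the earliest allocation order. Returns None when
    no task in the table is eligible.
    """
    candidates = [task for task in table if task.eligible]
    if not candidates:
        return None
    # Negate priority so that min() prefers the largest priority first,
    # then the smallest allocation order.
    return min(candidates, key=lambda t: (-t.priority, t.allocation_order))
```

With two eligible tasks of equal priority, the one with the smaller `allocation_order` is returned. An ineligible task is never selected, whatever its priority.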