DE102013018380A1

DE102013018380A1 - System for compilation or run-time execution of a data-parallel program on a single-instruction multiple-thread (SIMT) processor, having a thread designator connected to a partitioner and designating a thread from the master group to run as the master thread

Info

Publication number: DE102013018380A1
Application number: DE201310018380
Authority: DE
Inventors: Yuan Lin; Gautam Chakrabarti; Jaydeep Marathe; Okwan Kwon; Amit Sabne
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2012-11-05
Filing date: 2013-11-04
Publication date: 2014-05-08

Abstract

The system (2) has a partitioner (202) configured to partition groups into a master group and a worker group. A thread designator (204) is connected to the partitioner. The thread designator designates one thread from the master group to run as the master thread and all of the threads in the worker group to run as worker threads. The thread designator is connected to a thread scheduler (206). A function processor (208) is coupled to the thread designator for generating and processing an entry function as a non-entry function.

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to U.S. Provisional Application Serial No. 61/722,661, filed by Lin et al. on November 5, 2012, entitled "EXECUTING SEQUENTIAL CODE USING A GROUP OF THREADS," and to U.S. Application Serial No. 13/724,359, filed by Lin et al. on December 21, 2012, entitled "SYSTEM AND METHOD FOR COMPILING OR RUNTIME EXECUTING A FORK-JOIN DATA PARALLEL PROGRAM WITH FUNCTION CALLS ON A SINGLE-INSTRUCTION-MULTIPLE-THREAD PROCESSOR," both of which are commonly assigned with this application and incorporated herein by reference.

TECHNICAL FIELD

This application relates generally to parallel processors and, more particularly, to a system and method for compiling or runtime executing a fork-join data-parallel program with function calls on a single-instruction multiple-thread (SIMT) processor.

BACKGROUND

As those skilled in the art know, applications or programs may be executed in parallel to improve their performance. Data-parallel programs carry out the same process concurrently on different data. Task-parallel programs carry out different processes concurrently on the same data. Static parallel programs have a degree of parallelism that can be determined before they execute. In contrast, the parallelism achievable by dynamic parallel programs can only be determined as they execute. Whether a program is data-parallel or task-parallel, static or dynamic parallel, it may be executed in a pipeline, which is often the case for graphics programs.

A SIMT processor is particularly adept at executing data-parallel programs. A control unit in the SIMT processor creates groups of threads and schedules them for execution, during which all threads in a group execute the same instruction concurrently. In one particular processor, each group, or "warp," has 32 threads, corresponding to 32 execution pipelines, or lanes, in the SIMT processor.

A fork-join data-parallel program begins with a main program that has only a single thread. At this stage the program is in a sequential phase, or region. At some point during execution of the main program, the main, or "master," thread encounters a sequence of parallel phases, or regions. Each parallel region has an independent data set and can be executed by multiple threads concurrently. The number of concurrent tasks in each parallel region is determined when the parallel region starts and does not change during the parallel region. When a parallel region is encountered, the master thread forks a group of threads (called worker threads) to execute the parallel region in parallel. The program then enters the parallel region. If a worker thread encounters a new parallel region, the new parallel region is serialized, i.e., it is executed by the encountering worker thread itself. The master thread waits until the parallel region finishes. On exiting the parallel region, the worker threads join with the master thread, which then resumes execution of the main program, at which point the program enters a sequential region again.
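The fork, join, and nested-region serialization described above can be illustrated with a minimal Python sketch. It is not code from this disclosure: the helper `parallel_for`, the thread-local flag `_ctx.in_parallel`, and the example program in `run_example` are hypothetical names chosen for illustration, and host threads from `concurrent.futures` stand in for the worker threads of a SIMT processor.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Per-thread flag (hypothetical): is this thread already inside a parallel region?
_ctx = threading.local()

def parallel_for(n, body, num_workers=4):
    """Run body(i) for i in range(n) as one fork-join parallel region."""
    if getattr(_ctx, "in_parallel", False):
        # Encountered by a worker thread: the nested region is serialized,
        # i.e. executed by the encountering thread itself.
        for i in range(n):
            body(i)
        return

    def worker(i):
        _ctx.in_parallel = True
        try:
            body(i)
        finally:
            _ctx.in_parallel = False

    # Fork: the master hands the iterations to worker threads. Leaving the
    # `with` block waits for all of them, which plays the role of the join.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(worker, range(n)))

def run_example():
    """Hypothetical program: an outer region whose body opens a nested region."""
    results = [0] * 4
    foreign_threads = [set() for _ in range(4)]

    def outer_body(i):
        me = threading.get_ident()

        def inner_body(j):
            foreign_threads[i].add(threading.get_ident())
            results[i] += j

        parallel_for(3, inner_body)       # nested region: runs on this worker only
        foreign_threads[i].discard(me)    # stays empty if the region was serialized

    parallel_for(4, outer_body)           # fork, then join
    return results, foreign_threads
```

Calling `run_example()` returns the per-iteration sums together with, for each outer iteration, the set of threads other than the encountering worker that executed the nested region. Under the serialization rule every such set is empty.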

Table 1 below gives an example of a fork-join data-parallel program.

Table 1 - An example of a fork-join data-parallel program

For purposes of understanding Table 1 and the remainder of this disclosure, the terms "foo" and "bar" are arbitrary names of functions. Any function can therefore be substituted for "foo" or "bar."

The fork-join data-parallel model is commonly used in parallel programming. For example, the OpenMP standard adopts this model as its basic thread execution model. The OpenACC standard uses this model for the worker threads in a group referred to as a "gang."

SUMMARY

One aspect provides a system for compiling or runtime executing a fork-join data-parallel program with function calls. In one embodiment, the system includes: (1) a partitioner configured to partition warps into a master warp and at least one worker warp, and (2) a thread designator coupled to the partitioner and configured to designate only a single thread from the master warp for execution and all threads in the at least one worker warp for execution.

Another aspect provides a method of compiling or runtime executing a fork-join data-parallel program with function calls. In one embodiment, the method includes: (1) partitioning warps into a master warp and at least one worker warp, and (2) designating only a single thread from the master warp for execution and all threads in the at least one worker warp for execution.
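The two steps of the method can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the claimed implementation: the function names `partition_warps` and `designate_threads` are hypothetical, the first warp in the list is assumed to be the master warp, and lane 0 is assumed to be the single designated thread of that warp.

```python
def partition_warps(warp_ids):
    """Step (1): split warp IDs into one master warp and the worker warps.

    Assumption: the first warp in the list acts as the master warp.
    """
    if len(warp_ids) < 2:
        raise ValueError("need one master warp and at least one worker warp")
    return warp_ids[0], list(warp_ids[1:])

def designate_threads(warp_ids, warp_size=32):
    """Step (2): return {warp_id: lanes designated for execution}."""
    master, workers = partition_warps(warp_ids)
    designated = {master: [0]}  # only a single thread of the master warp (lane 0 assumed)
    for w in workers:
        designated[w] = list(range(warp_size))  # every thread of each worker warp
    return designated
```

With three warps of four lanes each, `designate_threads([0, 1, 2], warp_size=4)` designates lane 0 of warp 0 and all four lanes of warps 1 and 2.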

BRIEF DESCRIPTION

Reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a block diagram of a SIMT processor configured to contain or carry out a system or method for compiling or runtime executing a fork-join data-parallel program with function calls;

FIG. 2 is a block diagram of one embodiment of a system for compiling or runtime executing a fork-join data-parallel program with function calls; and

FIG. 3 is a flow diagram of one embodiment of a method of compiling or runtime executing a fork-join data-parallel program with function calls.

DETAILED DESCRIPTION

In a SIMT processor, multiple threads of execution are separated into groups. All threads in a group execute the same instruction at the same time. In graphics processing units (GPUs) commercially available from Nvidia Corporation of Santa Clara, California, which are one type of SIMT processor, groups are called "warps," and they are executed in blocks.

A pipeline control unit of the SIMT processor creates, manages, schedules, and executes groups, and provides a mechanism to synchronize them. Nvidia GPUs provide a bar.sync instruction to synchronize groups. Nvidia GPUs also support execution of a "divergent" conditional branch by a group: some threads of the group need to take the branch (because the branch condition predicate evaluates to "true"), and the other threads need to fall through to the next instruction (because the branch condition predicate evaluates to "false"). The pipeline control unit keeps track of the active threads in the group. It first executes one of the paths (branch taken or branch not taken) and then the other path, enabling the appropriate threads for each path.
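The handling of a divergent branch can be illustrated with a small Python simulation of a single group. It is a sketch, not a model of any particular hardware: the function name `run_divergent_branch` is hypothetical, lanes are simulated one after another, and the choice to execute the taken path first is an assumption, since the description only says that one path is executed before the other.

```python
def run_divergent_branch(values, predicate, taken, not_taken):
    """Simulate one group executing `if predicate(x): taken(x) else: not_taken(x)`.

    Each element of `values` belongs to one lane. Returns the per-lane results
    and a trace of which lanes were enabled on each pass.
    """
    lanes = range(len(values))
    taken_mask = [predicate(values[i]) for i in lanes]
    out = list(values)
    trace = []

    # Pass 1: only lanes whose predicate evaluated to true are enabled.
    active = [i for i in lanes if taken_mask[i]]
    if active:
        trace.append(("taken", active))
        for i in active:
            out[i] = taken(values[i])

    # Pass 2: the remaining lanes are enabled for the fall-through path.
    active = [i for i in lanes if not taken_mask[i]]
    if active:
        trace.append(("not_taken", active))
        for i in active:
            out[i] = not_taken(values[i])

    return out, trace
```

For a four-lane group holding `[1, 2, 3, 4]` and the predicate "value is even", the trace shows lanes 1 and 3 enabled on the first pass and lanes 0 and 2 on the second. Both paths cost execution time even though each lane follows only one of them.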

It is realized herein that, while all threads within a GPU thread block start at the same program address, the pipeline control unit would benefit from a software mechanism that partitions the threads into a master thread and worker threads and schedules them so that they can execute in the fork-join model.

It is further realized herein that certain embodiments of the software mechanism should manage and synchronize the threads in groups, because the pipeline control unit manages the threads in groups.

It is further realized herein that, because the main program is a single-thread program in the fork-join model, certain embodiments of the software mechanism should achieve sequential-region semantics without introducing side effects. Examples of instructions that may cause side effects are those that use shared resources, such as a shared-memory read or write, or any operation that may invoke a shared exception handler (e.g., division).
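A toy Python example may help show why sequential-region semantics matter when every thread of a group starts at the same address. It is purely illustrative: `run_sequential_region` and the `counter` entry are hypothetical, a dictionary stands in for shared memory, and lanes are simulated one after another.

```python
def run_sequential_region(active_lanes, shared):
    """Execute a one-statement 'sequential region' on every active lane.

    The statement increments a value in shared memory, i.e. it has a side
    effect that is visible outside the executing thread.
    """
    for _lane in active_lanes:
        shared["counter"] += 1
    return shared["counter"]

# All 32 lanes of a group run the sequential code: the write happens 32 times.
all_lanes = run_sequential_region(range(32), {"counter": 0})

# Only the master thread is active: the write happens once, as in a
# single-thread program.
master_only = run_sequential_region([0], {"counter": 0})
```

Here `all_lanes` is 32 and `master_only` is 1. Only the second result matches what a single-thread main program would produce.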

It is further realized herein that certain embodiments of the software mechanism should support functions that can be called both within a sequential region and within a parallel region, and that such functions may themselves contain parallel constructs. It is further realized herein that certain embodiments of the software mechanism should support function calls that can fork and join parallel regions.

It is still further realized herein that certain embodiments of the software mechanism should support foreign functions, that is, functions that are not compiled by the same compiler as the program. Examples are math functions in existing GPU math libraries and system functions such as malloc, free and print. In certain embodiments, both the master thread in the sequential regions and the worker threads in the parallel region should be able to call a foreign function.

Accordingly, described herein are various embodiments of a system and method for compiling and executing fork-join data-parallel programs with function calls on a SIMT processor, such as a GPU.

Before describing certain embodiments of the system and method, a SIMT processor configured to contain or carry out a system or method for compiling or runtime executing fork-join data-parallel programs with function calls will be described.

FIG. 1 is a block diagram of a SIMT processor 100. The SIMT processor 100 includes multiple thread processors, or cores 106, organized into thread groups 104, or "warps." The SIMT processor 100 contains J thread groups 104-1 through 104-J, each having K cores 106-1 through 106-K. In certain embodiments, the thread groups 104-1 through 104-J may be further organized into one or more thread blocks 102. Certain embodiments include thirty-two cores 106 per thread group 104. Other embodiments may include fewer than four cores in a thread group and as many as several tens of thousands. Certain embodiments organize the cores 106 into a single thread group 104, while other embodiments have hundreds or even thousands of thread groups 104. Alternative embodiments of the SIMT processor 100 may organize the cores 106 into thread groups 104 only, omitting the thread block level of organization.

The SIMT processor 100 further includes a pipeline control unit 108, a block-shared memory 110 and an array of local memories 112-1 through 112-J associated with the thread groups 104-1 through 104-J. The pipeline control unit 108 distributes tasks to the various thread groups 104-1 through 104-J over a data bus 114. The cores 106 within a thread group 104-j execute in parallel with each other. The thread groups 104-1 through 104-J communicate with the block-shared memory 110 over a memory bus 116. The thread groups 104-1 through 104-J respectively communicate with the local memories 112-1 through 112-J over local buses 118-1 through 118-J. For example, thread group 104-J uses its local memory 112-J by communicating over local bus 118-J. Certain embodiments of the SIMT processor 100 allocate a shared portion of the block-shared memory 110 to each thread block 102 and allow all thread groups 104 within a thread block 102 to access shared portions of the block-shared memory 110. Certain embodiments include thread groups 104 that use only the local memory 112. Many other embodiments include thread groups 104 that balance use of the local memory 112 and the block-shared memory 110.

The embodiment of FIG. 1 includes a master thread group 104-1. Each of the remaining thread groups 104-2 through 104-J is considered a "worker" thread group. The master thread group 104-1 includes numerous cores, one of which is a master core 106-1, which ultimately executes a master thread. Programs executed on the SIMT processor 100 are structured as a sequence of kernels. Typically, each kernel completes execution before the next kernel begins. In certain embodiments, the SIMT processor 100 may execute multiple kernels in parallel, depending on the size of the kernels. Each kernel is organized as a hierarchy of threads to be executed on the cores 106.

Having described a SIMT processor within which the system or method introduced herein may be contained or carried out, various embodiments of the system and method will now be described.

One embodiment of the system introduced herein includes a compiler and a device runtime library. The device runtime library implements the thread and group management functionality. The compiler translates a fork-join data-parallel program into a master thread program and a set of outlined functions, each corresponding to a parallel construct. The translated code makes calls to functions in the device runtime library to perform thread and group management.

Table 2 below shows an example program used to illustrate the compiler translation and the device runtime implementation.

Table 2 - Example program illustrating the compiler translation and the device runtime implementation

The flow of the main() program of Table 2 begins with the single master thread. The master thread calls the function foo(), which has a body that is visible to and compiled by this compiler. The master thread then calls a function ext(), an external, or foreign, function whose body is invisible to this compiler. Calls to foreign functions are translated as-is, without any special handling by the compiler. The master thread then encounters the first parallel region. Worker threads execute the parallel region while the master thread waits for them to finish. Within the parallel region, each worker thread calls the functions foo() and bar(). The function bar() contains another parallel region. Because bar() is already being executed inside a parallel region, the parallel region inside bar() is executed sequentially by each worker thread.

After the first parallel region, the master thread encounters a second parallel region. Within the second parallel region, each worker thread calls the foreign function ext(). After the second parallel region, the master thread calls the function bar(). Within bar(), the master thread encounters a third parallel region, which is again executed by the worker threads.

The function main() is known as an entry function because it is the point at which the program starts. Functions such as foo() and bar() are non-entry functions.

For an entry function, the compiler first makes a cloned copy named main_core(). The cloned copy is then processed as a non-entry function, as described below. For the main() function, the compiler generates code such as that shown in Table 3 below, where groupID() returns the ID of the thread group that contains the thread executing the statement, and threadID() returns the ID of the thread. init(), signal_done() and scheduler() are functions in the device runtime library.

Table 3 - Compiler-generated example code

When a GPU thread block starts, all threads within the block execute main(), but they take different paths. Thread 0 is the parent thread and executes init(), main_core() and signal_done(). The other threads in group 0 go straight to the end of the main() function and wait there. The threads in the remaining groups execute scheduler().
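The three paths through main() can be illustrated with a small host-side model. This is only a sketch in Python, not the code of Table 3: the group size of 32 and the helper names group_id() and entry_path() are assumptions made for illustration, and the returned lists merely name the runtime calls that the text attributes to each kind of thread.

```python
GROUP_SIZE = 32  # assumed number of threads per thread group; not stated in the text


def group_id(thread_id):
    """Hypothetical counterpart of groupID(): index of the group holding the thread."""
    return thread_id // GROUP_SIZE


def entry_path(thread_id):
    """Return the calls a thread makes inside the compiler-generated main()."""
    if group_id(thread_id) == 0:
        if thread_id == 0:
            # Thread 0 is the parent thread.
            return ["init", "main_core", "signal_done"]
        # Other threads of group 0 skip to the end of main() and wait there.
        return []
    # Threads of all remaining groups become worker threads.
    return ["scheduler"]
```

For example, entry_path(0) yields the parent thread's call sequence, entry_path(5) yields an empty list for an idle thread of group 0, and entry_path(40) yields the single scheduler call of a worker thread.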

For a non-entry function such as foo(), bar() and main_core(), the compiler translates the code unchanged if it contains no parallel construct. If a non-entry function contains a parallel construct, then for each parallel construct the compiler creates a function containing the body of the parallel construct (an outlined function) and then generates a conditional branch that checks whether the executing thread is the parent thread. In the false branch, the compiler inserts code that executes the loop. In the true branch, the compiler inserts calls into the device runtime library to assign the task, wake the worker threads and execute a barrier. The condition is true when the non-entry function is called outside the parallel regions. The condition is false when the non-entry function is called within a parallel region, in which case the parallel loop is executed sequentially by the executing thread.

For example, the translated code for the function bar() is shown in Table 4 below.

Table 4 - Translated code for the function bar()

signal_task() and barrier() are functions in the device runtime library. bar_par_func() is the outlined function that corresponds to the parallel construct in the original function bar().
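The shape of such a translated function can be sketched as follows. This is a hedged Python illustration, not the content of Table 4: the runtime functions are replaced by stubs that only record that they were called, the loop body and the parameters is_parent, n and out are invented for the example, and it is an assumption that signal_task() both assigns the task and wakes the worker threads.

```python
calls = []  # records the runtime-library calls made by the stubs below


def signal_task(func):
    """Stub: assign the task (assumed to also wake the worker threads)."""
    calls.append(("signal_task", func.__name__))


def barrier():
    """Stub: wait until the worker threads have finished the task."""
    calls.append(("barrier",))


def bar_par_func(i, out):
    """Outlined function: the body of the parallel loop (placeholder body)."""
    out.append(i * i)


def bar(is_parent, n, out):
    """Translated bar(): one conditional branch per parallel construct."""
    if is_parent:
        # True branch: called outside a parallel region by the parent thread.
        signal_task(bar_par_func)
        barrier()
    else:
        # False branch: called inside a parallel region, so the executing
        # thread runs the loop sequentially.
        for i in range(n):
            bar_par_func(i, out)
```

Calling bar(False, 3, out) fills out with the loop results and makes no runtime calls, whereas bar(True, 3, out) only records the signal_task and barrier calls and leaves the loop to the worker threads.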

In this embodiment, the device runtime library includes, among others, the following functions: init(), scheduler(), signal_task(), signal_done() and barrier(). The library also implements the following functions for internal use: signal(), wait() and fetch_task().

All worker threads execute the scheduler() function. The worker threads go through a sleep-wake-execute cycle until they are instructed to exit.

Table 5 - Example code using an exit flag

A Boolean variable 'exit_flag' is placed in block-shared memory and can be accessed by all threads within the thread block. The parent thread uses it to tell the worker threads whether they should all stop executing. The 'exit_flag' is set to false in the init() function and to true in the signal_done() function. Both functions are called by the parent thread.

Table 6 - Example code for changing the state of the exit flag

Another piece of block-shared memory is used to communicate the current task. The current task is set by the parent thread in the signal_task() function and is fetched by the worker threads in the fetch_task() function. The block-shared memory holds the pointer to the outlined function that corresponds to the parallel construct.

Table 7 - Example code using a pointer to identify a current task

Because the parallel regions are executed one after another within a thread block, only a single task is active at any time. If the parallel regions could be executed asynchronously, a more complex data structure such as a stack, a queue or a tree would typically be needed to store the active tasks.
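Since only one task is active at a time, a single slot next to the exit flag is enough to hold the shared state. The following Python sketch models that state under stated assumptions: the class BlockSharedMemory and the explicit shm parameter are hypothetical, and the waking of worker threads that the real runtime functions would also perform is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BlockSharedMemory:
    """Hypothetical stand-in for the block-shared variables."""
    exit_flag: bool = False
    # Single slot: pointer to the outlined function of the active parallel region.
    current_task: Optional[Callable] = None


def init(shm):
    """Called by the parent thread at start-up: clear the exit flag."""
    shm.exit_flag = False


def signal_done(shm):
    """Called by the parent thread at the end: ask the workers to exit."""
    shm.exit_flag = True


def signal_task(shm, task):
    """Called by the parent thread: publish the current task."""
    shm.current_task = task


def fetch_task(shm):
    """Called by a worker thread: read the current task."""
    return shm.current_task
```

If parallel regions could overlap, current_task would have to become a container such as a queue, which is the more complex structure the text mentions.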

The barrier(), signal() and wait() functions are implemented using a hardware barrier.

Table 8 - Example functions for barrier(), signal() and wait()

FIG. 2 is a block diagram of one embodiment of a system 200 for compiling or runtime executing a fork-join data-parallel program with function calls. A program 210 contains entry functions 212, non-entry functions 214 and external functions 216. The system 200 comprises a partition unit 202, a thread allocation unit 204, a thread scheduler 206, a function processor 208, a device runtime library 218 and the SIMT processor 100 of FIG. 1.

The SIMT processor 100 comprises the pipeline control unit 108, the data bus 114, the local buses 118-1 and 118-2 and the shared memory 110 of FIG. 1. In the embodiment of FIG. 2, the SIMT processor 100 is shown with a single thread block containing two thread groups: the parent thread group 104-1 and the worker thread group 104-2. The thread groups 104-1 and 104-2 each contain threads 106.

The partition unit 202 designates the thread group 104-1 as the parent thread group and the remaining thread groups as worker thread groups. In the embodiment of FIG. 2, a single worker thread group 104-2 is shown. Alternative embodiments can use many worker thread groups. The thread allocation unit 204 designates one parent thread 106-1 of the parent thread group 104-1. All other threads in the parent thread group 104-1 are idle. The thread allocation unit 204 also designates each of the threads 106 in the worker thread group 104-2 as a worker thread.

The thread scheduler 206 translates the program 210 so that the pipeline control unit 108 appropriately controls the execution of the parent thread 106-1 and the various worker threads in the worker thread group 104-2. The thread scheduler 206 translates the program 210 so that an exit flag is reset when execution of the parent thread begins. The thread scheduler 206 schedules the parent thread 106-1 to execute until a parallel region or the end of the program 210 is reached. When a parallel region of the program 210 is reached, the thread scheduler 206 sets a parallel task and the worker threads in the worker thread group 104-2 begin executing. The thread scheduler 206 also sets a barrier for each of the worker threads such that, when the barrier is entered, the parent thread 106-1 resumes execution. When the end of the program 210 is reached, the exit flag is set, causing all worker threads to stop executing.

The function processor 208 operates on the functions of the program 210. Processing the entry functions 212 involves creating a cloned copy of an entry function, which is then processed as a non-entry function. The original entry function is processed so that the parent thread 106-1 executes the cloned copy along with other calls, and the worker threads run cycles of sleeping, waking, fetching and executing the parallel task set by the thread scheduler 206.

The function processor 208 translates the non-entry functions 214 in two ways. If a non-entry function contains no parallel construct, the function is simply processed as is. If a parallel construct is present, an outlined function containing the body of the parallel construct is created. The function processor 208 then creates a conditional branch that either executes the parallel construct sequentially or uses the device runtime library 218 to assign a task, wake the worker threads and execute a barrier, as described above. The wake and sleep functions are implemented using hardware barrier functions of the device runtime library 218. Threads at a barrier are not scheduled for execution by the hardware, so they waste no execution cycles. In the parent thread group 104-1, only the parent thread 106-1 participates in the barriers. This is possible because the hardware barrier is group-based: a group is considered to be at a barrier as soon as any thread within the group is at the barrier.

Similar to the processing of non-entry functions that have no parallel construct, external functions 216 are processed by the function processor 208 as is.

A fork-join data-parallel program is divided into a parent program and a set of parallel tasks. The parent program is the program executed by the parent thread. A parallel task corresponds to a parallel region executed by the worker threads. The parent program contains scheduling points at which the parent thread assigns a parallel task, wakes the worker threads and waits for the worker threads to finish.

The designated parent thread in the designated parent group executes the sequential region of the program.

Alternatively, single-thread behavior can be emulated in the sequential region while all threads in the group execute the code. However, emulation schemes are limited in performance and in the complexity of code generation, which makes them less useful. The necessary predication and synchronization add execution overhead. In addition, all functions called from both the sequential region and the parallel region must be cloned and handled differently.

Given this partitioning of threads and groups, the worker threads and the parent thread go through the following life cycles. One embodiment of a worker thread goes through the following stages in a life cycle:

1) the thread block starts;
2) sleep until woken by the parent thread;
3) exit if the exit flag is set to true;
4) fetch and execute the task assigned by the parent thread;
5) enter a barrier; and
6) return to stage 2.

One embodiment of a parent thread goes through the following stages in a life cycle:

1) the thread block starts;
2) set the exit flag to false;
3) execute the parent program until a parallel region or the end of the parent program is reached;
4) at the beginning of a parallel region:
a. set a parallel task,
b. wake the worker threads,
c. enter a barrier, and
d. resume the parent program (stage 3); and
5) at the end of the parent program:
a. set the exit flag to true,
b. wake the worker threads, and
c. exit.
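The worker and parent life cycles can be tried out on a host with ordinary Python threads. The sketch below is an illustration under assumptions, not the device code: threading.Barrier stands in for the hardware barrier, a dictionary stands in for block-shared memory, each task is a Python callable taking a worker index, and the name run_block() and its parameters are hypothetical. Sleeping, waking and the end-of-region barrier are all modeled as arrivals at the same reusable barrier.

```python
import threading


def run_block(tasks, num_workers=4):
    """Run each task as one parallel region and return the workers' results."""
    hw_barrier = threading.Barrier(num_workers + 1)  # parent plus all workers
    shared = {"exit_flag": False, "task": None}      # models block-shared memory
    results = []
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            hw_barrier.wait()                 # sleep until woken by the parent
            if shared["exit_flag"]:           # exit when the flag is true
                return
            value = shared["task"](worker_id)  # fetch and execute the task
            with lock:
                results.append(value)
            hw_barrier.wait()                 # enter the barrier, then loop

    def parent():
        shared["exit_flag"] = False           # set the exit flag to false
        for task in tasks:                    # each task is a parallel region
            shared["task"] = task             # set the parallel task
            hw_barrier.wait()                 # wake the worker threads
            hw_barrier.wait()                 # wait until the workers are done
        shared["exit_flag"] = True            # end of the parent program
        hw_barrier.wait()                     # wake the workers so they exit

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    parent()
    for t in threads:
        t.join()
    return results
```

Calling run_block([lambda i: i * i]) runs one parallel region on four workers and returns the values 0, 1, 4 and 9 in an order that depends on thread scheduling. Because the parent only writes the shared state between two barrier arrivals, the workers never observe a half-updated task or exit flag.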

The other threads in the parent group essentially idle while waiting for the end of the program. The program is executed alternately by the parent thread and the worker threads. This results in better instruction cache utilization than a method in which the parent thread and the worker threads are active at the same time and execute different code paths.

FIG. 3 is a flow diagram of one embodiment of a method for compiling or runtime executing a fork-join data-parallel program with function calls. The method begins in a start step 310. In a step 320, the thread groups within a thread block are partitioned into a parent thread group and at least one worker thread group. In a step 330, a single thread from the parent thread group is designated as the parent thread. The remaining threads of the parent group essentially idle during execution. Also in step 330, all threads in the at least one worker thread group are designated as worker threads. The method ends in an end step 340.
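Steps 320 and 330 amount to a simple bookkeeping exercise, sketched below in Python. The function name partition_block(), the default group size of 32 and the choice of the first group and its first thread as parent group and parent thread are assumptions for illustration; the text only requires one parent group, one parent thread within it, and at least one worker group.

```python
def partition_block(num_threads, group_size=32):
    """Partition a thread block into groups and designate thread roles."""
    # Step 320: split the block into groups; the first one is the parent group.
    groups = [list(range(start, min(start + group_size, num_threads)))
              for start in range(0, num_threads, group_size)]
    if len(groups) < 2:
        raise ValueError("at least one worker group is required")
    parent_group, worker_groups = groups[0], groups[1:]

    # Step 330: one parent thread, the rest of its group idles,
    # and every thread of every worker group is a worker thread.
    return {
        "parent_thread": parent_group[0],
        "idle_threads": parent_group[1:],
        "worker_threads": [t for group in worker_groups for t in group],
    }
```

For a block of 96 threads this yields thread 0 as the parent thread, threads 1 to 31 as idle threads and threads 32 to 95 as worker threads.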

Those skilled in the art to which this application relates will appreciate that other and further additions, deletions, substitutions and modifications can be made to the described embodiments.

Claims

A system for compiling or runtime executing a fork-join data-parallel program with function calls, comprising: a partition unit configured to partition groups into a parent group and at least one worker group; and a thread allocation unit connected to the partition unit and configured to designate only one thread from the parent group for execution as a parent thread and to designate all threads in the at least one worker group for execution as worker threads.

The system of claim 1, further comprising a thread scheduler coupled to the thread allocation unit and configured to cause a pipeline control unit of a single-instruction multiple-thread processor to schedule the execution of the parent thread as follows: an exit flag is set to a first state when the parent thread begins execution; the parent thread executes until it reaches a parallel region or an end of the program; upon reaching the parallel region, a parallel task is set, a first barrier is entered, a second barrier is entered, and execution of the parent thread resumes; and upon reaching the end, the exit flag is set to a second state, a barrier is entered, and the parent thread stops executing.

The system of claim 1 or 2, further comprising a thread scheduler coupled to the thread allocation unit and configured to cause a pipeline control unit of a single-instruction multiple-thread processor to schedule the execution of the worker threads as follows: the worker threads enter a first barrier; the worker threads begin executing a parallel task set by the parent thread when the parent thread reaches a parallel region and the parent thread enters a barrier; the worker threads enter a second barrier upon completing the parallel task; and the worker threads exit when an exit flag is set to a second state and the parent thread enters a barrier.

The system of claim 1, further comprising a thread scheduler coupled to the thread allocation unit and configured to use a barrier function of a pipeline control unit of a single-instruction multiple-thread processor to control execution and termination of the worker threads.

The system of any one of claims 1-4, further comprising a function processor coupled to the thread allocation unit and configured to create a cloned copy of an entry function and to process the entry function as a non-entry function.

The system of any one of claims 1-5, further comprising a function processor coupled to the thread allocation unit and configured to translate a non-entry function having a parallel construct by: creating a function containing a body of the parallel construct; and inserting calls into a device runtime library when the function is executed in the parent thread.

The system of any one of claims 1-6, further comprising a function processor coupled to the thread allocation unit and configured to translate calls to external functions as is.

The system of any one of claims 1-7, further comprising a function processor coupled to the thread allocation unit and configured to use a device runtime library that provides functions that can be called from compiled user code, as well as internal functions.

The system of any one of claims 1-8, further comprising a function processor coupled to the thread allocation unit and configured to use an exit flag that is stored in a shared memory and is used by the parent thread to tell the worker threads when to stop executing.

The system of any one of claims 1-9, wherein the system is configured to use a shared memory of a single-instruction multiple-thread processor to identify a current task.